REGULARIZED GAUSSIAN DISCRIMINANT ANALYSIS
THROUGH EIGENVALUE DECOMPOSITION

by

Halima Bensmail
Gilles Celeux

TECHNICAL REPORT No. 278
August 1994

Department of Statistics, GN-22
University of Washington
Seattle, Washington 98195 USA
Regularized Gaussian Discriminant Analysis through Eigenvalue Decomposition

Halima Bensmail
Universite Paris 6

Gilles Celeux
INRIA Rhone-Alpes
Abstract
Friedman (1989) has proposed a regularization technique (RDA) for discriminant analysis in the Gaussian framework. RDA makes use of two regularization parameters to design an intermediate classification rule between linear and quadratic discriminant analysis. In this paper, we propose an alternative approach to designing classification rules that also occupy a median position between linear and quadratic discriminant analysis. Our approach is based on the reparametrization of the covariance matrix Sigma_k of a group Gk in terms of its eigenvalue decomposition, Sigma_k = lambda_k Dk Ak Dk', where lambda_k specifies the volume of Gk, Ak its shape, and Dk its orientation. Variations on constraints concerning lambda_k, Ak and Dk lead to 14 discrimination models of interest. For each model, we derive the maximum likelihood parameter estimates, and our approach consists of selecting among the 14 possible models the one minimizing the sample-based estimate of future misclassification risk by cross-validation. Numerical experiments show favorable behavior of this approach as compared to RDA.
Keywords: Gaussian Classification, Regularization, Eigenvalue Decomposition, Maximum Likelihood.
1 Introduction
Discriminant analysis aims at designing a classification rule for assigning a vector-valued observation x to one of K groups G1, ..., GK. Let pk denote the a priori probability of belonging to Gk and fk the group conditional density of x (1 <= k <= K). Discriminant analysis methods differ essentially in their assumptions on the group conditional densities fk. The most often applied model, linear discriminant analysis (LDA), assumes that the group conditional distributions are d-variate normal distributions with mean vectors mu_k and identical variance matrix Sigma. When the variance matrices are not assumed to be equal, the model is called quadratic discriminant analysis (QDA). The parameters mu_k and Sigma_k are usually unknown and must be estimated from a training set consisting of (x_i, z_i), i = 1, ..., n, where x_i is the vector-valued measurement and z_i is the group of subject i. The parameters are generally chosen to maximize the likelihood of the training sample. This leads to the plug-in estimates

mu_k^ = xbar_k = (1/nk) sum_{i: z_i = k} x_i,   k = 1, ..., K,   (1.2)

where nk = #{i : z_i = k}. And, for LDA,

Sigma^ = (1/n) sum_{k=1}^{K} sum_{i: z_i = k} (x_i - xbar_k)(x_i - xbar_k)',   (1.3)
or, for QDA,

Sigma_k^ = Sk = (1/nk) sum_{i: z_i = k} (x_i - xbar_k)(x_i - xbar_k)',   k = 1, ..., K.   (1.4)
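As an illustration, the plug-in estimates (1.2)-(1.4) can be computed directly. The following Python sketch is ours, not part of the report, and the function name is hypothetical; it returns the group means together with the LDA and QDA variance estimates.

```python
import numpy as np

def plugin_estimates(X, z, K):
    """Plug-in ML estimates: group means (1.2), the common LDA
    variance matrix (1.3), and the per-group QDA matrices (1.4)."""
    n, d = X.shape
    means, qda_covs = [], []
    W = np.zeros((d, d))                      # pooled within scattering
    for k in range(K):
        Xk = X[z == k]
        nk = Xk.shape[0]
        xbar = Xk.mean(axis=0)                # (1.2)
        Wk = (Xk - xbar).T @ (Xk - xbar)      # scattering matrix of group k
        means.append(xbar)
        qda_covs.append(Wk / nk)              # (1.4)
        W += Wk
    return np.array(means), W / n, qda_covs   # means, (1.3), (1.4)
```

Note that the LDA estimate is the sample-size weighted average of the QDA estimates, which the code makes explicit through the pooled scattering matrix W.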
Intermediate classification rules between LDA and QDA have also been investigated; in particular, Flury et al. (1994) considered models imposing constraints on the group covariance matrices.
Regularization became an important subject of investigation in discriminant analysis since in many cases the size n of the training data set is small with regard to the number d of variables (see McLachlan 1992), and standard methods such as QDA or even LDA can behave disappointingly in such cases. Generally, regularization techniques for discriminant analysis make use of real-valued regularization parameters. For instance, one of the most employed regularization techniques, the Regularized Discriminant Analysis (RDA) of Friedman (1989), specifies the value of a complexity parameter and of a shrinkage parameter to design an intermediate classification rule between linear and quadratic discriminant analysis. RDA performs well but does not provide easily interpretable classification rules.
In this paper, we propose an alternative approach to designing regularized classification rules. Following the parsimonious Gaussian models of Celeux and Govaert (Pattern Recognition, to appear), we consider the reparametrization of the variance matrix Sigma_k of group Gk in terms of its eigenvalue decomposition,

Sigma_k = lambda_k Dk Ak Dk',   (1.5)

where lambda_k = |Sigma_k|^(1/d) specifies the volume of Gk, the diagonal matrix Ak of its normalized eigenvalues (|Ak| = 1, diagonal terms in decreasing order) its shape, and the matrix Dk of its eigenvectors its orientation.

Variations on assumptions on the parameters lambda_k, Ak and Dk (1 <= k <= K) lead to 8 models of interest. For instance, we can assume different volumes and equal shapes and orientations by requiring that Ak = A (A unknown) and Dk = D (D unknown) for k = 1, ..., K; we denote this model [lambda_k DAD']. With this convention, writing [lambda Dk A Dk'] means the discriminant model with equal volumes, equal shapes and different orientations.
Two other families of situations are of interest. The first one consists in assuming that the variance matrices are diagonal matrices. In the considered parametrization, it means that the orientation matrices Dk are permutation matrices. Since, in such a case, variations on the shape matrices do not seem to be of any interest, we write Sigma_k = lambda_k Bk, where Bk is a diagonal matrix with |Bk| = 1. This particular parametrization gives rise to 4 models ([lambda B], [lambda_k B], [lambda Bk] and [lambda_k Bk]). The second family of models consists in shrinking discriminant models by assuming spherical shapes, namely Ak = I, I denoting the identity matrix. In such a case, two parsimonious models are in competition: [lambda I] and [lambda_k I]. Finally, we get 14 different discriminant models.
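To make the parametrization concrete, here is a small Python sketch (ours, not the report's; function names are illustrative) that recovers (lambda, A, D) from a covariance matrix and rebuilds it:

```python
import numpy as np

def volume_shape_orientation(Sigma):
    """Decompose Sigma = lambda * D A D' with |A| = 1 and the diagonal
    of A in decreasing order; lambda = |Sigma|^(1/d) is the volume."""
    eigvals, D = np.linalg.eigh(Sigma)       # ascending eigenvalues
    eigvals, D = eigvals[::-1], D[:, ::-1]   # put them in decreasing order
    d = Sigma.shape[0]
    lam = float(np.prod(eigvals)) ** (1.0 / d)
    A = np.diag(eigvals / lam)               # normalized shape, det(A) = 1
    return lam, A, D

def rebuild(lam, A, D):
    return lam * D @ A @ D.T
```

Constraining any of the three factors to be equal across groups, or fixing A = I or D to a permutation, produces the 14 models listed in the text.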
The method that we propose, which we call EDRDA (Eigenvalue Decomposition Regularized Discriminant Analysis), consists in selecting, among the 14 above-mentioned m.l. estimated models, the one which minimizes the sample-based estimate of future misclassification risk by cross-validation.
Remark 1: The main motivation of EDRDA is to provide a regularized classification rule that is easily interpreted, since it can be analyzed from the volumes, the shapes and the orientations of the groups.

Remark 2: Our selection procedure (the cross-validated error rate) has been shown to provide good performance for selecting models in discriminant analysis (e.g. Friedman 1989).

Remark 3: EDRDA generalizes the approach of Flury et al. (1994), which analyzed the performance of models [lambda DAD'], [lambda_k DAD'], [lambda_k D Ak D'] and [lambda_k Dk Ak Dk'] and suggested choosing among the different models with the cross-validated error rate.
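The selection step of EDRDA can be sketched in code. The following Python illustration is ours, not the report's; for brevity it compares only two of the 14 models, [lambda I] and the unconstrained QDA model [lambda_k Dk Ak Dk'], by their leave-one-out cross-validated error rates.

```python
import numpy as np

def gaussian_scores(x, means, covs, priors):
    """Log posterior scores of the Gaussian classification rule."""
    out = []
    for mu, S, p in zip(means, covs, priors):
        diff = x - mu
        _, logdet = np.linalg.slogdet(S)
        out.append(np.log(p) - 0.5 * logdet
                   - 0.5 * diff @ np.linalg.solve(S, diff))
    return out

def cov_spherical(X, z, K):   # model [lambda I]: one common spherical matrix
    d = X.shape[1]
    W = sum((X[z == k] - X[z == k].mean(0)).T @ (X[z == k] - X[z == k].mean(0))
            for k in range(K))
    return [np.trace(W) / (len(X) * d) * np.eye(d)] * K

def cov_qda(X, z, K):         # model [lambda_k Dk Ak Dk']: free matrices
    covs = []
    for k in range(K):
        Xk = X[z == k]
        xb = Xk.mean(0)
        covs.append((Xk - xb).T @ (Xk - xb) / len(Xk))
    return covs

def loo_error(X, z, K, cov_fitter):
    """Leave-one-out cross-validated error rate of the plug-in rule."""
    n, errors = len(X), 0
    for i in range(n):
        mask = np.arange(n) != i
        Xt, zt = X[mask], z[mask]
        means = [Xt[zt == k].mean(0) for k in range(K)]
        priors = [np.mean(zt == k) for k in range(K)]
        covs = cov_fitter(Xt, zt, K)
        errors += int(np.argmax(gaussian_scores(X[i], means, covs, priors))) != z[i]
    return errors / n

def select_model(X, z, K, fitters):
    """EDRDA-style choice: keep the model with the smallest CV error rate."""
    return min(fitters, key=lambda name: loo_error(X, z, K, fitters[name]))
```

The full method would supply one covariance fitter per model and break ties in favor of the more parsimonious model.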
The paper is organized as follows. Section 2 briefly recalls the RDA of Friedman (1989). Section 3 presents the maximum likelihood estimation of the 14 models. Section 4 compares EDRDA with RDA on the basis of Monte Carlo simulations, and a discussion section ends the paper.
2 Regularized Discriminant Analysis
The regularized discriminant analysis of Friedman (1989) makes use of a complexity parameter alpha and of a shrinkage parameter gamma in the following way. RDA replaces the plug-in estimator (1.4) of Sigma_k with

Sigma_k^(alpha, gamma) = (1 - gamma) Sigma_k^(alpha) + (gamma/d) tr[Sigma_k^(alpha)] I,   (2.6)

where

Sigma_k^(alpha) = [(1 - alpha) nk Sigma_k^ + alpha n Sigma^] / [(1 - alpha) nk + alpha n],   (2.7)

Sigma_k^ and Sigma^ being the plug-in estimates (1.4) and (1.3). Thus alpha (0 <= alpha <= 1) controls the amount by which the Sigma_k^ are shrunk towards the pooled estimate Sigma^, while gamma (0 <= gamma <= 1) controls the shrinkage of the eigenvalues towards equality, as tr[Sigma_k^(alpha)]/d is equal to the average of the eigenvalues of Sigma_k^(alpha).
The parameters in (2.6) and (2.7) are chosen to jointly minimize the cross-validated error rate. Friedman proceeds in the following way. A grid of candidate (alpha, gamma) pairs is first selected on the unit square. Cross-validation is then employed to obtain a (nearly) unbiased estimate of the overall error rate for the discriminant rule associated with each (alpha, gamma) pair on the grid. Then, RDA chooses the point (alpha, gamma) with the smallest estimated error rate.

A characteristic of the RDA approach, pointed out by Rayens and Greene (1991), is that the optimal value of the cross-validated error rate rarely occurs at a single point of the grid, but over a large range of values of (alpha, gamma). The RDA procedure resolves ties by selecting first the points with the largest value of alpha (parsimony principle), and then the point with the largest value of gamma (shrinking principle).

RDA provides a fairly rich class of regularization alternatives. Holding gamma fixed at 0 and increasing alpha produces models between QDA and LDA, while holding alpha fixed at 0 and increasing gamma attempts to unbias the sample-based eigenvalue estimates. Holding alpha fixed at 1 and increasing gamma gives rise to the ridge-regression analog for LDA. The experiments in Friedman (1989) show that RDA performs well in many circumstances as compared with LDA and QDA.
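For concreteness, here is a Python sketch of the RDA estimates (2.6)-(2.7), written by us from the formulas above; Wk denotes the group scattering matrix nk Sigma_k^ and W their sum.

```python
import numpy as np

def rda_covariances(Wks, nks, alpha, gamma):
    """Regularized estimates: (2.7) shrinks each group estimate towards the
    pooled one, then (2.6) shrinks the eigenvalues towards their average."""
    d = Wks[0].shape[0]
    W, n = sum(Wks), sum(nks)
    covs = []
    for Wk, nk in zip(Wks, nks):
        Sk_alpha = ((1 - alpha) * Wk + alpha * W) / ((1 - alpha) * nk + alpha * n)
        covs.append((1 - gamma) * Sk_alpha
                    + (gamma / d) * np.trace(Sk_alpha) * np.eye(d))
    return covs
```

Setting alpha = 1, gamma = 0 recovers the pooled LDA matrix for every group, while gamma = 1 forces every estimate to be spherical.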
3 Maximum likelihood estimation of the models
Table 1: Some characteristics of the 14 models. We have a = Kd + K - 1 and beta = d(d+1)/2. CF means that the m.l. estimates are closed form; IP means that the m.l. estimation needs an iterative procedure.

Model                      Number of parameters        m.l. estimation
[lambda DAD']              a + beta                    CF
[lambda_k DAD']            a + beta + K - 1            IP
[lambda D Ak D']           a + beta + (K-1)(d-1)       IP
[lambda_k D Ak D']         a + beta + (K-1)d           IP
[lambda Dk A Dk']          a + K beta - (K-1)d         CF
[lambda_k Dk A Dk']        a + K beta - (K-1)(d-1)     IP
[lambda Dk Ak Dk']         a + K beta - (K-1)          CF
[lambda_k Dk Ak Dk']       a + K beta                  CF
[lambda B]                 a + d                       CF
[lambda_k B]               a + d + K - 1               IP
[lambda Bk]                a + Kd - K + 1              CF
[lambda_k Bk]              a + Kd                      CF
[lambda I]                 a + 1                       CF
[lambda_k I]               a + K                       CF
The second column of Table 1 gives the number of parameters to be estimated. The third column indicates if the m.l. estimates can be achieved with closed form formulas (CF) or if there is the need to make use of an iterative procedure (IP). For each model, the estimate of the group mean vectors (mu_k, k = 1, ..., K) is

mu_k^ = xbar_k = (1/nk) sum_{i: z_i = k} x_i,   (3.8)

where nk = #Gk in the learning sample.
The estimation of the variance matrices of the groups depends on the model at hand. In some cases, it leads to closed form formulas, but most of the time it is necessary to use an iterative procedure. In what follows, we denote by W the within scattering matrix,

W = sum_{k=1}^{K} Wk,   (3.9)

Wk denoting the scattering matrix of group Gk,

Wk = sum_{i: z_i = k} (x_i - xbar_k)(x_i - xbar_k)',   k = 1, ..., K.   (3.10)
Model [lambda DAD']. This is the classical linear discriminant analysis model. This model is obtained with alpha = 1 and gamma = 0 in the RDA scheme. The common variance matrix Sigma is estimated by

Sigma^ = W/n.
Model [lambda_k DAD']. In this situation, it is convenient to write Sigma_k = lambda_k C with C = DAD'. This model has been considered and called the proportional covariance matrices model by Flury (1988). The estimation of the lambda_k's and C needs an iterative procedure.

• As the matrix C is kept fixed, the lambda_k's are solutions of the equations (1 <= k <= K)

lambda_k = tr(Wk C^-1) / (d nk).

• As the volumes lambda_k's are kept fixed, the matrix C maximizing the likelihood is

C = (sum_{k=1}^{K} Wk/lambda_k) / |sum_{k=1}^{K} Wk/lambda_k|^(1/d).
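The two updates above can be iterated directly; a minimal Python sketch (ours), with the normalization |C| = 1 enforced at each step:

```python
import numpy as np

def fit_proportional(Wks, nks, n_iter=100):
    """ML for [lambda_k D A D']: Sigma_k = lambda_k C with |C| = 1.
    Alternates lambda_k = tr(Wk C^-1)/(d nk) and C proportional
    to sum_k Wk/lambda_k, renormalized to unit determinant."""
    d = Wks[0].shape[0]
    C = np.eye(d)
    for _ in range(n_iter):
        Cinv = np.linalg.inv(C)
        lams = [np.trace(Wk @ Cinv) / (d * nk) for Wk, nk in zip(Wks, nks)]
        M = sum(Wk / lk for Wk, lk in zip(Wks, lams))
        C = M / np.linalg.det(M) ** (1.0 / d)   # renormalize to |C| = 1
    return np.array(lams), C
```

When the true group matrices are exactly proportional, the ratio of the estimated volumes matches the true proportionality constant.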
Model [lambda D Ak D']. In this situation and in the next one, there is no interest in assuming that the terms of the diagonal matrices Ak are in decreasing order. Thus, for the models [lambda D Ak D'] and [lambda_k D Ak D'], we do not assume that the diagonal terms of Ak are in decreasing order. The m.l. estimates of lambda, D and Ak (k = 1, ..., K) are derived using an iterative method, described hereunder, and by a direct calculation of lambda.

• For fixed Ak's, D is updated pair of columns by pair of columns: each pair (d_l, d_m) of columns of D is replaced by (q1, q2), where q1 and q2 are two orthonormal vectors spanning the plane generated by (d_l, d_m), associated to the eigenvalues of a 2 x 2 matrix built from the Wk's and the current Ak's, in the manner of the algorithm of Flury and Gautschi (1986).

• For fixed D, the Ak's are computed directly.

The algorithm is stopped when it yields no increase of the likelihood.
Model [lambda_k D Ak D']. In this situation, it is convenient to write Sigma_k = D Omega_k D' with Omega_k = lambda_k Ak. This model has been considered and called the common principal components model by Flury (1984). The algorithm for deriving the m.l. estimates of D, Omega_1, ..., Omega_K is similar to the previous one:

• For fixed D, we get the Omega_k's directly.

• For fixed Omega_1, ..., Omega_K, D is obtained using the same procedure as described for model [lambda D Ak D'].
Model [lambda Dk A Dk']. Considering, for k = 1, ..., K, the eigenvalue decomposition Wk = Lk Omega_k Lk' of the symmetric positive definite matrix Wk, with the eigenvalues in the diagonal matrix Omega_k in decreasing order, we get

Dk^ = Lk,   k = 1, ..., K,

A^ = (sum_{k=1}^{K} Omega_k) / |sum_{k=1}^{K} Omega_k|^(1/d)

and

lambda^ = |sum_{k=1}^{K} Omega_k|^(1/d) / n.
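These closed-form estimates translate directly into code; a Python sketch (ours, with illustrative function names):

```python
import numpy as np

def fit_common_shape_volume(Wks, n):
    """Closed-form ML for [lambda Dk A Dk']: Dk from the eigenvectors of
    Wk (eigenvalues decreasing), then lambda * A = (sum_k Omega_k)/n."""
    d = Wks[0].shape[0]
    Ds, omega_sum = [], np.zeros(d)
    for Wk in Wks:
        vals, vecs = np.linalg.eigh(Wk)   # ascending eigenvalues
        Ds.append(vecs[:, ::-1])          # Dk, with eigenvalues decreasing
        omega_sum += vals[::-1]
    lam = float(np.prod(omega_sum)) ** (1.0 / d) / n
    A = np.diag(omega_sum / (n * lam))    # det(A) = 1 by construction
    return lam, A, Ds
```

The normalization splits the common diagonal matrix (sum_k Omega_k)/n into a unit-determinant shape A and a scalar volume lambda.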
Model [lambda_k Dk A Dk']. We use again the eigenvalue decomposition Wk = Lk Omega_k Lk', which gives Dk^ = Lk. The parameters lambda_k and A are solutions of the equations, to be solved iteratively,

A = (sum_{k=1}^{K} Omega_k/lambda_k) / |sum_{k=1}^{K} Omega_k/lambda_k|^(1/d)

and

lambda_k = tr(A^-1 Omega_k) / (d nk),   1 <= k <= K.
Model [lambda_k Dk Ak Dk']. This is the most general situation, corresponding to ordinary quadratic discriminant analysis. This model is obtained with alpha = 0 and gamma = 0 in the RDA scheme. The m.l. estimates of the variance matrices are

Sigma_k^ = Wk/nk,   k = 1, ..., K.
We now present the m.l. estimates for the models with diagonal variance matrices. For this more parsimonious family of models, the eigenvectors of Sigma_k (1 <= k <= K) are the vectors generating the basis associated to the d variables (Dk = Jk, a permutation matrix). If the Jk are equal, the variables are independent. If the Jk are different, the variables are independent conditionally to the groups to be classified.
Model [lambda B]. In this situation, we get

B^ = diag(W) / |diag(W)|^(1/d)

and

lambda^ = |diag(W)|^(1/d) / n.
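In code (a Python sketch, ours):

```python
import numpy as np

def fit_lambda_B(W, n):
    """Closed-form ML for [lambda B]: keep only the diagonal of W,
    normalize it to determinant one, and put the scale into lambda."""
    w = np.diag(W).astype(float)
    d = len(w)
    g = float(np.prod(w)) ** (1.0 / d)   # |diag(W)|^(1/d)
    return g / n, np.diag(w / g)         # (lambda, B) with det(B) = 1
```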
Model [lambda_k B]. In this situation, the m.l. estimates are derived from the following iterative procedure:

• As the matrix B is kept fixed, the lambda_k's are

lambda_k = tr(Wk B^-1) / (d nk),   1 <= k <= K.

• As the volumes lambda_k's are kept fixed, the matrix B is

B = diag(sum_{k=1}^{K} Wk/lambda_k) / |diag(sum_{k=1}^{K} Wk/lambda_k)|^(1/d).
Model [lambda Bk]. In this situation, we get

Bk^ = diag(Wk) / |diag(Wk)|^(1/d),   1 <= k <= K,

and

lambda^ = (sum_{k=1}^{K} |diag(Wk)|^(1/d)) / n.
Model [lambda_k Bk]. We have

Bk^ = diag(Wk) / |diag(Wk)|^(1/d)

and

lambda_k^ = |diag(Wk)|^(1/d) / nk,   1 <= k <= K.
We end with the models for which the variance matrices are spherical: Sigma_k = lambda_k I, I denoting the identity matrix. The volumes of the groups can be assumed equal ([lambda I]) or different ([lambda_k I]).
Model [lambda I]. This model is obtained with alpha = 1 and gamma = 1 in the RDA scheme. It has been called the nearest-means classifier by Friedman (1989). In this situation, we get

lambda^ = tr(W) / (nd).
Model [lambda_k I]. In this situation, we get

lambda_k^ = tr(Wk) / (nk d),   1 <= k <= K.
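Both spherical fits are one-liners; a Python sketch (ours):

```python
import numpy as np

def fit_spherical(Wks, nks, common=True):
    """Spherical models: [lambda I] gives the common volume tr(W)/(nd);
    [lambda_k I] gives per-group volumes tr(Wk)/(nk d)."""
    d = Wks[0].shape[0]
    if common:
        return np.trace(sum(Wks)) / (sum(nks) * d)
    return [np.trace(Wk) / (nk * d) for Wk, nk in zip(Wks, nks)]
```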
4 Numerical experiments
We now present Monte Carlo simulations to compare RDA and EDRDA. We essentially used the same simulation scheme as Friedman (1989). We call D1-D5 the simulated data structures, for dimensions d = 6 and d = 20 and sample size n = 40. These data structures correspond respectively to Tables 2-6 of Friedman's paper. Roughly speaking, D1 provides spherical groups with different volumes and means; D2 and D3 provide ellipsoidal groups with the same shapes and orientations, with poorly separated means for D2 and well separated means for D3; D4 and D5 provide unequal ellipsoidal groups with equal means for D4 and different means for D5. More precisely, the simulated distribution parameters were the following.
D1. The group mean vectors are mu_1 = (0, ..., 0), mu_2 = (0, 3, 0, ..., 0) and mu_3 = (0, 0, 3, 0, ..., 0), and the variance matrices are Sigma_1 = I, Sigma_2 = 2I and Sigma_3 = 3I.
D2 and D3. The group mean vectors are mu_1 = (0, ..., 0), mu_2 = (mu_21, ..., mu_2d) with components growing with j (poorly separated from mu_1 for D2, well separated for D3), and mu_3j = (-1)^j mu_2j, 1 <= j <= d. The groups share a common, highly ellipsoidal variance matrix.
Data sets D2 and D3 are related to models [lambda B] and [lambda DAD'].
The variance matrices for data sets D4 and D5 are identical. They are diagonal but different for each group. For the group G1, the general diagonal term is

a_1j = [9(j - 1)/(d - 1) + 1]^2,   1 <= j <= d.

For the group G2, it is

a_2j = [9(d - j)/(d - 1) + 1]^2,   1 <= j <= d.

And, for the group G3, it is

a_3j = [9(j - (d - 1)/2)/(d - 1)]^2,   1 <= j <= d.
The group mean vectors are equal for D4 and, for D5, are

mu_1 = (0, ..., 0),
mu_2j = 14/d,   j = 1, ..., d,
mu_3j = (-1)^j mu_2j,   j = 1, ..., d.
As in Friedman (1989), for each simulated data structure, we used an additional test sample of size 100 to obtain an estimate of the error rate of the compared classification rules. The experiments are summarized in Tables 2-8. Tables 2 and 3 give the mean error rates and (in parentheses) the standard deviations of the error rates. Table 4 displays the mean values of the complexity and the shrinkage parameters of RDA with their respective standard deviations in parentheses. Tables 5-8 give the frequencies of the selected models for EDRDA.
Table 2: Mean error rates on the test sample for RDA and EDRDA, using the parsimonious strategy (par) or the complex strategy (com), for the dimension d = 6.
Table 3: Mean error rates on the test sample for RDA and EDRDA, using the parsimonious strategy (par) or the complex strategy (com), for the dimension d = 20.

d = 20   error rate (RDA)   error rate (EDRDA):par   error rate (EDRDA):com
D1       0.10 (0.05)        0.05 (0.02)              0.06 (0.03)
D2       0.27 (0.07)        0.14 (0.03)              0.18 (0.03)
D3       0.14 (0.04)        0.05 (0.05)              0.09 (0.06)
D4       0.12 (0.05)        0.20 (0.06)              0.25 (0.04)
D5       0.06 (0.04)        0.03 (0.05)              0.06 (0.03)
Table 4: Mean values of the complexity parameter alpha and of the shrinking parameter gamma for RDA.
Table 5: Frequencies of the selected model for EDRDA using the parsimonious strategy (d = 6).

Model                  D1  D2  D3  D4  D5
[lambda DAD']           3   9   3   0   0
[lambda_k DAD']        65   2   1   0   0
[lambda D Ak D']        0   1   2   0   0
[lambda_k D Ak D']      2  11   2  39   9
[lambda Dk A Dk']       7   0   0   0   0
[lambda_k Dk A Dk']     4   0   0   6   5
[lambda Dk Ak Dk']      2   0   0   1   3
[lambda_k Dk Ak Dk']    0   0   0   0   0
[lambda I]             12   0  43   0   0
[lambda_k I]            2   0  11   0   0
[lambda B]              0  63  27   0   0
[lambda_k B]            2   9   8   1   1
[lambda Bk]             1  12   3  49  59
[lambda_k Bk]           0   0   0  11  23
Table 6: Frequencies of the selected model for EDRDA using the complex strategy(d = 6).
Table 7: Frequencies of the selected model for EDRDA using the parsimonious strategy (d = 20).

Model                  D1  D2  D3  D4  D5
[lambda DAD']           2   6   2   0   0
[lambda_k DAD']         4   0   2   0   0
[lambda D Ak D']        0   0   0   6   2
[lambda_k D Ak D']      0   0   0   0   0
[lambda Dk A Dk']       0   0   0   0   0
[lambda_k Dk A Dk']     0   0   0   0   0
[lambda Dk Ak Dk']      0   0   0   0   0
[lambda_k Dk Ak Dk']    0   0   0   0   0
[lambda I]             44   0  32   0   0
[lambda_k I]           36   0   0   0   0
[lambda B]              8  74  44   0   0
[lambda_k B]            6   6  12   0   2
[lambda Bk]             0   8   8  94  96
[lambda_k Bk]           0   6   0   0   0
Table 8: Frequencies of the selected model for EDRDA using the complex strategy(d = 20).
• Surprisingly, the complex (resp. parsimonious) strategy of EDRDA tends to provide smaller error rates for d = 6 (resp. d = 20). But, for d = 6, the advantage of the complex strategy is not so marked, and, on the contrary, the advantage of the parsimonious strategy can be important for d = 20.
• From Tables 5 and 7, it appears that the parsimonious strategy of EDRDA selected reasonable models among the 14 possible models. However, and not surprisingly, this strategy has a tendency to select too simple models, especially when the groups are well separated. In such cases, the criterion for selecting the model, namely the cross-validated error rate, can indicate that simpler models provide a quite performing classification rule (as for data set D5 with d = 6, where the model [lambda Bk] is preferred to the model [lambda_k Bk]).
• From Tables 6 and 8, it appears that the complex strategy can select reasonable models (see for instance the selected model for D5 with d = 20), but it can also give some disconcerting choices, such as the model [lambda_k D Ak D'] for the data set D2 with d = 6 or the models [lambda_k B] and [lambda Bk] for the data set D3 with d = 20.
• As a consequence, the parsimonious strategy can be preferred to the complex strategy: it often gives better error rates and, moreover, it provides more realistic or reliable models in most cases.
5 Discussion
We have proposed a regularization approach, EDRDA, for Gaussian discriminant analysis based on the eigenvalue decomposition of the group variance matrices. One of the main interests of this approach is to provide a clear classification rule. The reported numerical experiments show that EDRDA can be expected to perform as well as RDA while producing a more user-friendly classification rule. Moreover, in our opinion, the usefulness of EDRDA is not restricted to small sample size settings: it can provide quite performing classification rules where LDA and QDA give poor error rates.
References
Celeux, G. and Govaert, G. Gaussian parsimonious clustering models. Pattern Recognition, to appear.

Flury, B. W. (1984). Common principal components in k groups. Journal of the American Statistical Association, 79, 892-897.

Flury, B. W. (1988). Common Principal Components and Related Multivariate Models. New York: John Wiley.

Flury, B. W. and Gautschi, W. (1986). An algorithm for simultaneous orthogonal transformation of several positive definite symmetric matrices to nearly diagonal form. SIAM Journal on Scientific and Statistical Computing, 7, 169-184.

Flury, B. W., Schmid, M. J. and Narayanan, A. (1993). Error rates in quadratic discrimination with constraints on the covariance matrices. Journal of Classification, to appear.

Friedman, J. (1989). Regularized discriminant analysis. Journal of the American Statistical Association, 84, 165-175.

McLachlan, G. J. (1992). Discriminant Analysis and Statistical Pattern Recognition. New York: John Wiley.

Rayens, W. S. and Greene, T. (1991). Covariance pooling and stabilization for classification. Computational Statistics & Data Analysis, 11, 17-42.