
IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, VOL. 48, NO. 4, APRIL 2010 1851

Using Multiscale Spectra in Regularizing Covariance Matrices for Hyperspectral Image Classification

Are C. Jensen, Member, IEEE, Marco Loog, and Anne H. Schistad Solberg, Member, IEEE

Abstract—An important component in many supervised classifiers is the estimation of one or more covariance matrices, and the often low training-sample count in supervised hyperspectral image classification yields the need for strong regularization when estimating such matrices. Often, this regularization is accomplished through adding some kind of scaled regularization matrix, e.g., the identity matrix, to the sample covariance matrix. We introduce a framework for specifying and interpreting a broad range of such regularization matrices in the linear and quadratic discriminant analysis (LDA and QDA, respectively) classifier settings. A key component in the proposed framework is the relationship between regularization and linear dimensionality reduction. We show that the equivalent of the LDA or the QDA classifier in any linearly reduced subspace can be reached by using an appropriate regularization matrix. Furthermore, several such regularization matrices can be added together, forming more complex regularizers. We utilize this framework to build regularization matrices that incorporate multiscale spectral representations. Several realizations of such regularization matrices are discussed, and their performances when applied to QDA classifiers are tested on four hyperspectral data sets. Often, the classifiers benefit from using the proposed regularization matrices.

Index Terms—Covariance matrix regularization, dimensionality reduction, hyperspectral image classification, pattern recognition.

I. INTRODUCTION

IN SUPERVISED hyperspectral image classification, the high number of spectral bands is often a mixed blessing in that it yields the potential of highly discriminative classifiers, while the usually low sample count makes it necessary to severely restrain the complexity of the classifiers. Further adding to the problem is the high correlation between the spectral bands (features), a correlation that can reduce the overall amount of information available and makes interband covariance estimation critical for building efficient classifiers [1].

In various approaches to supervised statistical classification, a probability distribution for each class is estimated, and new pixels are assigned to the class giving the highest (possibly weighted) class-conditional probability. However, in the case of hyperspectral data, which have hundreds of features, even simple probability distributions like the Gaussian normal distribution have too many degrees of freedom and hence lead to estimates which generalize poorly. The estimates tend to overfit the data, and the need to either reduce the dimensionality, restrain the distributions further, or regularize becomes apparent.

Manuscript received January 16, 2009; revised July 9, 2009. First published January 19, 2010; current version published March 24, 2010. M. Loog received financial support from the Innovational Research Incentives Scheme of the Netherlands Research Organization (NWO, VENI Grant 639.021.611).

A. C. Jensen and A. H. S. Solberg are with the Department of Informatics, University of Oslo, 0316 Oslo, Norway.

M. Loog is with the Pattern Recognition Laboratory, Faculty of Electrical Engineering, Mathematics, and Computer Science, Delft University of Technology, 2628 Delft, The Netherlands.

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TGRS.2009.2036842

In this paper, we focus on classifiers based on the Gaussian distribution or, more specifically, on the regularization of its key component: the covariance matrix. One common approach for regularizing such matrices is to add some regularization matrix, e.g., a scaled identity matrix, to the sample covariance estimate. We develop a framework for specifying and interpreting such regularization matrices in linear and quadratic discriminant analysis (LDA and QDA, respectively) settings.

The introduced framework is based on bridging the gap between regularization and linear dimensionality reduction. The basic idea is to first define a linear subspace and then let the regularized QDA or LDA classifier approach the corresponding classifier found in this reduced subspace as the regularization is increased. That is, one can continuously go from the classifier found in the full dimension to the classifier one would get if one first did a linear dimensionality reduction. Additionally, several of these regularization matrices can be added together, forming more complex regularizers. As we will see, deriving the matrix estimates through maximizing a posterior distribution facilitates the interpretation of the latter through the combination of independent a priori components.

We use this interpretation of additive regularization matrices to construct regularization matrices based on information from spectra at multiple scales or, more accurately, we use the information on how to segment the spectra at different spectral resolutions in an attempt to build more suitable regularization matrices for hyperspectral data. The behavior of the classifiers using the derived regularizers is studied using real hyperspectral data sets.

Section II summarizes the basic concepts behind the QDA and LDA classifiers and gives a brief review of related work. Section III introduces the basic theory of how to interpret regularization matrices in light of dimensionality reduction, and in Section IV, we give an example of how this theory can be the basis for incorporating information from multiscale representations of the spectral curves when forming regularization matrices. In Section V, we show and discuss results from experiments on real hyperspectral image data before we give some concluding remarks in Section VI.

II. BACKGROUND

Before introducing the contribution of this paper in detail, we start by describing the QDA and LDA classifiers and how they are typically regularized, and we discuss some related work.



Let y be a column vector containing the band values of a single pixel. Then, when assuming that the data in each class follow a normal distribution, the discriminant functions minimizing the Bayes error rate become [2]

gc(y) = −(1/2) log|Σc| − (1/2)(y − μc)′ Σc⁻¹ (y − μc) + log πc    (1)

where μc, Σc⁻¹, and πc are the mean vector, the inverse covariance matrix (also called the precision matrix), and the a priori probability for class c, respectively. That is, a new sample (pixel) y∗ will be classified to the class c giving the highest value of gc(y∗).

Most often, neither the mean vectors nor the covariance matrices are available, and hence, they have to be estimated from the data. Throughout this paper, μc is estimated using the sample mean. The sample covariance matrix,

Σ̃c = (1/Nc) ∑_{i=1}^{Nc} (yci − μc)(yci − μc)′

where Nc is the number of samples in class c and yci is the class' ith training sample, is also the maximum-likelihood estimator when assuming normally distributed data, and by fitting their inverses into the discriminant functions of (1), we get what is often referred to as the (traditional) QDA classifier. The problem with these estimates is that they become unstable and quickly lose generalization performance as the number of samples involved in estimating them goes down.
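As a concrete illustration (ours, not code from the paper), the following NumPy sketch computes per-class sample statistics and evaluates the discriminant functions in (1); all function and variable names are our own.

```python
import numpy as np

def class_stats(X):
    """Sample mean and maximum-likelihood sample covariance of one class;
    X holds one training sample (pixel spectrum) per row, and the 1/N_c
    normalization matches the estimator given in the text."""
    mu = X.mean(axis=0)
    D = X - mu
    return mu, (D.T @ D) / X.shape[0]

def qda_discriminants(y, means, covs, priors):
    """Evaluate g_c(y) of (1) for every class and return the scores."""
    scores = []
    for mu, cov, pi in zip(means, covs, priors):
        _, logdet = np.linalg.slogdet(cov)          # assumes cov is positive definite
        diff = y - mu
        maha = diff @ np.linalg.solve(cov, diff)    # (y - mu)' cov^{-1} (y - mu)
        scores.append(-0.5 * logdet - 0.5 * maha + np.log(pi))
    return np.array(scores)                         # classify y to the argmax
```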

One way to palliate this problem is to assume that the classes all share the same covariance matrix, thereby having more samples available to estimate this one covariance matrix. By doing so, we end up with linear decision boundaries and a classifier known as the LDA classifier. The discriminant functions in (1) are, in this case, reduced to

gc(y) = −(1/2)(y − μc)′ Σ⁻¹ (y − μc) + log πc    (2)

where Σ⁻¹ is the common precision matrix for all the classes.

Friedman [3] proposes to use a linear combination of the per-class estimates and a pooled covariance matrix estimate, hence forming a mixture of the more complex QDA and the less complex LDA classifiers, in the hope of finding a suitable compromise matching the available amount of data. Furthermore, he adds a scaled identity matrix, hence dampening the high impact the low eigenvalues have on the discriminant functions. That is, he uses in (1) the estimates

Σ̂c = (1 − α)Σ̃c + α((1 − γ)Σ̃ + γσ²I)

where Σ̃c and Σ̃ are the class-specific and pooled sample covariance matrices, respectively, and α and γ are regularization parameters operating in the interval between zero and one (see [4] for more on this and similar regularization methods).
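For illustration only, a minimal sketch of this estimate is given below; the paper does not fix σ², so taking it as the average eigenvalue of the pooled estimate is our assumption.

```python
import numpy as np

def rda_covariance(cov_c, cov_pooled, alpha, gamma):
    """Friedman-style estimate:
    (1 - alpha)*cov_c + alpha*((1 - gamma)*cov_pooled + gamma*sigma2*I)."""
    p = cov_c.shape[0]
    sigma2 = np.trace(cov_pooled) / p   # assumed choice of sigma^2 (average eigenvalue)
    target = (1.0 - gamma) * cov_pooled + gamma * sigma2 * np.eye(p)
    return (1.0 - alpha) * cov_c + alpha * target
```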

Hastie et al. [5] focus on LDA and develop what they call penalized discriminant analysis (PDA) through recasting the LDA classifier in a penalized-regression (optimal scoring) setting, in which they penalize coefficients that are (in some sense) spatially varying. They show that this PDA classifier can also be achieved by regularizing the common sample covariance matrix by the same penalization matrix used in the regressions and classifying to the class with the lowest Mahalanobis distance, i.e., they use the estimate

Σ̂ = Σ̃ + λΩ,   λ > 0    (3)

where Σ̃ is the common sample covariance matrix and Ω is the matrix used to spatially penalize the regression coefficients, in the discriminant functions (2). They interpret the new covariance matrix estimate by noting that input vectors with highly varying coefficients (spatially) are given low values in the resulting Mahalanobis distances. By setting Ω equal to the identity matrix, we get what is sometimes referred to as a "ridge classifier" due to the analogy with ridge regression. Regularizing with an identity matrix ensures that small eigenvalues are not disproportionately inflated when calculating the precision matrix, and it also biases the probability distribution toward a sphere, thereby making the classifier act more like a nearest-mean classifier.
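A minimal sketch of estimate (3), ours rather than the authors': with Ω = I it reproduces the ridge classifier, and any other penalty matrix can be passed in its place.

```python
import numpy as np

def regularized_covariance(sample_cov, penalty, lam):
    """Estimate (3): sample covariance plus a scaled penalty matrix.
    With penalty = np.eye(p), every eigenvalue is raised by lam, so small
    eigenvalues are no longer disproportionately inflated on inversion."""
    return sample_cov + lam * penalty
```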

Other approaches for estimating large covariance matrices when the sample count is limited include different regression approaches [1], [6], [7], imposing band structure on the matrices [8], [9], and graphical lasso [10] which uses lasso regression on the inverse-covariance matrix coefficients directly. Descriptions of key approaches when applying regularization of covariance matrices in hyperspectral image settings can be found in [11]–[14].

III. INTERPRETING REGULARIZATION MATRICES

In this paper, we focus on matrix regularizations of the form

Σ̂c = Σ̃c + λR (4)

where λ > 0 and R is what we call the regularization matrix. Much of the theory presented in this paper is based on, and/or inspired by, the observation that by using appropriate R matrices in the equation above, one can make the resulting Gaussian classifiers behave as though one first had applied a linear feature reduction. In what follows, we start by defining more rigorously what we mean by this statement, before we prove it and give some interpretations of (4) in the context of maximizing a posterior probability.

Let V be a p × k matrix defining a linear feature reduction from p to k dimensions, i.e., z = V′y makes z contain the (new) k features, and let V⊥ have columns spanning the null-space of VV′. Now, if we use the regularization matrix R = V⊥V⊥′ in (4), the QDA (or LDA) classifier will, as λ → ∞, approach and become equal to the QDA (or LDA) classifier obtained after applying the dimensionality reduction transform V. This means that, letting gcλ be the full-dimensional discriminant function in (1) using the regularized covariance estimate in (4) and gck its k-dimensional analog,

gcik(y∗) > gcjk(y∗) ∀ cj ≠ ci   ⟺   lim_{λ→∞} gciλ(y∗) > lim_{λ→∞} gcjλ(y∗) ∀ cj ≠ ci


where y∗ is an arbitrary pixel value. The proof of this is one of the central contributions of this paper and can be found in Appendix I. In the Appendix, we actually show the stronger statement that lim_{λ→∞} (gciλ − gcjλ) = gcik − gcjk for any two classes ci and cj. Referring to the discriminant functions in (1), we see that for this to happen, both the differences in Mahalanobis distance and the ratios of the determinants of the two covariance matrix estimates must approach each other.

Note that V⊥V⊥′ is a matrix whose nonzero eigenvectors span the null-space of VV′, i.e., if V is orthogonal, then V⊥V⊥′ = I − VV′. As shown in Appendix II, any linear feature reduction can be made orthogonal without affecting the resulting LDA or QDA classifier, and henceforth, V is assumed orthogonal without loss of generality.

In our hyperspectral setting, by letting V be a linear feature reduction onto a "lower resolution" spectral subspace, we can make the full-dimensional/full-resolution classifier gradually become more like the classifier one would get in the low-resolution case.
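As a quick numerical sanity check (our own illustration, not part of the paper), the snippet below compares the Mahalanobis term computed with a heavily regularized covariance, using R = V⊥V⊥′, against the same term computed directly in the reduced space; the two agree up to terms of order 1/λ.

```python
import numpy as np

rng = np.random.default_rng(0)
p, k, lam = 6, 3, 1e8

# Random orthogonal feature reduction V (p x k) and its orthogonal complement.
Q, _ = np.linalg.qr(rng.standard_normal((p, p)))
V, V_perp = Q[:, :k], Q[:, k:]
R = V_perp @ V_perp.T                      # equals I - V V' since V is orthogonal

# A well-conditioned "sample covariance" and a test point / class mean.
A = rng.standard_normal((p, 2 * p))
cov = A @ A.T / (2 * p)
y, mu = rng.standard_normal(p), rng.standard_normal(p)

# Mahalanobis term with heavy regularization vs. directly in the reduced space.
d_full = (y - mu) @ np.linalg.solve(cov + lam * R, y - mu)
z, mz = V.T @ y, V.T @ mu
d_red = (z - mz) @ np.linalg.solve(V.T @ cov @ V, z - mz)
print(d_full, d_red)                       # nearly identical for large lambda
```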

Although we have proved algebraically that the regularized classifier approaches the classifier obtained in the linearly feature-reduced space, it can be useful to give an interpretation in the context of maximizing a posterior distribution, as is often done in the setting of the traditional ridge regression. This also elucidates compounding several regularization matrices, as we will do when adding information from spectra at multiple scales, and even paves the way for a new interpretation of the PDA.

A. Derivation Through the Mode of a Posterior Distribution

Let us assume that di ∼ N(yi′β, σ²) and β ∼ N(0, τ⁻¹I), where yi is our ith p-dimensional sample and I is the p × p identity matrix. Then, as is well known, maximizing the posterior probability

p(β|D) ∝ p(D|β)p(β)

where p(D|β) is the likelihood for observing our di's, will give the classic ridge solution, together with the "induced" covariance matrix estimate for y equal to (4) with R = I and λ ∝ τ [15]. If we, on the other hand, assume a prior distribution with a precision matrix equal to R = I − VV′, i.e., having a covariance matrix R∞ = R + γVV′, γ → ∞, making β ∼ N(0, τ⁻¹R∞), we would end up with a solution where (4) can be construed as the covariance matrix estimate. Colloquially, we can say that β has an improper prior distribution that forces the solution into the subspace spanned by V.
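A sketch of the posterior mode under these assumptions (our formulation and naming): passing the identity as the prior precision gives ridge regression, while I − VV′ corresponds to the improper prior discussed above.

```python
import numpy as np

def map_coefficients(Y, d, prior_precision, tau, sigma2=1.0):
    """Posterior mode of beta when d_i ~ N(y_i' beta, sigma2) and beta has a
    (possibly improper) Gaussian prior with precision tau * prior_precision.
    Rows of Y are the samples y_i'. prior_precision = I gives ridge regression;
    I - V V' pushes the solution toward span(V) as tau grows."""
    A = Y.T @ Y / sigma2 + tau * prior_precision   # posterior precision
    b = Y.T @ d / sigma2
    return np.linalg.solve(A, b)
```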

B. Compound Matrices

We can also form more complex regularization matrices by weighting and adding together several simpler ones

R = ∑_{i=1}^{n} wi Ri    (5)

where each Ri = I − ViVi′ and Vi is a linear reduction transform matrix. In light of Section III-A, this can be interpreted as letting the prior distribution of β consist of multiple independent components, p(β) = p1(β)p2(β) · · · pn(β), or as each posterior again having a new prior distribution, i.e., a nested variant: p(β|D) ∝ (· · ·((p(D|β)p1(β))p2(β)) · · ·)pn(β).

Informally, we can say that by increasing the weight wi, we are "pulling" the resulting classifier toward the classifier one would get using the Vi dimensionality reduction transform.
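A minimal sketch of the compound regularizer in (5), assuming each Vi is given with orthonormal columns (function and argument names are ours):

```python
import numpy as np

def compound_regularizer(reductions, weights):
    """Weighted sum of simple regularizers, R = sum_i w_i (I - V_i V_i'),
    as in (5). Each element of `reductions` is a p x k_i matrix with
    orthonormal columns."""
    p = reductions[0].shape[0]
    R = np.zeros((p, p))
    for V, w in zip(reductions, weights):
        R += w * (np.eye(p) - V @ V.T)
    return R
```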

C. Interpreting the PDA Regularization Matrix

The PDA is, as we discussed in the introduction, developed through recasting the LDA classifier in a multiple penalized-regression setting. The regularization (penalization) matrix Ω that is used is often one which penalizes spatial roughness in the regression coefficients, i.e., the penalization term, β′Ωβ, becomes large when neighboring values in the regression coefficient vector β differ. When applied to data which are sampled signals, a common choice of Ω in (3) is a p × p tridiagonal matrix with all the elements along the main diagonal equal to one and all the elements directly above and below the main diagonal equal to −(1/2). Let us denote such a matrix RPDA.

If we let di denote the ith basis vector of the length-p discrete cosine transform, then [16]

RPDA = ∑_{i=1}^{p} wi di di′    (6)

where wi = 1 − cos(π(i − 1)/p). This can of course be rewritten in the form of (5)

RPDA = ∑_{i=1}^{p} wi Ri,   Ri = Vi⊥Vi⊥′

where Vi = I − di di′. In light of the interpretations of compound matrices in Section III-B, we can conclude that the resulting classifier is pulled more toward the classifiers one would get when doing linear feature reductions onto the low-frequency components. This interpretation not only holds for the original LDA classifier but also for the QDA. That is, the tridiagonal PDA regularization matrix can be applied directly in a QDA classifier, and the result is interpretable within our framework.
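A small NumPy sketch of the decomposition in (6), ours for illustration; note that, reconstructed this way, the two boundary diagonal entries come out as 1/2 rather than 1, a boundary detail we gloss over here.

```python
import numpy as np

def dct2_basis(p):
    """Orthonormal DCT-II basis vectors d_1, ..., d_p as columns."""
    n = np.arange(p)
    D = np.cos(np.pi * np.outer(n + 0.5, np.arange(p)) / p)   # column i ~ frequency i
    D[:, 0] *= np.sqrt(1.0 / p)
    D[:, 1:] *= np.sqrt(2.0 / p)
    return D

p = 8
D = dct2_basis(p)
w = 1.0 - np.cos(np.pi * np.arange(p) / p)   # w_i = 1 - cos(pi*(i-1)/p), low freq ~ 0

# Compound form (6): sum_i w_i d_i d_i'; low frequencies are left (almost) unpenalized.
R_pda = (D * w) @ D.T
print(np.round(R_pda, 3))   # tridiagonal 1 / -1/2 pattern in the interior rows;
                            # the two corner diagonal entries come out as 1/2
```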

IV. ADDING MULTISCALE SPECTRAL INFORMATION

In [17], we developed an approach for representing the spectral curves using a reduced number of bands. The center wavelengths, spectral widths, and intensity values of the new bands were optimally chosen to minimize the squared representation error. The locations of the new broad-covering spectral bands were global, in that they were shared by all spectra in all classes. For each "resolution" k, the dynamic programming algorithm returned the breakpoints that signaled the locations of k new broad-covering bands or segments. By varying the number k, one gets a multiresolution representation of the spectra with sharp transitions between the bands. An example of such a multiscale representation of a spectral curve is shown in Fig. 1. In [17], we chose one such scale, based on cross-validation (CV) on the classification error, and used the mean values in the segments as classifier features. That is, we used the location of the breakpoints to build a linear dimensionality reduction transform that merged the p original spectral bands into k broad-covering ones.

Fig. 1. Example of a spectral curve from the KSC data set represented using different numbers of segments, i.e., using different scale representations. From top to bottom: using 5, 10, 37, 73, and 145 (maximum, shown in bold) segments. The curves are vertically shifted for clarity.

The idea here is to combine the information about where the segment breakpoints are located at different scales into a regularization matrix that, loosely speaking, forces the bands that are most often in the same segment to act as if they were (almost) averaged into one feature before applying a classifier.

Letting Vk be a p × k matrix representing the linear dimensionality reduction into the subspace spanned by the k broad-covering spectral bands, one can say that adding Rk = Vk⊥Vk⊥′ to the sample covariance matrix will make the resulting classifier behave more like the classifier one would get in the k-dimensional subspace. As discussed in Section III-B, this covariance matrix estimate can again be augmented by adding Rl = Vl⊥Vl⊥′, l ≠ k, which can be interpreted as pulling the resulting classifier toward the classifier one would get in the l-dimensional subspace. By weighting and adding such matrices for all representation scales, we can form

R = ∑_{k=1}^{p} k^γ Rk,   Rk = Vk⊥Vk⊥′    (7)

where k^γ is the weighting term. A natural choice is γ > 1, which gives more emphasis on the Rk's that have low rank and hence keep "larger" subspaces unaffected, i.e., more emphasis on the dimensionality reductions that merge fewer (but hopefully more reliably chosen) neighboring bands. Examples of Rk matrices displayed as images are shown in Fig. 2.

A. Modifying the PDA Regularization Matrix

The multiscale representations consist of segmenting the spectral curves into contiguous intervals. At level k, which divides the curve into k segments, we can penalize each of the segments separately by forming a modified RkPDA by placing k smaller nonoverlapping RPDA matrices along the diagonal. Again, these modified PDA matrices can be weighted and compounded to incorporate information from multiple-scale representations of the curve. That is, analogous to (7), we can form

R = ∑_{k=1}^{p} k^γ RkPDA    (8)

where, again, γ > 1.

Fig. 2. (a) and (b) "One-scale" matrices R10 and R73, respectively, from (7), calculated using the KSC data set. The sum of each row is zero, and the elements outside the squares along the diagonals are all zero. Note that the locations of the squares correspond to those of the constant segments in Fig. 1. (c) and (d) Excerpts from the combined R matrices formed in (7) and (8), respectively, again created using the KSC set. Note that the R in (7) is the weighted sum of simpler matrices such as those depicted in (a) and (b). (e) and (f) Plots of the diagonals of the full R matrices in (7) and (8), respectively.

Fig. 2 shows details from R matrices formed using (7) and (8). Notice that the R matrix created using (8) retains the banded diagonal form of the original PDA matrix and that the values along the diagonal are directly related to how much differences between neighboring bands (at the corresponding location) are ignored by the resulting classifier.
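Continuing the previous sketch's (hypothetical) breakpoint convention, the block-diagonal matrices and the sum in (8) can be formed as follows.

```python
import numpy as np

def pda_matrix(m):
    """m x m tridiagonal PDA penalty: 1 on the diagonal, -1/2 next to it."""
    return np.eye(m) - 0.5 * np.eye(m, k=1) - 0.5 * np.eye(m, k=-1)

def segmented_pda(p, breakpoints):
    """Modified R_kPDA: one small PDA block per segment, placed along the
    diagonal so that only differences within a segment are penalized."""
    edges = list(breakpoints) + [p]
    R = np.zeros((p, p))
    for a, b in zip(edges[:-1], edges[1:]):
        R[a:b, a:b] = pda_matrix(b - a)
    return R

def multiscale_pda(p, segmentations, gamma=2.0):
    """R of (8): weighted sum of the segmented PDA matrices over all scales."""
    return sum((len(bp) ** gamma) * segmented_pda(p, bp) for bp in segmentations)
```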

V. EXPERIMENTS

A. Data Sets

To give an evaluation of the discussed regularization approaches, we have performed experiments on four hyperspectral images captured with different sensors and of various sceneries. The number of spectral bands ranges from 71 to 171. The first image, Pavia [18], is of an urban scene in Italy, captured by an airborne sensor under the HySens project on June 8, 2002. This data set consisted originally of 79 bands, but the last seven bands, which capture thermal infrared, together with a clearly noisy band at about 1.9580 μm, were excluded. The thermal infrared bands were omitted to make the data set comparable to the other data sets, which all stem from sensors covering only the reflective part of the spectrum. In the range 496–2412 nm, 71 bands are used. The image has a pixel size of 2.6 m, and the data are reflectance estimates. The ground truth is divided into nine classes. The second image, Rosis [19], was captured by an airborne sensor (ROSIS-02) during the European Multisensors Airborne Campaign on May 10, 1994. The location is the Fontainebleau forest south of Paris, containing ground-truthed areas of oak, beech, and pine trees. The data set has 81 spectral bands ranging from 430 to 830 nm. The image has a pixel size of 5.6 m and contains reflectance estimates without atmospheric correction. The third and fourth images, Botswana and Kennedy Space Center (KSC), are intended for vegetation inventory [20]. The former was captured by the Hyperion sensor aboard the NASA EO-1 satellite over the Okavango Delta, Botswana, on May 31, 2001, has a pixel size of 30 m, and the labeled data consist of 14 classes. This data set was preprocessed by the University of Texas Center for Space Research (CSR), where uncalibrated and noisy bands that covered water absorption were removed. The number of raw radiance bands used is 145, covering the range 448–2355 nm. The latter data set was captured by an airborne sensor over KSC at Cape Canaveral, Florida, on March 23, 1996, has a pixel size of about 20 m, and the ground truth consists of 13 classes. Preprocessing of the data was performed by CSR, including atmospheric correction and reflectance estimation in addition to mitigating the effects of bad detectors, interdetector miscalibration, and intermittent anomalies. The 171 bands are in the spectral range of 409–2438 nm. All the above data sets are well known, and the listed references are publications where these data sets are used with various classification algorithms.

Except for the Pavia data set, for which we used less than half the available data for training the classifiers, each of the data sets was divided into two equally sized, spatially separate, and mostly disjoint training and test sets. The full-sized training (and test) sets had sizes of 1859 (13 275), 8085 (8777), 2592 (2619), and 1607 (1641) samples for the Pavia, Rosis, KSC, and Botswana sets, respectively. After finding the multiresolution broad-covering spectral bands (which gives the Vk's), all data sets were normalized by subtracting the total mean and rescaling the mean within-class variances to one.

B. Experimental Details

Four different approaches for forming regularization matrices were implemented and compared in the QDA setting: the traditional ridge approach, in which R equals the identity matrix; the banded diagonal PDA matrix described in Section III-C; and the two matrices using multiresolution information in (7) and (8). Based on heuristic assumptions and empirical results, γ was set to two in (7) and (8). To ease comparison over different λ, the matrices were all normalized to have trace equal to the data-set dimensionality. Tenfold CV was used to determine the regularization parameters. Comparisons with a QDA classifier after a dimensionality reduction transform onto a single representation scale, as discussed in Section IV, were also performed.
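The trace normalization is a one-liner; it is shown here (our code) only to make the convention explicit.

```python
import numpy as np

def normalize_trace(R, p):
    """Rescale a regularization matrix so that trace(R) = p (the data-set
    dimensionality), easing comparison of lambda across regularizers."""
    return R * (p / np.trace(R))
```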

To help quantify how robust the proposed approaches are, we ran several experiments where we randomly sampled 80% of the training data to form new training sets which were used to build our classifiers.

To get a rough idea of how these regularization methods perform compared to other state-of-the-art classifiers, we also show some results using the support vector machine (SVM) classifier. The chosen SVM is based on an RBF kernel, and its parameters were selected based on a grid search and tenfold CV [21].

C. Results and Discussion

Fig. 3 shows the effect that varying the type and amount of regularization has on the misclassification rate for all four data sets. In two of the data sets, ROSIS and Botswana, the regularization approaches incorporating multiscale information reach the lowest classification error, while for the KSC set, they give a slightly worse best rate than using the PDA regularization matrix. The high-resolution urban set, Pavia, shows a clear preference for the ridge regularizer. If we study the choices of regularization determined by tenfold CV, listed in Table I, we see that the multiscale approaches give better or equal performance on all the sets except for the Pavia set. It is not easy to pin down exactly why the results for the Pavia set are so different from what we see using the other sets; however, the poor results using a single scale of the curve-segmentation algorithm for dimensionality reduction, seen in Table I, indicate that the multiscale-representation algorithm has problems extracting the necessary features for efficient classification of this qualitatively different urban data. This can lead to an unsatisfactory multiscale representation, which is the basis for constructing the regularization matrix. The resolution of the Pavia set might be a necessary part of explaining the odd behavior of the classifiers, although probably not a sufficient one, as the resolution of the ROSIS set can also be considered rather high.

Another noteworthy result shown in Fig. 3 is that the intervals for which the regularizations give acceptable classification rates are much wider for the multiscale approaches compared to what we see using the PDA matrix and, particularly, the ridge approach. This supports the interpretation that the regions that are deemed most likely to belong together, i.e., the regions in which the bands are in the same low-resolution band at many scale representations, can be (almost) averaged into a single feature while the resulting classifier still retains its performance. This focusing of the penalization of the band-value differences in certain regions seems to leave the regions more critical for classification less affected.

These long intervals of low classification error could make it easier to find a suitable λ in the final classifier. Table II shows how many times the different regularization approaches got the lowest misclassification rate in the 50 experiments using randomly drawn training sets. The results show that the long intervals of acceptable classification rates are clearly beneficial for at least two of the data sets. One of the data sets shows mixed results, while again, the Pavia data set does not benefit from using the multiscale information when building the regularization matrix at all.

Fig. 3. Classification errors using a QDA classifier with varying amount, and type, of regularization. The circles indicate the choices of λ determined by tenfold CV.

TABLE I. Misclassification rate in percent for various classifiers. Tenfold CV was used to tune algorithm parameters. (Single scale) The bottom row shows results from using a dimensionality reduction onto a lower scale representation, i.e., as in [17], we used tenfold CV to choose one of the Vk's in (7) and applied it as a dimensionality reduction transform.

TABLE II. Number of times each regularization approach gave the lowest misclassification rate in our (80%) randomly drawn training set experiments. Numbers in parentheses show the corresponding values when using the test set error (and not the CV error) to choose the best λ's.

Fig. 4. Classification errors using a QDA classifier with the regularizer in (7) for varying λ and γ values using the Botswana data set. Note the nonlinearity of the λ axis and that the best classification rate that can be reached (with varying λ) becomes worse for larger γ values.

Fig. 4 shows an example in which, when γ in (7) and (8) is increased, the range of λ giving acceptable classification results expands, although at the cost of a slightly worse best classification rate. An explanation for this might be that as γ is increased, the difference in emphasis between the segments that are "surest" of belonging together and the rest is increased, which allows for even longer intervals of λ with good performance. However, a certain amount of balance in the distribution of penalization across the curve is apparently needed to get a good classifier, as is evident from the increased lowest misclassification rate when γ is much larger than two. Note that although the curves shown in Fig. 4 seem to be shifted to the right, this is mainly due to the normalization of the regularization matrix and its inability to adapt to the extreme weights given for large k in (7) and (8) when γ is large.

VI. CONCLUSION

We have introduced a simple and intuitive framework for specifying and interpreting regularization matrices in the LDA and QDA classifier settings. It is proved that by adding a suitable regularization matrix, both the LDA and QDA classifiers can be made to act as a compromise between the full-dimensional classifier and the classifier one would get by first doing a linear feature reduction. These feature reduction transforms can be used to incorporate a priori knowledge into the classifier, either by fully specifying the transforms using a priori knowledge or by using knowledge of the problem to guide the estimation of appropriate feature reduction transforms. We have used the framework to specify regularization matrices which incorporate multiscale information derived from a curve-segmentation algorithm. The results of the experiments indicate that the proposed approach might be able to improve the peak performance of the classifiers in hyperspectral settings, although perhaps the most interesting consequence of the proposed method is the enlarged regularization intervals giving acceptable results.

APPENDIX I

Here, we show that the equivalent of the LDA or the QDA classifier in any linearly reduced subspace can be reached by using an appropriate regularization matrix. We prove this by showing that the discriminant function difference between any two classes becomes equal in both the full and dimensionality-reduced cases. Let V be an orthogonal p × k matrix defining a linear dimensionality reduction from p to k dimensions, and let R = I − VV′ = V⊥V⊥′. We begin, in Appendix I-A, by showing that the Mahalanobis distances become equal in both the full and dimensionality-reduced spaces when using a specific altered precision matrix. Then, in Appendix I-B, we show that by adding λR to the sample covariance matrix, its inverse approaches exactly this altered precision matrix. Finally, in Appendix I-C, we show that the ratios of the covariance matrix determinants also become equal, showing that the proposition holds for QDA as well as LDA.

A. Altering Σ⁻¹ to Classify as in Reduced Space

Let x be a point in the full p-dimensional space and z be the corresponding point in the reduced (k-dimensional) space, i.e., z = V′x. We denote Σ as the full p × p sample covariance matrix and Σk as the k × k sample covariance matrix in the reduced space. We have of course Σk = V′ΣV, and let us assume Σk to be invertible.

Replacing Σ⁻¹ by (VV′ΣVV′)†, where † denotes the Moore–Penrose pseudoinverse, causes the distances in the full and reduced spaces to be equal:

z′Σk⁻¹z = x′V(V′ΣV)⁻¹V′x = x′(VV′ΣVV′)†x.    (9)

Proving the last equation in (9) is the same as proving that V(V′ΣV)⁻¹V′ is the pseudoinverse of VV′ΣVV′. That is, by letting A = VV′ΣVV′ and B = V(V′ΣV)⁻¹V′, we must show that AB and BA are symmetric and that ABA = A and BAB = B (using V′V = I):

AB = VV′ΣV(V′V)(V′ΣV)⁻¹V′ = VV′ = (AB)′
BA = V(V′ΣV)⁻¹(V′V)V′ΣVV′ = VV′ = AB
ABA = V(V′V)V′ΣVV′ = VV′ΣVV′ = A
BAB = V(V′V)(V′ΣV)⁻¹V′ = V(V′ΣV)⁻¹V′ = B.

B. Continuous Dimensionality Reduction

In this section, we show that by adding λR to the full-dimensional sample covariance matrix, its inverse will approach the altered precision matrix in (9). That is, we prove that

(Σ + λR)⁻¹ → (VV′ΣVV′)†  as λ → ∞    (10)

where † denotes the pseudoinverse using the k nonzero eigenvalues.

Let A = Σ + λR. We know that A and A⁻¹ have the same eigenvectors, although with inverted eigenvalues. In addition, A and B = (1/λ)A have, of course, the same eigenvectors. If x is an eigenvector of R with eigenvalue λi, then x will also be an eigenvector of B (and thus A) in the limit as λ → ∞:

Bx = (1/λ)Σx + Rx → Rx = λix  as λ → ∞.

This shows that in the limiting case, any eigenvector of A is either an eigenvector of R and thus fully in the subspace null(V) or is fully in span(V).

Any eigenvector x of A with Rx ≠ 0 has an eigenvalue that goes to ∞ as λ → ∞:

lim_{λ→∞} x′Ax = lim_{λ→∞} (x′Σx + λ x′Rx) = ∞   (since x′Rx > 0)
⇒ lim_{λ→∞} x′A⁻¹x = 0.

Thus, as λ → ∞, the only eigenvectors of A⁻¹ with nonzero eigenvalues are the k eigenvectors included in (and spanning) span(V).

Now, let Ω = VV′. The eigenvectors in span(V) are retained when multiplied by Ω, i.e., x ∈ span(V) ⇒ Ωx = x, while x ∈ null(V) ⇒ Ωx = 0. Letting the columns of S contain the eigenvectors of A as λ → ∞, we see that ΩS has k nonzero columns corresponding to the k eigenvectors in span(V). We can thus remove the eigenvectors in null(V) by multiplying Ω on both sides of A:

(ΩS)D(S′Ω) = Ω(SDS′)Ω = ΩAΩ

where D is a diagonal matrix containing the eigenvalues of A. Furthermore, the eigenvectors in span(V) are independent of λ:

ΩAΩ = ΩΣΩ + λΩRΩ = ΩΣΩ   (since ΩRΩ = 0).

Thus, the eigenvectors and inverted eigenvalues of ΩΣΩ are the same as the nonzero eigenvectors and eigenvalues of lim_{λ→∞}(Σ + λR)⁻¹, which concludes the proof.
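And a numerical illustration (ours) of the limit in (10); the error decays roughly as 1/λ.

```python
import numpy as np

rng = np.random.default_rng(2)
p, k = 7, 3
V = np.linalg.qr(rng.standard_normal((p, k)))[0]
R = np.eye(p) - V @ V.T
A = rng.standard_normal((p, 2 * p))
cov = A @ A.T / (2 * p)

target = np.linalg.pinv(V @ V.T @ cov @ V @ V.T)
for lam in (1e2, 1e4, 1e6):
    approx = np.linalg.inv(cov + lam * R)
    print(lam, np.max(np.abs(approx - target)))   # error shrinks with lambda
```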

C. |Σi|/|Σj| Ratio in QDA

We know, from Appendices I-A and I-B, that our proposition is correct for the second term in (1), the Mahalanobis distance. In LDA, this is all that is needed to show equality of the classifiers, but in QDA, we must also show that the ratio |Σi|/|Σj| for any two classes i and j, or the difference between their log values, becomes equal in the regularized and the feature-reduced case. That is, letting Σkc denote class c's k × k sample covariance matrix, we must show that (choosing classes 1 and 2 for notational simplicity)

|Σ1 + λR| / |Σ2 + λR| → |Σk1| / |Σk2|  as λ → ∞

which is equivalent to showing that

∑_{i=1}^{p} (log λ1i − log λ2i) → ∑_{i=1}^{k} (log λk1i − log λk2i)  as λ → ∞

where λci and λkci are the ith eigenvalues of Σc + λR and Σkc, respectively.

From the proof in Appendix I-B, we know that, as λ → ∞, the k smallest eigenvalues of Σ + λR will be equal to the k nonzero eigenvalues of ΩΣΩ. Furthermore, the nonzero eigenvalues of ΩΣΩ are equal to the eigenvalues of Σk, since if z is an eigenvector of Σk with eigenvalue τ, then x = Vz is an eigenvector of ΩΣΩ with the same eigenvalue, τ:

(ΩΣΩ)x = VV′ΣV(V′V)z = V(V′ΣVz) = V(τz) = τVz = τx.

Since the rest of the eigenvalues will approach those of λR (as λ increases), we have

∑_{i=1}^{p−k} (log λ1i − log λ2i) + ∑_{i=p−k+1}^{p} (log λ1i − log λ2i)
→ ∑_{i=p−k+1}^{p} (log λ1i − log λ2i)  as λ → ∞  (the first sum tends to 0)
= ∑_{i=1}^{k} (log λk1i − log λk2i)

which concludes the proof.

APPENDIX II

Although it is a well-known result, to make the text self-contained, we prove in this section that two linear feature reduction transforms V0 and V1 give the same Mahalanobis distances and log-determinant ratios in their reduced spaces as long as span(V0) = span(V1). That is, any linear transform can be made orthogonal (i.e., V having orthogonal columns) without affecting the resulting classifier.

Let V0 and V1 be two p × k matrices representing the linear feature reduction transforms, and let A be a k × k invertible change-of-basis matrix: V0 = V1A. Let x be an arbitrary point in p-dimensional space, and let z0 = V0′x and z1 = V1′x. The Mahalanobis distances after applying V0 and V1 are then equal:

z0′Σ̃0⁻¹z0 = x′V0(V0′ΣV0)⁻¹V0′x
          = x′V1A(A′V1′ΣV1A)⁻¹A′V1′x
          = x′V1(AA⁻¹)(V1′ΣV1)⁻¹(A′⁻¹A′)V1′x
          = x′V1(V1′ΣV1)⁻¹V1′x = z1′Σ̃1⁻¹z1.

The ratio of the covariance matrix determinants, needed in QDA, also becomes equal (again, choosing classes 1 and 2 for notational convenience):

|V0′Σ1V0| / |V0′Σ2V0| = |A′V1′Σ1V1A| / |A′V1′Σ2V1A|
                      = (|A′||A||V1′Σ1V1|) / (|A′||A||V1′Σ2V1|)
                      = |V1′Σ1V1| / |V1′Σ2V1|.

ACKNOWLEDGMENT

The authors would like to thank Prof. G. Storvik for valuable input during the preparation of this paper.

REFERENCES

[1] A. C. Jensen, A. Berge, and A. S. Solberg, "Regression approaches to small sample inverse covariance matrix estimation for hyperspectral image classification," IEEE Trans. Geosci. Remote Sens., vol. 46, no. 10, pp. 2814–2822, Oct. 2008.

[2] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2nd ed. New York: Wiley-Interscience, Nov. 2000.

[3] J. Friedman, "Regularized discriminant analysis," J. Amer. Stat. Assoc., vol. 84, pp. 165–175, Mar. 1989.

[4] S. Aeberhard, D. Coomans, and O. de Vel, "Comparative analysis of statistical pattern recognition methods in high dimensional settings," Pattern Recognit., vol. 27, no. 8, pp. 1065–1077, Aug. 1994.

[5] T. Hastie, A. Buja, and R. Tibshirani, "Penalized discriminant analysis," Ann. Stat., vol. 23, no. 1, pp. 73–102, 1995.

[6] A. Berge, A. C. Jensen, and A. S. Solberg, "Sparse inverse covariance estimates for hyperspectral image classification," IEEE Trans. Geosci. Remote Sens., vol. 45, no. 5, pp. 1399–1407, May 2007.

[7] J. Z. Huang, N. Liu, M. Pourahmadi, and L. Liu, "Covariance selection and estimation via penalised normal likelihood," Biometrika, vol. 93, no. 1, pp. 85–98, Mar. 2006.

[8] W. J. Krzanowski, "Antedependence models in the analysis of multi-group high-dimensional data," J. Appl. Stat., vol. 26, no. 1, pp. 59–67, Jan. 1999.

[9] P. J. Bickel and E. Levina, "Regularized estimation of large covariance matrices," Ann. Stat., vol. 36, no. 1, pp. 199–227, 2008.


[10] J. Friedman, T. Hastie, and R. Tibshirani, "Sparse inverse covariance estimation with the graphical lasso," Biostatistics, vol. 9, no. 3, pp. 432–441, Jul. 2008.

[11] J. Hoffbeck and D. Landgrebe, "Covariance matrix estimation and classification with limited training data," IEEE Trans. Pattern Anal. Mach. Intell., vol. 18, no. 7, pp. 763–767, Jul. 1996.

[12] B.-C. Kuo and K.-Y. Chang, "Feature extractions for small sample size classification problem," IEEE Trans. Geosci. Remote Sens., vol. 45, no. 3, pp. 756–764, Mar. 2007.

[13] B. Yu, M. Ostland, P. Gong, and R. Pu, "Penalized discriminant analysis of in situ hyperspectral data for conifer species recognition," IEEE Trans. Geosci. Remote Sens., vol. 37, no. 5, pp. 2569–2577, Sep. 1999.

[14] T. Bandos, L. Bruzzone, and G. Camps-Valls, "Classification of hyperspectral images with regularized linear discriminant analysis," IEEE Trans. Geosci. Remote Sens., vol. 47, no. 3, pp. 862–873, Mar. 2009.

[15] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Berlin, Germany: Springer-Verlag, 2001.

[16] G. Strang, "The discrete cosine transform," SIAM Rev., vol. 41, no. 1, pp. 135–147, Mar. 1999.

[17] A. C. Jensen and A. S. Solberg, "Fast hyperspectral feature reduction using piecewise constant function approximations," IEEE Geosci. Remote Sens. Lett., vol. 4, no. 4, pp. 547–551, Oct. 2007.

[18] P. Gamba, "A collection of data for urban area characterization," in Proc. IEEE IGARSS, 2004, pp. 69–72.

[19] A. H. S. Solberg, "Contextual data fusion applied to forest map revision," IEEE Trans. Geosci. Remote Sens., vol. 37, no. 3, pp. 1234–1243, May 1999.

[20] J. Ham, Y. Chen, M. M. Crawford, and J. Ghosh, "Investigation of the random forest framework for classification of hyperspectral data," IEEE Trans. Geosci. Remote Sens., vol. 43, no. 3, pp. 492–501, Mar. 2005.

[21] C.-C. Chang and C.-J. Lin, LIBSVM: A Library for Support Vector Machines, 2001. [Online]. Available: http://www.csie.ntu.edu.tw/~cjlin/libsvm

Are C. Jensen (S'05–M'09) received the M.Sc. and Ph.D. degrees from the University of Oslo, Oslo, Norway, in 2003 and 2009, respectively.

He is currently with the Department of Informatics, University of Oslo. His research interests include pattern recognition, image analysis, and signal processing for imaging.

Marco Loog received the Ph.D. degree from the Image Sciences Institute, Utrecht, The Netherlands, for the development and improvement of statistical pattern recognition methods and their use in the processing and analysis of images.

Afterward, he moved to Copenhagen, Denmark, where he acted as Assistant and, eventually, Associate Professor. After several years, he relocated to the Delft University of Technology, Delft, The Netherlands, where he now works as an Assistant Professor and Coordinator of the Pattern Recognition Laboratory. Currently, his ever-evolving research interests include multiscale image analysis, semisupervised and multi-instance learning, saliency, the dissimilarity approach, and black math.

Anne H. Schistad Solberg (S'92–M'96) received the M.S. degree in computer science and the Ph.D. degree in image analysis, both from the University of Oslo, Oslo, Norway, in 1989 and 1995, respectively.

During 1991–1992, she was a Visiting Scholar with the Department of Computer Science, Michigan State University, East Lansing. She is currently a Professor in the Digital Signal Processing and Image Analysis Group with the Department of Informatics, University of Oslo. Her research interests include SAR image analysis, oil spill detection, hyperspectral imagery, statistical classification, and data fusion.