Neural Process Lett, DOI 10.1007/s11063-013-9330-9

A Novel Indefinite Kernel Dimensionality Reduction Algorithm: Weighted Generalized Indefinite Kernel Discriminant Analysis

Jing Yang · Liya Fan

© Springer Science+Business Media New York 2013

Abstract Kernel methods are becoming increasingly popular for many real-world learning problems, but these methods are frequently considered to be restricted to positive definite kernels. In practice, however, indefinite kernels arise and demand application in pattern analysis. In this paper, we present several formal extensions of kernel discriminant analysis (KDA) methods which can be used with indefinite kernels. In particular, they include indefinite KDA (IKDA) based on the generalized singular value decomposition (IKDA/GSVD), pseudo-inverse IKDA, null space IKDA and range space IKDA. As in the case of LDA-based algorithms, IKDA-based algorithms fail to account for the different contribution of each pair of classes to the discrimination. To remedy this problem, weighted schemes are incorporated into the IKDA extensions in this paper, and we call the resulting methods weighted generalized IKDA algorithms. Experiments on two real-world data sets are performed to test and evaluate the effectiveness of the proposed algorithms and the effect of the weights on indefinite kernel functions. The results show that the effect of the weighted schemes is very significant.

Keywords Indefinite kernel discriminant analysis · Undersampled problem · Weighting function · Indefinite kernel function · Classification accuracy

J. Yang (B)
School of Computer Science and Technology, Nanjing University of Science and Technology, Nanjing 210094, People's Republic of China
e-mail: [email protected]

L. Fan
School of Mathematics Sciences, Liaocheng University, Liaocheng 252059, People's Republic of China
e-mail: [email protected]

1 Introduction

Kernel methods [1,2] such as kernel discriminant analysis (KDA) have recently attracted much attention due to their good generalization performance and appealing optimization approaches. The essence of kernel discriminant analysis is to map the data into an implicit high-dimensional feature space, but these kernel methods are usually considered to be restricted to positive definite kernels. In practical applications, however, there is a much larger class of kernel functions available which are not necessarily positive definite but which can nonetheless be used for pattern analysis. Therefore, in this paper we present formal extensions of several kernel discriminant analysis methods which can be used with indefinite kernels. Like the LDA-based methods, however, the IKDA-based methods usually encounter two problems. One is the singularity problem caused by undersampled data. To overcome this problem, we briefly present several IKDA extensions: IKDA/GSVD [3], pseudo-inverse IKDA (PIKDA), null space IKDA (NIKDA) and range space IKDA (RSIKDA). IKDA/GSVD is a generalization of LDA based on indefinite kernel functions and the GSVD; it overcomes the singularity of the scatter matrices by applying the GSVD to solve the generalized eigenvalue problem in the feature space. It is easy to see that the IKDA solution is a special case of the IKDA/GSVD method. In PIKDA, the inverse of the indefinite kernel scatter matrix is replaced by its pseudo-inverse. In NIKDA, the between-class distance is maximized in the null space of the within-class scatter matrix of the indefinite kernel matrix. RSIKDA handles undersampled problems through a transformation by a basis of the range space of the within-class scatter matrix of the indefinite kernel matrix.

Another drawback of the IKDA methods is that they fail to account for the different contribution of each pair of classes to the discrimination. A promising solution to this problem is to introduce weighted schemes into the criteria. Motivated by the form of the weighted between-class scatter matrix proposed by Loog et al. [4], in this paper we reformulate the IKDA-based methods in weighted forms and call them weighted generalized IKDA algorithms, where the weight aims to emphasize the different roles of the individual class pairs in the discrimination. By using the weighted form of the between-class scatter matrix, we can take advantage of the class membership information. Moreover, we present weighted versions of IKDA/GSVD, PIKDA, null space IKDA and range space IKDA with five weighting functions for each weighted scheme, where the K-nearest neighbors (KNN) method [5] is used as the classifier. We apply the pseudo-Euclidean distance $d_{ij} = \|m_i - m_j\|^4$ between the means of classes $i$ and $j$ in the weighting function $w(d_{ij})$. A weighting function is generally a monotonically decreasing function, because classes that are closer to one another are more likely to be confused and should therefore be given a greater weight. In this paper, we focus on studying the effectiveness of the proposed algorithms and the effect of the weights on the indefinite kernel functions and the dimensionality reduction algorithms. Recent research on weighted methods shows that an appropriate choice of the weight in the criterion plays a crucial role in the performance delivered. To further study the effect of the weighting functions, we choose different kernel functions and two real-world data sets in our experiments. Extensive comparisons of different weighting functions on the IKDA methods are conducted.

The rest of the paper is organized as follows. In Sect. 2, we briefly review indefinite kernels and present the IKDA and generalized IKDA algorithms. Weighted versions of the generalized IKDA algorithms and the weighting functions are introduced in Sect. 3. Extensive experiments with the proposed algorithms are reported in Sect. 4; the results demonstrate the effectiveness of the proposed algorithms and the effect of the weighting functions. Conclusions follow in Sect. 5.

2 Generalized Indefinite Kernel Discriminant Analysis

In this section, we first give a short discussion of indefinite kernels. The proper frame for indefinite kernel functions is an indefinite vector space such as a pseudo-Euclidean space [6,7] or, more generally, a Krein space [8]. A Krein space over $\mathbb{R}$ is a vector space $\mathcal{K}$ equipped with an indefinite inner product $\langle\cdot,\cdot\rangle_{\mathcal{K}}$ such that $\mathcal{K}$ admits an orthogonal decomposition as a direct sum $\mathcal{K} = \mathcal{K}_+ \oplus \mathcal{K}_-$, where $(\mathcal{K}_+, \langle\cdot,\cdot\rangle_+)$ and $(\mathcal{K}_-, \langle\cdot,\cdot\rangle_-)$ are separable Hilbert spaces with their corresponding positive definite inner products. Let $P_+$ and $P_-$ be the orthogonal projections onto $\mathcal{K}_+$ and $\mathcal{K}_-$, respectively; then any $\xi \in \mathcal{K}$ can be represented as $\xi = P_+\xi + P_-\xi$, and $I_{\mathcal{K}} = P_+ + P_-$ is the identity operator. The linear operator $\mathcal{J} = P_+ - P_-$ is the basic characteristic of a Krein space $\mathcal{K}$, satisfying $\mathcal{J} = \mathcal{J}^{-1} = \mathcal{J}^T$. The space $\mathcal{K}$ can be turned into its associated Hilbert space $\mathcal{H}$ by using the positive definite inner product $\langle\xi, \xi'\rangle_{\mathcal{H}} = \langle\xi, \mathcal{J}\xi'\rangle_{\mathcal{K}}$. We use the "transposition" abbreviation $\xi^T\xi' = \langle\xi, \xi'\rangle_{\mathcal{H}}$ and, additionally (motivated by $\mathcal{J}$ operating as a sort of "conjugation"), the "conjugate transposition" notation $\xi^{*}\xi' = \langle\xi, \xi'\rangle_{\mathcal{K}} = \langle\mathcal{J}\xi, \xi'\rangle_{\mathcal{H}} = (\mathcal{J}\xi)^T\xi' = \xi^T\mathcal{J}\xi'$.

Finite-dimensional Krein spaces with $\mathcal{K}_+ = \mathbb{R}^p$ and $\mathcal{K}_- = \mathbb{R}^q$ are denoted by $\mathbb{R}^{(p,q)}$ and called pseudo-Euclidean spaces. The pair $(p, q)$ is the signature of the pseudo-Euclidean space, and $\mathcal{J}$ becomes the matrix $J = \mathrm{diag}(\mathbf{1}_p, -\mathbf{1}_q)$ with respect to an orthonormal basis in $\mathbb{R}^{(p,q)}$, where $\mathbf{1}_p \in \mathbb{R}^p$ denotes the $p$-dimensional vector of all ones. The indefinite inner product defines a squared norm in the usual way, $\|x\|_{\mathcal{K}}^2 = \langle x, x\rangle_{\mathcal{K}}$, which consequently defines the squared distances $\|x - x'\|_{\mathcal{K}}^2$.

We assume we are given a data matrix $X = [x_1, \ldots, x_N] \in \mathbb{R}^{m \times N}$ and that the original data is already clustered and partitioned into $r$ classes. Let $X = [X_1, \ldots, X_r]$, where $X_i \in \mathbb{R}^{m \times N_i}$ is the $i$th class containing $N_i$ examples, with $x_j^i$ denoting the $j$th example in $X_i$, and $\sum_{i=1}^{r} N_i = N$. Let $\psi: X \to \mathcal{K}$ be a mapping of the data into a Krein space $\mathcal{K}$ and $\Phi = [\psi(x_1), \ldots, \psi(x_N)]$ be the sequence of images of $X$ in $\mathcal{K}$. Assume that the indefinite kernel function $k: X \times X \to \mathbb{R}$ encodes the inner product $k(x_i, x_j) = \langle \psi(x_i), \psi(x_j)\rangle_{\mathcal{K}} = \psi(x_i)^T \mathcal{J} \psi(x_j)$ in $\mathcal{K}$; as a result, the kernel matrix is $K = \Phi^T \mathcal{J} \Phi$. The total mean of the mapped data is defined as $m^{\Phi} = (1/N)\sum_{i=1}^{N} \psi(x_i) = (1/N)\Phi \mathbf{1}_N$ with $\mathbf{1}_N \in \mathbb{R}^N$. If $\Phi_i = [\psi(x_1^i), \ldots, \psi(x_{N_i}^i)]$ denotes the $i$th class of the mapped data, we can define its class mean as $m_i^{\Phi} = (1/N_i)\sum_{j=1}^{N_i} \psi(x_j^i) = (1/N_i)\Phi_i \mathbf{1}_{N_i}$, and we abbreviate the column blocks of the kernel matrix as $K_i = \Phi^T \mathcal{J} \Phi_i \in \mathbb{R}^{N \times N_i}$.
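All later quantities are built from the kernel matrix $K$ and its class-wise column blocks $K_i$. As a concrete illustration, a minimal NumPy sketch might look as follows (the function names are ours, and `kernel` stands for any of the indefinite kernel functions considered in Sect. 4):

```python
import numpy as np

def indefinite_gram(X, kernel):
    """Kernel matrix K = [k(x_i, x_j)] for the rows of X, where k may be indefinite."""
    N = X.shape[0]
    K = np.array([[kernel(X[i], X[j]) for j in range(N)] for i in range(N)])
    return 0.5 * (K + K.T)  # symmetrize against round-off

def class_blocks(K, labels):
    """Column blocks K_i of the kernel matrix, one per class label."""
    labels = np.asarray(labels)
    return {cls: K[:, labels == cls] for cls in np.unique(labels)}
```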

2.1 Indefinite Kernel Discriminant Analysis

The Fisher linear discriminant attempts to find a direction $w \in \mathcal{K}$ such that the between-class scatter is maximized while the within-class scatter is minimized. In analogy to the positive definite case, the indefinite Fisher linear discriminant [9]

$$f(x) = \langle w, \psi(x)\rangle_{\mathcal{K}} + b = w^{*}\psi(x) + b$$

is defined by the vector $w$ that maximizes either of the following Fisher criteria:

$$J_1(w) = \frac{\langle w, S_b^{\mathcal{K}} w\rangle_{\mathcal{K}}}{\langle w, S_w^{\mathcal{K}} w\rangle_{\mathcal{K}}} = \frac{w^{*} S_b^{\mathcal{K}} w}{w^{*} S_w^{\mathcal{K}} w}, \qquad (1)$$

$$J_2(w) = \frac{\langle w, S_b^{\mathcal{K}} w\rangle_{\mathcal{K}}}{\langle w, S_t^{\mathcal{K}} w\rangle_{\mathcal{K}}} = \frac{w^{*} S_b^{\mathcal{K}} w}{w^{*} S_t^{\mathcal{K}} w}, \qquad (2)$$

where the between-class scatter operator acts as

$$S_b^{\mathcal{K}} w = \sum_{i=1}^{r} N_i (m_i^{\Phi} - m^{\Phi}) \langle m_i^{\Phi} - m^{\Phi}, w\rangle_{\mathcal{K}} = \sum_{i=1}^{r} N_i (m_i^{\Phi} - m^{\Phi})(m_i^{\Phi} - m^{\Phi})^T \mathcal{J} w.$$

Then we have $S_b^{\mathcal{K}} = S_b^{\mathcal{H}} \mathcal{J}$, where $S_b^{\mathcal{H}} = \sum_{i=1}^{r} N_i (m_i^{\Phi} - m^{\Phi})(m_i^{\Phi} - m^{\Phi})^T$ is the Hilbert between-class scatter operator in $\mathcal{H}$. Similarly, the total scatter operator and the within-class scatter operator can be expressed as $S_t^{\mathcal{K}} = S_t^{\mathcal{H}} \mathcal{J}$ and $S_w^{\mathcal{K}} = S_w^{\mathcal{H}} \mathcal{J}$, with the Hilbert total scatter operator and the Hilbert within-class scatter operator given by $S_t^{\mathcal{H}} = \sum_{j=1}^{N} (\psi(x_j) - m^{\Phi})(\psi(x_j) - m^{\Phi})^T$ and $S_w^{\mathcal{H}} = \sum_{i=1}^{r} \sum_{j=1}^{N_i} (\psi(x_j^i) - m_i^{\Phi})(\psi(x_j^i) - m_i^{\Phi})^T$. The Fisher criteria (1) and (2) can therefore be rewritten as

$$J_1'(w) = \frac{w^T \mathcal{J} S_b^{\mathcal{H}} \mathcal{J} w}{w^T \mathcal{J} S_w^{\mathcal{H}} \mathcal{J} w}, \qquad (3)$$

$$J_2'(w) = \frac{w^T \mathcal{J} S_b^{\mathcal{H}} \mathcal{J} w}{w^T \mathcal{J} S_t^{\mathcal{H}} \mathcal{J} w}. \qquad (4)$$

As for positive definite kernel discriminant analysis, it can be shown that the eigenvector $w$ must lie in the space spanned by $\{\psi(x_i)\}_{i=1}^{N}$ in $\mathcal{K}$, and thus it can be expressed as the linear expansion

$$w = \sum_{i=1}^{N} \alpha_i \psi(x_i). \qquad (5)$$

Substituting Eq. (5) into the numerator and denominator of Eqs. (3) and (4), we obtain $w^T \mathcal{J} S_b^{\mathcal{H}} \mathcal{J} w = \alpha^T K_b \alpha$, $w^T \mathcal{J} S_w^{\mathcal{H}} \mathcal{J} w = \alpha^T K_w \alpha$ and $w^T \mathcal{J} S_t^{\mathcal{H}} \mathcal{J} w = \alpha^T K_t \alpha$, so the Fisher criteria in Eqs. (3) and (4) can be rewritten as

$$J_1(\alpha) = \frac{\alpha^T K_b \alpha}{\alpha^T K_w \alpha}, \qquad (6)$$

$$J_2(\alpha) = \frac{\alpha^T K_b \alpha}{\alpha^T K_t \alpha}, \qquad (7)$$

where $\alpha = (\alpha_1, \alpha_2, \ldots, \alpha_N)^T$. $K_b$, $K_w$ and $K_t$ can be seen as the scatter matrices of the indefinite kernel matrix $K = [k(x_i, x_j)]_{N \times N}$ when each column of $K$ is considered as a point in the $N$-dimensional space; $K_b$, $K_w$ and $K_t$ can now be computed based on the kernel data. Define $c = \frac{1}{N}\mathbf{1}_N$ and

$$c_i = \frac{1}{N_i}(\underbrace{0,\ldots,0}_{N_1+\cdots+N_{i-1}},\ \underbrace{1,\ldots,1}_{N_i},\ \underbrace{0,\ldots,0}_{N_{i+1}+\cdots+N_r})^T.$$

Then the Hilbert between-class scatter operator can be rewritten as $S_b^{\mathcal{H}} = \sum_{i=1}^{r} N_i \Phi (c_i - c)(c_i - c)^T \Phi^{*} = \sum_{i=1}^{r} N_i \Phi (c_i - c)(c_i - c)^T \Phi^T \mathcal{J}$, and we obtain

$$K_b = \Phi^T \mathcal{J} S_b^{\mathcal{H}} \Phi = \Phi^T \mathcal{J} \Phi \sum_{i=1}^{r} N_i (c_i - c)(c_i - c)^T \Phi^T \mathcal{J} \Phi = K \sum_{i=1}^{r} N_i (c_i - c)(c_i - c)^T K = \sum_{i=1}^{r} N_i \bigl[K^T (c_i - c)\bigr]\bigl[K^T (c_i - c)\bigr]^T.$$

Further, the Hilbert within-class scatter operator can be rewritten as

$$S_w^{\mathcal{H}} = \sum_{i=1}^{r} \Phi_i \Bigl(I_{N_i} - \tfrac{1}{N_i}\mathbf{1}_{N_i}\mathbf{1}_{N_i}^T\Bigr)\Bigl(I_{N_i} - \tfrac{1}{N_i}\mathbf{1}_{N_i}\mathbf{1}_{N_i}^T\Bigr)^T \Phi_i^{*} = \sum_{i=1}^{r} \Phi_i \Bigl(I_{N_i} - \tfrac{1}{N_i}\mathbf{1}_{N_i}\mathbf{1}_{N_i}^T\Bigr)\Bigl(I_{N_i} - \tfrac{1}{N_i}\mathbf{1}_{N_i}\mathbf{1}_{N_i}^T\Bigr)^T \Phi_i^T \mathcal{J},$$

and then we have

$$K_w = \sum_{i=1}^{r} \Phi^T \mathcal{J} \Phi_i \Bigl(I_{N_i} - \tfrac{1}{N_i}\mathbf{1}_{N_i}\mathbf{1}_{N_i}^T\Bigr)\Bigl(I_{N_i} - \tfrac{1}{N_i}\mathbf{1}_{N_i}\mathbf{1}_{N_i}^T\Bigr)^T \Phi_i^T \mathcal{J} \Phi = \sum_{i=1}^{r} K_i H_i H_i^T K_i^T = \sum_{i=1}^{r} (K_i H_i)(K_i H_i)^T,$$

where $H_i = I_{N_i} - \frac{1}{N_i}\mathbf{1}_{N_i}\mathbf{1}_{N_i}^T$. Similarly, we can show that

$$K_t = \Phi^T \mathcal{J} \Phi \Bigl(I_N - \tfrac{1}{N}\mathbf{1}_N\mathbf{1}_N^T\Bigr)\Bigl(I_N - \tfrac{1}{N}\mathbf{1}_N\mathbf{1}_N^T\Bigr)^T \Phi^T \mathcal{J} \Phi = K H H^T K = (K H)(K H)^T,$$

where $H = I_N - \frac{1}{N}\mathbf{1}_N\mathbf{1}_N^T$. Note that $K_b$, $K_w$ and $K_t$ are positive semidefinite by construction. As a result, the solution to Eq. (1) can be obtained by maximizing Eq. (6) instead, giving as solution the $m$ leading eigenvectors $\alpha_1, \ldots, \alpha_m$ of the eigenvalue problem

$$K_b \alpha = \lambda K_w \alpha.$$

Obviously, thanks to the positive semidefiniteness of $K_b$ and $K_w$, the eigenvalues $\lambda$ are nonnegative.
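To make this construction concrete, the following sketch (our own NumPy/SciPy illustration, not code from the paper) assembles $K_b$, $K_w$ and $K_t$ from a precomputed kernel matrix and class labels, and extracts the leading discriminant vectors; a pseudo-inverse of $K_w$ is used here to cope with singularity, anticipating the PIKDA variant of Sect. 2.3:

```python
import numpy as np
from scipy.linalg import pinvh

def kernel_scatter_matrices(K, labels):
    """Build K_b, K_w and K_t from a (possibly indefinite) kernel matrix K
    and a vector of class labels, following the formulas of Sect. 2.1."""
    labels = np.asarray(labels)
    N = K.shape[0]
    c = np.full(N, 1.0 / N)                       # total indicator vector c
    Kb = np.zeros((N, N))
    Kw = np.zeros((N, N))
    for cls in np.unique(labels):
        idx = np.where(labels == cls)[0]
        Ni = len(idx)
        ci = np.zeros(N)
        ci[idx] = 1.0 / Ni                        # class indicator vector c_i
        v = K.T @ (ci - c)
        Kb += Ni * np.outer(v, v)                 # sum_i N_i [K^T(c_i - c)][K^T(c_i - c)]^T
        Hi = np.eye(Ni) - np.ones((Ni, Ni)) / Ni  # centering matrix H_i
        KiHi = K[:, idx] @ Hi                     # K_i H_i
        Kw += KiHi @ KiHi.T                       # sum_i (K_i H_i)(K_i H_i)^T
    H = np.eye(N) - np.ones((N, N)) / N
    Kt = (K @ H) @ (K @ H).T                      # K_t = (K H)(K H)^T
    return Kb, Kw, Kt

def ikda_directions(Kb, Kw, m):
    """m leading eigenvectors of K_b alpha = lambda K_w alpha; since K_w is
    singular for undersampled problems, a pseudo-inverse is used here."""
    vals, vecs = np.linalg.eig(pinvh(Kw) @ Kb)
    order = np.argsort(-vals.real)
    return vecs[:, order[:m]].real
```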

For any input vector $x$, its low-dimensional feature representation $y = (y_1, \ldots, y_m)^T$ can then be obtained via the indefinite inner product as

$$y = \alpha^T \Phi^{*} \psi(x) = (\alpha_1, \ldots, \alpha_m)^T (k(x_1, x), \ldots, k(x_N, x))^T.$$

Note that the solution above is based on the assumption that the indefinite kernel within-class scatter matrix $K_w$ is invertible. However, just as in the positive definite case, the matrix $K_w$ is often singular and maximizing (6) is then not well defined. To solve this problem, we apply several generalizations of the IKDA algorithm in the following subsections.

2.2 Indefinite Kernel Discriminant Analysis Based on GSVD

In this subsection, we present an extension of LDA based on indefinite kernel functions and the GSVD.

Howland et al. [10] proposed LDA/GSVD, an extension of LDA based on the GSVD. It overcomes the singularity of the scatter matrices by applying the GSVD to solve the generalized eigenvalue problem. An efficient algorithm for LDA/GSVD was presented in [11]. Note that $K_b$, $K_w$ and $K_t$ are all singular, so the classical criterion cannot be applied directly; as in [3], we therefore present a generalization of IKDA based on the GSVD, denoted IKDA/GSVD. In analogy to the efficient algorithm for LDA/GSVD, an efficient algorithm for IKDA/GSVD is given as follows.

Algorithm 2.1 IKDA/GSVD

1. Compute the EVD of $K_t$: $K_t = [U_1\ U_2]\begin{bmatrix}\Sigma_1 & 0\\ 0 & 0\end{bmatrix}\begin{bmatrix}U_1^T\\ U_2^T\end{bmatrix}$.
2. Put $\hat{K}_b = \Sigma_1^{-1/2} U_1^T K_b U_1 \Sigma_1^{-1/2}$ and compute $V$ from the EVD of $\hat{K}_b$: $\hat{K}_b = V \Sigma_b^T \Sigma_b V^T$.
3. Assign the first $r-1$ columns of $U_1 \Sigma_1^{-1/2} V$ to $G_h$.
4. For any input vector $x$, its low-dimensional feature representation $z$ is then given by $z = G_h^T (k(x_1, x), \ldots, k(x_N, x))^T$.
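A minimal NumPy sketch of Algorithm 2.1 might look as follows (the function name, the tolerance used to estimate the numerical rank of $K_t$, and the descending ordering of the eigenvectors of $\hat{K}_b$ are our own implementation choices):

```python
import numpy as np

def ikda_gsvd(Kb, Kt, r, tol=1e-10):
    """Sketch of Algorithm 2.1 (IKDA/GSVD): returns the transformation G_h."""
    # Step 1: EVD of K_t; keep the nonzero block Sigma_1 and the columns U_1.
    evals, U = np.linalg.eigh(Kt)
    keep = evals > tol * evals.max()
    U1, S1 = U[:, keep], evals[keep]
    # Step 2: EVD of the reduced between-class matrix.
    S1_inv_sqrt = np.diag(1.0 / np.sqrt(S1))
    Kb_hat = S1_inv_sqrt @ U1.T @ Kb @ U1 @ S1_inv_sqrt
    _, V = np.linalg.eigh(Kb_hat)
    V = V[:, ::-1]                          # eigenvectors in decreasing eigenvalue order
    # Step 3: first r - 1 columns of U_1 Sigma_1^{-1/2} V.
    return (U1 @ S1_inv_sqrt @ V)[:, :r - 1]

# Step 4: for a new point x, z = Gh.T @ kx with kx = (k(x_1, x), ..., k(x_N, x))^T.
```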


2.3 Pseudo-inverse IKDA

As discussed for IKDA, $K_b$ and $K_w$ are both singular, so the classical KDA solution cannot be applied. One simple way of addressing this problem is to use the pseudo-inverse of $K_w$ instead; we call this method pseudo-inverse IKDA (PIKDA).

In a pseudo-inverse algorithm the inverse of a scatter matrix is replaced by its pseudo-inverse, and it is easy to see that IKDA/GSVD is a special case of pseudo-inverse IKDA. Hence, similarly to Algorithm 2.1, an efficient algorithm for pseudo-inverse IKDA can be derived as follows.

Algorithm 2.2 Pseudo-inverse IKDA

1. Compute the EVD of $K_t$: $K_t = [U_1\ U_2]\begin{bmatrix}\Sigma_1 & 0\\ 0 & 0\end{bmatrix}\begin{bmatrix}U_1^T\\ U_2^T\end{bmatrix}$.
2. Put $\hat{K}_b = \Sigma_1^{-1/2} U_1^T K_b U_1 \Sigma_1^{-1/2}$ and compute $V$ from the EVD of $\hat{K}_b$: $\hat{K}_b = V \Sigma_b^T \Sigma_b V^T$.
3. Assign the first $r-1$ columns of $U_1 \Sigma_1^{-1/2} V$ to $X_\delta$.
4. Put $G = X_\delta M$, where $M$ is any nonsingular matrix.
5. For any input vector $x$, its low-dimensional feature representation $z$ is then given by $z = G^T (k(x_1, x), \ldots, k(x_N, x))^T$.
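Algorithm 2.2 shares steps 1–3 with Algorithm 2.1 and differs only in the final mixing step $G = X_\delta M$. A short sketch reusing the `ikda_gsvd` routine above (the random choice of $M$, which is nonsingular with probability one, is our own illustration):

```python
import numpy as np

def pikda(Kb, Kt, r, seed=0):
    """Sketch of Algorithm 2.2: X_delta as in IKDA/GSVD, then G = X_delta M."""
    X_delta = ikda_gsvd(Kb, Kt, r)               # steps 1-3, shared with Algorithm 2.1
    rng = np.random.default_rng(seed)
    M = rng.standard_normal((r - 1, r - 1))      # any nonsingular matrix M
    return X_delta @ M                           # step 4
```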

2.4 IKDA Based on the Projection Onto Null(Kw)

Chen et al. [12] proposed null space LDA (NLDA) for dimensionality reduction of undersampled problems. In this subsection, we present a method in which the indefinite kernel function is incorporated into LDA in the null space of the within-class scatter matrix; we call this method NIKDA for short. Let the EVD of $K_w \in \mathbb{R}^{N \times N}$ be

$$K_w = U_w \Sigma_w U_w^T = [U_{w1}\ U_{w2}]\begin{bmatrix}\Sigma_{w1} & 0\\ 0 & 0\end{bmatrix}\begin{bmatrix}U_{w1}^T\\ U_{w2}^T\end{bmatrix},$$

where $s_1 = \mathrm{rank}(K_w)$, $U_w$ is orthogonal, $\Sigma_{w1}$ is a diagonal matrix with nonincreasing positive diagonal elements, and $U_{w1}$ contains the first $s_1$ columns of the orthogonal matrix $U_w$. It is easy to show that $\mathrm{null}(K_w) = \mathrm{span}(U_{w2})$ and that the transformation by $U_{w2}U_{w2}^T$ projects the data in the feature space $\mathcal{K}$ onto $\mathrm{null}(K_w)$. The between-class scatter matrix in the transformed space is $\tilde{K}_b = U_{w2}U_{w2}^T K_b U_{w2}U_{w2}^T$. Consider the EVD of $\tilde{K}_b$:

$$\tilde{K}_b = U_b \Sigma_b U_b^T = [U_{b1}\ U_{b2}]\begin{bmatrix}\Sigma_{b1} & 0\\ 0 & 0\end{bmatrix}\begin{bmatrix}U_{b1}^T\\ U_{b2}^T\end{bmatrix},$$

where $s_2 = \mathrm{rank}(\tilde{K}_b)$, $U_{b1} \in \mathbb{R}^{N \times s_2}$ and $\Sigma_{b1} \in \mathbb{R}^{s_2 \times s_2}$. In NIKDA, the optimal transformation matrix $G_e$ is obtained as $G_e = U_{w2}U_{w2}^T U_{b1}$. Hence, for any input vector $x$, its low-dimensional feature representation $z$ is given by $z = G_e^T (k(x_1, x), \ldots, k(x_N, x))^T$.
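A NumPy sketch of NIKDA (the rank tolerance and function name are our own choices):

```python
import numpy as np

def nikda(Kb, Kw, tol=1e-10):
    """Sketch of NIKDA: project onto null(K_w), then keep the eigenvectors of the
    projected between-class matrix that belong to its nonzero eigenvalues."""
    evals_w, Uw = np.linalg.eigh(Kw)
    Uw2 = Uw[:, evals_w <= tol * evals_w.max()]   # basis of null(K_w)
    P = Uw2 @ Uw2.T                               # projector U_w2 U_w2^T
    evals_b, Ub = np.linalg.eigh(P @ Kb @ P)
    s2 = int(np.sum(evals_b > tol * max(evals_b.max(), 1.0)))
    Ub1 = Ub[:, ::-1][:, :s2]                     # eigenvectors of the s2 nonzero eigenvalues
    return P @ Ub1                                # G_e = U_w2 U_w2^T U_b1
```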

2.5 Range Space IKDA

Similarly to NIKDA, we present a method which transforms the feature space by using a basis of $\mathrm{range}(K_w)$; we denote this method by RSIKDA. Let the EVD of $K_w \in \mathbb{R}^{N \times N}$ be

$$K_w = U_w \Sigma_w U_w^T = [U_{w1}\ U_{w2}]\begin{bmatrix}\Sigma_{w1} & 0\\ 0 & 0\end{bmatrix}\begin{bmatrix}U_{w1}^T\\ U_{w2}^T\end{bmatrix},$$

where $s_1 = \mathrm{rank}(K_w)$, $\Sigma_{w1}$ is a diagonal matrix and $U_{w1} \in \mathbb{R}^{N \times s_1}$. It is easy to show that $\mathrm{range}(K_w) = \mathrm{span}(U_{w1})$ and that the transformation $V_y = U_{w1}\Sigma_{w1}^{-1/2}$ projects the data in the feature space $\mathcal{K}$ onto $\mathrm{range}(K_w)$. The within-class scatter matrix in the transformed space is $\tilde{K}_w = V_y^T K_w V_y = I_{s_1}$. Let the EVD of $\tilde{K}_b \equiv V_y^T K_b V_y$ be

$$\tilde{K}_b = U_b \Sigma_b U_b^T = [U_{b1}\ U_{b2}]\begin{bmatrix}\Sigma_{b1} & 0\\ 0 & 0\end{bmatrix}\begin{bmatrix}U_{b1}^T\\ U_{b2}^T\end{bmatrix},$$

where $s_3 = \mathrm{rank}(\tilde{K}_b)$, $\Sigma_{b1}$ is a diagonal matrix and $U_{b1} \in \mathbb{R}^{s_1 \times s_3}$. In RSIKDA, the optimal transformation matrix $G_y$ is obtained as

$$G_y = V_y U_{b1} = U_{w1}\Sigma_{w1}^{-1/2} U_{b1}.$$

Hence, for any input vector $x$, its low-dimensional feature representation $z$ is given by $z = G_y^T (k(x_1, x), \ldots, k(x_N, x))^T$.
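And a corresponding sketch of RSIKDA (again with our own rank tolerance and naming):

```python
import numpy as np

def rsikda(Kb, Kw, tol=1e-10):
    """Sketch of RSIKDA: whiten K_w on its range, then take the eigenvectors of the
    transformed between-class matrix that belong to its nonzero eigenvalues."""
    evals_w, Uw = np.linalg.eigh(Kw)
    keep = evals_w > tol * evals_w.max()
    Uw1, Sw1 = Uw[:, keep], evals_w[keep]
    Vy = Uw1 / np.sqrt(Sw1)                       # V_y = U_w1 Sigma_w1^{-1/2}
    evals_b, Ub = np.linalg.eigh(Vy.T @ Kb @ Vy)
    s3 = int(np.sum(evals_b > tol * max(evals_b.max(), 1.0)))
    Ub1 = Ub[:, ::-1][:, :s3]
    return Vy @ Ub1                               # G_y = U_w1 Sigma_w1^{-1/2} U_b1
```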

3 Weighted Version and Weighting Functions

3.1 Weighted Indefinite Kernel Discriminant Analysis

Similar to the case of LDA-based methods, IKDA-based algorithms fail to account for the different contribution of each pair of classes to the discrimination. If two class means are far away from each other, meaning that the classes are well separated, then their contribution to the discrimination task is minor. However, if two class means are close together, meaning that the classes are not well separated, then finding discriminant vectors that can better separate them is important for improving the discriminant performance. In order to control the contribution of each pair of classes to the discrimination, a common approach is to incorporate a weighting function into the criterion by using a weighted between-class scatter matrix in place of the ordinary between-class scatter matrix. We can rewrite the Hilbert between-class scatter matrix as

$$S_b^{\mathcal{H}} = \Phi \sum_{i=1}^{r-1}\sum_{j=i+1}^{r} \frac{N_i N_j}{N}(c_i - c_j)(c_i - c_j)^T \Phi^T \mathcal{J}.$$

Then, we can define the weighted between-class scatter matrix in $\mathcal{H}$ as follows:

$$S_B^{\mathcal{H}} = \Phi \sum_{i=1}^{r-1}\sum_{j=i+1}^{r} \frac{N_i N_j}{N} w(d_{ij})(c_i - c_j)(c_i - c_j)^T \Phi^T \mathcal{J}, \qquad (8)$$

where $w(d_{ij})$ is a weighting function, which is a monotonically decreasing function of the pseudo-Euclidean distance $d_{ij} = \|m_i^{\Phi} - m_j^{\Phi}\|^4$. Apparently, the weighted Hilbert between-class scatter matrix $S_B^{\mathcal{H}}$ degenerates to the Hilbert between-class scatter matrix $S_b^{\mathcal{H}}$ if the weighting function in (8) gives a constant weight value. In addition, it is clear that $d_{ij}$ can be calculated by the kernel trick as follows:

$$d_{ij} = \bigl\|m_i^{\Phi} - m_j^{\Phi}\bigr\|^4 = \Bigl[\bigl(m_i^{\Phi} - m_j^{\Phi}\bigr)^{*}\bigl(m_i^{\Phi} - m_j^{\Phi}\bigr)\Bigr]^2 = \Bigl[\bigl(c_i - c_j\bigr)^T K \bigl(c_i - c_j\bigr)\Bigr]^2.$$

Based on the definition of $S_B^{\mathcal{H}}$ in Eq. (8), we define a new Fisher criterion as

$$J_1(w) = \frac{w^T \mathcal{J} S_B^{\mathcal{H}} \mathcal{J} w}{w^T \mathcal{J} S_w^{\mathcal{H}} \mathcal{J} w}. \qquad (9)$$

Again, we can express the solution as $w = \sum_{i=1}^{N} \alpha_i \psi(x_i)$ and hence rewrite the Fisher criterion in Eq. (9) as

$$J_1(\alpha) = \frac{\alpha^T K_B \alpha}{\alpha^T K_w \alpha},$$

where $\alpha = (\alpha_1, \ldots, \alpha_N)^T$ and

$$K_B = K \sum_{i=1}^{r-1}\sum_{j=i+1}^{r} \frac{N_i N_j}{N} w(d_{ij})(c_i - c_j)(c_i - c_j)^T K = \sum_{i=1}^{r-1}\sum_{j=i+1}^{r} \frac{N_i N_j}{N} w(d_{ij}) \bigl[K^T(c_i - c_j)\bigr]\bigl[K^T(c_i - c_j)\bigr]^T.$$

The solution to Eq. (9) is thus given by the $m$ leading eigenvectors $\alpha_1, \ldots, \alpha_m$ of the matrix $K_w^{-1} K_B$. For any input vector $x$, its low-dimensional feature representation $z = (z_1, \ldots, z_m)^T$ can then be obtained as

$$z = (\alpha_1, \ldots, \alpha_m)^T (k(x_1, x), \ldots, k(x_N, x))^T.$$

If the Hilbert between-class scatter matrix $S_b^{\mathcal{H}}$ is replaced by the weighted between-class scatter matrix $S_B^{\mathcal{H}}$, then by means of the algorithms obtained in Sect. 2 we obtain the weighted generalized IKDA methods, which we call weighted IKDA/GSVD, weighted PIKDA, weighted NIKDA and weighted RSIKDA.
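As an illustration of how the weighting enters, the following sketch (our own; `weight_fn` stands for one of the weighting functions of Sect. 3.2) builds $K_B$ directly from the kernel matrix and the class labels, computing $d_{ij}$ by the kernel trick above:

```python
import numpy as np

def weighted_between_class(K, labels, weight_fn):
    """Sketch of the weighted between-class matrix K_B of Eq. (8), with
    d_ij computed by the kernel trick d_ij = [(c_i - c_j)^T K (c_i - c_j)]^2."""
    labels = np.asarray(labels)
    N = K.shape[0]
    classes = np.unique(labels)
    indicators = []
    for cls in classes:
        idx = np.where(labels == cls)[0]
        ci = np.zeros(N)
        ci[idx] = 1.0 / len(idx)                  # class indicator vector c_i
        indicators.append((ci, len(idx)))
    KB = np.zeros((N, N))
    for a in range(len(classes)):
        for b in range(a + 1, len(classes)):
            (ci, Ni), (cj, Nj) = indicators[a], indicators[b]
            diff = ci - cj
            dij = float(diff @ K @ diff) ** 2     # pseudo-Euclidean distance d_ij
            v = K.T @ diff
            KB += (Ni * Nj / N) * weight_fn(dij) * np.outer(v, v)
    return KB
```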

3.2 Weighting Functions

We can see from [4,13,14] that weighting functions are closely related to classification accuracy: different weighting functions produce different classification errors for the weighted generalized IKDA methods, and selecting a suitable weighting function can increase classification accuracy. In this paper, we consider four weighted generalized IKDA methods with five weighting functions for each weighted scheme. We apply the pseudo-Euclidean distance $d_{ij} = \|m_i^{\Phi} - m_j^{\Phi}\|^4$ between the means of classes $i$ and $j$ in the weighting functions $w(d_{ij})$. A weighting function is generally a monotonically decreasing function because classes that are closer to one another are more likely to be confused and should be given a greater weight.

According to the fractional-step LDA procedure in [13], the weighting function should drop faster than the pseudo-Euclidean distance between the class means of $\Phi_i$ and $\Phi_j$ in $\mathcal{K}$. We first apply two special cases of the weighting function $w(d_{ij}) = (d_{ij})^{-p}$ proposed by Lotlikar and Kothari [13], with $p = 1$ and $p = 2$, and then an improved weighting function $w(d_{ij}) = \frac{1}{2 d_{ij}^2}\,\mathrm{erf}\!\left(\frac{d_{ij}}{2\sqrt{2}}\right)$ presented by Loog et al. [4], where the Mahalanobis distance is replaced by the Euclidean distance. In addition, following the characteristics of the weighting functions mentioned above, we present two new weighting functions. The five weighting functions are listed below:

$w_1$: $w(d_{ij}) = (d_{ij})^{-2}$,
$w_2$: $w(d_{ij}) = \frac{1}{2 d_{ij}^2}\,\mathrm{erf}\!\left(\frac{d_{ij}}{2\sqrt{2}}\right)$,
$w_3$: $w(d_{ij}) = (d_{ij})^{-1}$,
$w_4$: $w(d_{ij}) = e^{1/d_{ij}}$,
$w_5$: $w(d_{ij}) = 1/e^{d_{ij}}$.
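A direct transcription of these five functions (the dictionary layout is merely our own convenience):

```python
import math

# The five weighting functions of Sect. 3.2, as functions of d_ij.
WEIGHTING_FUNCTIONS = {
    "w1": lambda d: d ** -2.0,
    "w2": lambda d: math.erf(d / (2.0 * math.sqrt(2.0))) / (2.0 * d ** 2),
    "w3": lambda d: d ** -1.0,
    "w4": lambda d: math.exp(1.0 / d),
    "w5": lambda d: math.exp(-d),
}

# Example: KB = weighted_between_class(K, labels, WEIGHTING_FUNCTIONS["w2"])
```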

4 Experiments

In this section, in order to demonstrate the effectiveness of the proposed methods and to illustrate the effect of the weights on indefinite kernel functions, we conduct a series of experiments on two different data sets and compare three indefinite kernel functions with a positive definite kernel function. Specifically, the two data sets are the Dermatology database taken from the UCI Machine Learning Repository [15] and the Yale database [16]. The indefinite kernel functions are the multiquadric kernel, the Epanechnikov kernel and the Gaussian combination kernel [17]:

$$k_{Mul}(x, y) = \sqrt{\frac{\|x - y\|^2}{\sigma_m} + c^2}, \quad \sigma_m, c \in \mathbb{R},$$

$$k_{Epan}(x, y) = \left(1 - \frac{\|x - y\|^2}{\sigma_e}\right)^d, \quad \frac{\|x - y\|^2}{\sigma_e} \le 1,$$

$$k_{Gc}(x, y) = \exp\left(-\|x - y\|^2/\sigma_1^2\right) - \exp\left(-\|x - y\|^2/\sigma_2^2\right), \quad \sigma_1, \sigma_2 \in \mathbb{R},$$

and the positive definite kernel function is the Gaussian RBF kernel [18]

$$k_{RBF}(x, y) = \exp\left(-\|x - y\|^2/\sigma^2\right), \quad \sigma \in \mathbb{R}.$$
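These kernels might be implemented as follows (a sketch; the default parameter values are arbitrary placeholders rather than the settings used in the experiments below, and the resulting functions can be fed to the `indefinite_gram` sketch of Sect. 2):

```python
import numpy as np

def multiquadric(x, y, sigma_m=1.0, c=1.0):
    """Multiquadric kernel (indefinite)."""
    return np.sqrt(np.sum((x - y) ** 2) / sigma_m + c ** 2)

def epanechnikov(x, y, sigma_e=200.0, d=1):
    """Epanechnikov kernel (indefinite); assumes ||x - y||^2 / sigma_e <= 1."""
    return (1.0 - np.sum((x - y) ** 2) / sigma_e) ** d

def gaussian_combination(x, y, sigma1_sq=1e4, sigma2_sq=1e2):
    """Gaussian combination kernel (indefinite)."""
    sq = np.sum((x - y) ** 2)
    return np.exp(-sq / sigma1_sq) - np.exp(-sq / sigma2_sq)

def gaussian_rbf(x, y, sigma_sq=1e2):
    """Gaussian RBF kernel (positive definite baseline)."""
    return np.exp(-np.sum((x - y) ** 2) / sigma_sq)
```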

In our experiments, the KNN classifier with K = 5 is used for all data sets. Moreover, we randomly generate five matrices $M$ ($M_1, \ldots, M_5$) and compute the misclassification rates using the corresponding optimal transformation matrices produced by weighted PIKDA for each data set. For each method, we randomly split the data into training and test sets of equal size and repeat this 10 times to obtain the mean prediction misclassification rate.
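A simplified harness for this protocol might look as follows (a sketch only: it assumes the reduced representation of all samples has already been computed, whereas in the actual protocol the transformation would be learned on each training split and the test points projected via $z = G^T(k(x_1,x),\ldots,k(x_N,x))^T$; the use of scikit-learn's KNN classifier is our own choice):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def mean_error_rate(Z, labels, n_splits=10, seed=0):
    """Ten random 50/50 splits, 5-NN classification in the reduced space Z
    (one row per sample); returns the mean misclassification rate in percent."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    errors = []
    for _ in range(n_splits):
        perm = rng.permutation(len(labels))
        half = len(labels) // 2
        train, test = perm[:half], perm[half:]
        clf = KNeighborsClassifier(n_neighbors=5).fit(Z[train], labels[train])
        errors.append(100.0 * np.mean(clf.predict(Z[test]) != labels[test]))
    return float(np.mean(errors))
```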

4.1 Dermatology Data

This data set contains 34 attributes, 33 of which are linear-valued and one of which is nominal. It includes six classes: psoriasis, seboreic dermatitis, lichen planus, pityriasis rosea, chronic dermatitis and pityriasis rubra pilaris. There are 366 instances, and we choose the 358 samples that do not contain missing values ("?") for our experiments.

In our experiment, the three indefinite kernel functions

$$k_{Mul}(x, y) = \sqrt{\|x - y\|^2 + 1},$$

$$k_{Epan}(x, y) = 1 - \frac{\|x - y\|^2}{200},$$

$$k_{Gc}(x, y) = \exp\left(-\|x - y\|^2/108\right) - \exp\left(-\|x - y\|^2/104\right)$$

are used, respectively. The positive definite kernel function is

$$k_{RBF}(x, y) = \exp\left(-\|x - y\|^2/102\right).$$

Table 1 Misclassification rate (%) on data set Dermatology

Kernel  w(d_ij)  W-IKDA/GSVD   W-PIKDA                                W-NIKDA  W-RSIKDA
                               M1     M2     M3     M4     M5
Rbf     w0       5.22          8.83   10.33  7.89   7.89   7.28       5.61     6.56
        w1       5.50          6.33   7.56   10.56  8.17   8.11       5.89     6.72
        w2       5.50          6.50   7.50   9.33   6.94   7.94       5.67     8.11
        w3       5.50          6.50   7.50   9.33   6.94   7.94       5.67     6.56
        w4       5.39          6.11   7.33   9.72   9.11   7.17       5.83     8.33
        w5       5.56          7.00   7.83   9.94   7.39   7.78       5.78     6.72
Mul     w0       2.78          2.94   3.06   3.00   3.06   3.06       2.83     3.22
        w1       2.83          2.83   2.89   3.22   2.94   2.89       3.44     3.67
        w2       2.89          2.83   2.83   3.22   2.94   2.89       3.22     4.00
        w3       2.94          2.78   2.67   2.83   3.17   2.83       2.89     3.28
        w4       2.89          2.89   2.67   2.94   2.94   2.89       2.83     3.28
        w5       2.67          2.83   3.00   3.11   3.22   2.89       3.17     4.17
Epan    w0       4.33          4.67   4.94   3.83   4.56   4.17       20.11    4.61
        w1       4.39          4.67   4.56   4.83   4.50   4.17       20.89    5.06
        w2       4.39          5.11   5.39   3.94   4.56   4.11       19.33    4.61
        w3       4.39          5.11   5.39   3.94   4.56   4.11       20.89    4.11
        w4       -             -      -      -      -      -          -        -
        w5       4.28          4.56   4.89   4.44   4.72   4.28       20.28    4.33
Gc      w0       6.28          9.72   8.22   8.61   8.89   8.11       6.33     28.89
        w1       6.06          6.44   7.00   10.72  7.72   7.67       10.11    35.89
        w2       6.22          6.94   7.44   11.67  7.11   7.44       6.50     28.56
        w3       6.22          6.94   7.44   11.67  7.11   7.44       6.44     27.78
        w4       -             -      -      -      -      -          -        -
        w5       6.44          11.33  9.50   7.50   11.00  8.06       6.50     28.11

(M1–M5 denote the five randomly generated matrices M used in weighted PIKDA.)

The average misclassification rates are listed in Table 1, where w0 indicates that no weighting function is introduced.

From Table 1, we can see that the indefinite kernels are superior to the positive definite kernel. For M2 and the multiquadric kernel under weighting functions w3 and w4, the mean prediction error rate reaches its minimum of 2.67 %; the corresponding indefinite kernel and weighting functions are optimal for weighted PIKDA under the matrix M2.

According to Table 1, we draw the following conclusions: (1) the results of the positive definite kernel are poor compared with the indefinite kernels, especially the results of weighted PIKDA for the Gaussian RBF kernel versus the multiquadric kernel. (2) The weighting functions greatly affect the misclassification rate under the same weighted dimension reduction method; for the Gaussian combination kernel, the weighting functions w2 and w3 produce good overall results, and the weighting function w1 produces a classification accuracy 3.2778 % higher than w0 for M1.

4.2 Face Recognition

In this section, we use the Yale face image database to illustrate our experiments. The Yale face database contains 165 face images of 15 individuals. There are 11 images per subject, taken under the following facial expressions or configurations: center-light, wearing glasses, happy, left-light, wearing no glasses, normal, right-light, sad, sleepy, surprised, and wink. In our experiment, the images are cropped to a size of 32 × 32, and the gray-level values of all images are rescaled to [0, 1]. Some images of one person are shown in Fig. 1.

Fig. 1 Images of one person in Yale

Table 2 Misclassification rate (%) on data set Yale

Kernel  w(d_ij)  W-IKDA/GSVD   W-PIKDA                                W-NIKDA  W-RSIKDA
                               M1     M2     M3     M4     M5
Rbf     w0       11.89         11.67  12.22  11.89  12.89  13.44      93.33    12.67
        w1       12.56         12.11  12.78  12.44  13.89  12.22      93.33    12.89
        w2       12.22         12.56  11.00  12.78  11.67  12.00      93.33    12.11
        w3       12.44         12.56  11.00  13.33  12.00  11.89      93.33    11.89
        w4       -             -      -      -      -      -          -        -
        w5       12.89         11.67  11.22  11.78  12.33  13.33      93.33    12.11
Mul     w0       12.00         10.44  11.33  11.11  11.56  10.11      11.44    14.33
        w1       11.78         10.56  11.56  10.67  10.00  10.78      13.11    17.89
        w2       11.78         10.56  11.56  10.67  10.00  10.78      13.11    17.89
        w3       11.11         11.56  10.56  11.22  11.00  11.56      13.00    16.56
        w4       11.11         11.22  11.22  11.33  10.89  11.67      11.44    14.56
        w5       -             -      -      -      -      -          -        -
Epan    w0       9.22          10.56  10.56  10.89  9.11   9.89       10.11    14.44
        w1       12.22         10.67  11.22  9.22   9.44   8.56       12.67    14.22
        w2       12.56         10.78  10.00  10.11  9.89   9.44       11.44    14.89
        w3       12.56         10.56  10.11  9.56   10.33  9.56       11.44    14.78
        w4       -             -      -      -      -      -          -        -
        w5       10.22         9.89   8.44   8.67   9.78   10.56      10.00    13.22
Gc      w0       9.67          11.44  10.44  10.67  10.44  12.00      10.44    13.11
        w1       11.56         11.22  10.33  10.22  10.00  10.33      12.33    16.33
        w2       11.33         10.78  11.33  9.78   10.11  11.67      12.89    14.78
        w3       11.33         10.78  11.33  9.78   10.11  11.67      12.89    14.78
        w4       -             -      -      -      -      -          -        -
        w5       11.56         11.11  10.11  10.56  9.89   10.00      10.44    13.33

(M1–M5 denote the five randomly generated matrices M used in weighted PIKDA.)

In this experiment, the three indefinite kernel functions

$$k_{Mul}(x, y) = \sqrt{10\|x - y\|^2 + 8100},$$

$$k_{Epan}(x, y) = 1 - \frac{\|x - y\|^2}{1583.5},$$

$$k_{Gc}(x, y) = \exp\left(-\|x - y\|^2/104\right) - \exp\left(-\|x - y\|^2/732\right)$$

are used. The Gaussian RBF kernel is

$$k_{RBF}(x, y) = \exp\left(-\|x - y\|^2/1016\right).$$

The main results are given in Table 2. From Table 2, we can see that the indefinite kernel functions improve the classification performance significantly; in particular, weighted NIKDA produces a misclassification rate of 93.33 % with the Gaussian RBF kernel, whereas with the indefinite kernels the misclassification rate decreases to about 10 %.

According to Table 2, we draw the following conclusions: (1) for the same dimension reduction method, the indefinite kernels have a great influence on the prediction error rate. (2) For the same dimension reduction method, the weighting functions affect the performance of the weighted generalized IKDA methods; most of the error rates of the unweighted generalized IKDA methods are higher than those of the weighted generalized IKDA algorithms. The weighted generalized IKDA methods can successfully control the contribution of each pair of classes to the discrimination and hence can improve the accuracy significantly if suitable weighting functions are chosen.

5 Conclusion

In this paper, based on pseudo-inverse IKDA, IKDA/GSVD, null space IKDA and range space IKDA, we propose weighted pseudo-inverse IKDA, weighted IKDA/GSVD, weighted null space IKDA and weighted range space IKDA. Not only can these methods deal with the singularity problem caused by undersampled data, they also improve the IKDA methods by successfully controlling the contribution of each pair of classes to the discrimination. In order to demonstrate the effectiveness of the proposed methods, we conduct a series of experiments on two different data sets, from the UCI Machine Learning Repository and the Yale face database. The indefinite kernels showed competitive performance throughout the experiments. The results also show that different kernel functions and different weighting functions affect the classification accuracy of the proposed methods. Moreover, the weighting functions can affect how the kernel functions influence the classification results: some can increase the classification accuracy by 1–2.2222 %, while some can decrease it by 7 %. Extensive experiments on the two real-world data sets show that the weighted IKDA methods outperform the IKDA methods. The analysis of weighting functions presented in this paper may be useful in understanding why these weighted algorithms perform well. We plan to explore this further in the future.

Acknowledgments This work is supported by the National Natural Science Foundation of China (10871226), the Natural Science Foundation of Shandong Province (ZR2009AL006) and the Young and Middle-Aged Scientists Research Foundation of Shandong Province (BS2010SF004), P.R. China.

References

1. Yang J, Jin Z, Yang J, Zhang D (2004) The essence of kernel Fisher discriminant: KPCA plus LDA. Pattern Recognit 37:2097–2100
2. Yang J, Frangi AF, Yang JY, Zhang D (2005) KPCA plus LDA: a complete kernel Fisher discriminant framework for feature extraction and recognition. IEEE Trans Pattern Anal Mach Intell 27:230–244
3. Park CH, Park H (2005) Nonlinear discriminant analysis using kernel functions and the generalized singular value decomposition. SIAM J Matrix Anal Appl 27(1):87–102
4. Loog M, Duin RPW, Haeb-Umbach R (2001) Multiclass linear dimension reduction by weighted pairwise Fisher criteria. IEEE Trans Pattern Anal Mach Intell 23(7):762–766
5. Duda RO, Hart PE, Stork D (2000) Pattern classification. Wiley, New York
6. Goldfarb L (1985) A new approach to pattern recognition. In: Kanal L, Rosenfeld A (eds) Progress in pattern recognition, vol 2. Elsevier, New York, pp 241–402
7. Pekalska E, Duin R (2005) The dissimilarity representation for pattern recognition. Foundations and applications. World Scientific, Singapore
8. Bognar J (1974) Indefinite inner product spaces. Springer, New York
9. Pekalska E, Haasdonk B (2009) Kernel discriminant analysis for positive definite and indefinite kernels. IEEE Trans Pattern Anal Mach Intell 31(6):1017–1031
10. Howland P, Jeon M, Park H (2003) Structure preserving dimension reduction for clustered text data based on the generalized singular value decomposition. SIAM J Matrix Anal Appl 25(1):165–179
11. Park CH, Park H (2008) A comparison of generalized linear discriminant analysis algorithms. Pattern Recognit 41:1083–1097
12. Chen L, Liao HM, Ko M, Lin J, Yu G (2000) A new LDA-based face recognition system which can solve the small sample size problem. Pattern Recognit 33:1713–1726
13. Lotlikar R, Kothari R (2000) Fractional-step dimensionality reduction. IEEE Trans Pattern Anal Mach Intell 22(6):623–627
14. Liang Y et al (2007) Uncorrelated linear discriminant analysis based on weighted pairwise Fisher criterion. Pattern Recognit 40:3606–3615
15. Asuncion A, Newman D (2007) UCI machine learning repository. School of Information and Computer Science, University of California, Irvine. http://www.ics.uci.edu/~mlearn/MLRepository.html. Accessed Sep 2012
16. Qiao L, Chen S, Tan X (2010) Sparsity preserving projections with applications to face recognition. Pattern Recognit 43(1):331–341
17. Mierswa I, Morik K (2008) About the non-convex optimization problem induced by non-positive semidefinite kernel learning. Adv Data Anal Classif 2:241–258
18. Dai G, Yeung DY, Qian YT (2007) Face recognition using a kernel fractional-step discriminant analysis algorithm. Pattern Recognit 40:229–243
