A Kernel Approach for Learning From Almost Orthogonal Patterns*
CIS 525 Class Presentation
Professor: Slobodan Vucetic
Presenter: Yilian Qin
* B. Schölkopf et al., Proc. 13th ECML, Aug 19-23, 2002, pp. 511-528.
Presentation Outline
- Introduction
  - Motivation
  - A brief review of SVM for linearly separable patterns
  - Kernel approach for SVM
  - Empirical kernel map
- Problem: almost orthogonal patterns in the feature space
  - An example
  - Situations leading to almost orthogonal patterns
- Methods to reduce large diagonals of the Gram matrix
  - Gram matrix transformation
  - An approximate approach based on statistics
- Experiments
  - Artificial data (string classification, microarray data with noise, hidden variable problem)
  - Real data (thrombin binding, lymphoma classification, protein family classification)
- Conclusions
- Comments
Introduction

Motivation
- Support vector machine (SVM): a powerful method for classification (or regression), with accuracy comparable to neural networks
- Exploits kernel functions to separate patterns in a high-dimensional feature space
- The information of the training data is stored in the Gram matrix (kernel matrix)

The problem: SVM does not perform well if the Gram matrix has large diagonal values.
A Brief Review of SVM

For linearly separable patterns, to maximize the margin:

Minimize: ||w||^2
Constraints: y_i (w^T x_i + b) >= 1

[Figure: separating hyperplane between the two classes (y_i = +1 and y_i = -1); the margin, 2/||w||, depends only on the closest points.]
For linearly non-separable patterns: nonlinear mapping function Φ(x): X → H
- Maps the patterns to a new feature space H of higher dimension (for example, the XOR problem)
- The SVM is then trained in the new feature space H

The kernel trick: solving the above minimization problem requires
1) the explicit form of Φ, and
2) inner products in the high-dimensional space H.

Simplification by a wise selection of kernel functions with the property:
k(x_i, x_j) = Φ(x_i) · Φ(x_j)
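As a concrete illustration of this property, the degree-2 polynomial kernel on the XOR-style inputs mentioned above equals the inner product under an explicit feature map. A minimal numpy sketch (the data and the particular feature map are my choices for illustration):

```python
import numpy as np

def poly_kernel(X, Y, d=2):
    """Polynomial kernel k(x, y) = (x . y)^d, computed for all pairs."""
    return (X @ Y.T) ** d

X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])  # XOR-style inputs
K = poly_kernel(X, X)

# Explicit degree-2 feature map Φ(x) = (x1^2, x2^2, sqrt(2) x1 x2)
def phi(x):
    return np.array([x[0]**2, x[1]**2, np.sqrt(2) * x[0] * x[1]])

Phi = np.array([phi(x) for x in X])
# k(xi, xj) = Φ(xi) · Φ(xj): the kernel avoids computing Φ explicitly
assert np.allclose(K, Phi @ Phi.T)
```

The kernel evaluates the inner product in H without ever forming Φ(x), which is what makes very high-dimensional feature spaces tractable.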
Kernel Approach for SVM (1/3)

Minimize: ||w||^2
Constraints: y_i (w^T Φ(x_i) + b) >= 1

Transform the problem with the kernel method:
- Expand w in the new feature space: w = Σ_i a_i Φ(x_i) = [Φ(x)] a, where [Φ(x)] = [Φ(x_1), Φ(x_2), …, Φ(x_m)] and a = [a_1, a_2, …, a_m]^T
- Gram matrix: K = [K_ij], where K_ij = Φ(x_i) · Φ(x_j) = k(x_i, x_j) (symmetric!)
- The (squared) objective function: ||w||^2 = a^T [Φ(x)]^T [Φ(x)] a = a^T K a (sufficient condition for the existence of an optimal solution: K is positive definite)
- The constraints: y_i {w^T Φ(x_i) + b} = y_i {a^T [Φ(x)]^T Φ(x_i) + b} = y_i {a^T K_i + b} >= 1, where K_i is the i-th column of K
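The identity ||w||^2 = a^T K a can be checked numerically. A small sketch with made-up feature vectors and an arbitrary coefficient vector a:

```python
import numpy as np

rng = np.random.default_rng(0)
Phi = rng.normal(size=(5, 3))          # rows: Φ(x_1), ..., Φ(x_5) in feature space
a = rng.normal(size=5)                 # expansion coefficients
w = Phi.T @ a                          # w = Σ_i a_i Φ(x_i)
K = Phi @ Phi.T                        # Gram matrix K_ij = Φ(x_i) · Φ(x_j)

assert np.allclose(w @ w, a @ K @ a)   # ||w||^2 = a^T K a
assert np.allclose(K, K.T)             # K is symmetric
```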
Kernel Approach for SVM (2/3)

Minimize: a^T K a
Constraints: y_i (a^T K_i + b) >= 1

To predict new data with a trained SVM, the explicit form of k(x_i, x_j) is required.
Kernel Approach for SVM (3/3)

f(x_test) = w^T Φ(x_test) + b
          = a^T [Φ(x_1), Φ(x_2), …, Φ(x_m)]^T Φ(x_test) + b
          = a^T [k(x_1, x_test), k(x_2, x_test), …, k(x_m, x_test)]^T + b

where a and b are the optimal solution based on the training data, and m is the number of training data.
Empirical Kernel Mapping

Assumption: m (the number of instances) is a sufficiently high dimension for the new feature space, i.e., the patterns will be linearly separable in the m-dimensional space R^m.

Empirical kernel map: Φ_m(x_i) = [k(x_i, x_1), k(x_i, x_2), …, k(x_i, x_m)]^T = K_i

The SVM in R^m:

Minimize: ||w||^2
Constraints: y_i (w^T Φ_m(x_i) + b) >= 1

The new Gram matrix K_m associated with Φ_m(x):
K_m = [K_m,ij], where K_m,ij = Φ_m(x_i) · Φ_m(x_j) = K_i · K_j = K_i^T K_j, i.e., K_m = K^T K = K K^T

Advantage of the empirical kernel map: K_m is positive definite:
K_m = K K^T = (U^T D U)(U^T D U)^T = U^T D^2 U (K is symmetric, U is unitary, D is diagonal),
which satisfies the sufficient condition of the above minimization problem.
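The empirical kernel map and the identity K_m = K K^T can be sketched as follows (synthetic data, linear kernel chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(6, 4))

def k(x, y):
    return float(x @ y)   # linear kernel, for illustration only

m = len(X)
K = np.array([[k(X[i], X[j]) for j in range(m)] for i in range(m)])

# Empirical kernel map: Φ_m(x_i) = [k(x_i, x_1), ..., k(x_i, x_m)]^T = K_i,
# so the matrix whose rows are Φ_m(x_i) is just K itself
Phi_m = K.copy()
Km = Phi_m @ Phi_m.T                              # K_m = K K^T (= K^T K, since K is symmetric)

assert np.allclose(Km, K @ K)
assert np.min(np.linalg.eigvalsh(Km)) >= -1e-9    # K_m is positive semidefinite
```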
The Problem: Almost Orthogonal Patterns in the Feature Space Result in Poor Performance

An Example of Almost Orthogonal Patterns

The training dataset with almost orthogonal patterns:

X =
[ 9 0 0 0 0 0 0 0 0 0 ]
[ 0 0 0 8 0 0 0 0 0 0 ]
[ 0 0 0 0 0 0 0 9 0 0 ]
[ 0 0 9 0 0 0 0 0 0 1 ]
[ 0 0 0 0 8 0 0 0 0 1 ]
[ 0 0 0 0 0 0 9 0 0 1 ]

Y = [ 1, 1, 1, -1, -1, -1 ]^T

The Gram matrix with the linear kernel k(x_i, x_j) = x_i · x_j:

K =
[ 81  0  0  0  0  0 ]
[  0 64  0  0  0  0 ]
[  0  0 81  0  0  0 ]
[  0  0  0 82  1  1 ]
[  0  0  0  1 65  1 ]
[  0  0  0  1  1 82 ]

The solution with a standard SVM:
w^T = (0.11, 0, -0.11, 0.12, -0.12, 0, -0.11, 0.11, 0, -0.04), b = 0.02

Observation: each large entry in w corresponds to a column of X with only one large entry; w becomes a lookup table, and the SVM will not generalize well.

A better solution:
w^T = (0, 0, 0, 0, 0, 0, 0, 0, 0, -2), b = 1
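The example above can be reproduced numerically; a small sketch checking that both weight vectors separate the training data, with X, Y, and the two solutions transcribed from the slide:

```python
import numpy as np

X = np.array([
    [9, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 8, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 9, 0, 0],
    [0, 0, 9, 0, 0, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 8, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 0, 9, 0, 0, 1],
], dtype=float)
Y = np.array([1, 1, 1, -1, -1, -1], dtype=float)

K = X @ X.T                   # linear-kernel Gram matrix: diagonal dominates
print(np.diag(K))             # 81, 64, 81, 82, 65, 82

# "Lookup table" SVM solution vs. the simpler, better-generalizing one
w_svm = np.array([0.11, 0, -0.11, 0.12, -0.12, 0, -0.11, 0.11, 0, -0.04]); b_svm = 0.02
w_good = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, -2.0]); b_good = 1.0

assert np.all(Y * (X @ w_svm + b_svm) > 0)    # both separate the training data...
assert np.all(Y * (X @ w_good + b_good) > 0)  # ...but w_good uses only the shared 10th feature
```

The standard solution spends one weight per large entry of X, while the better one exploits the single feature that actually distinguishes the classes.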
Situations Leading to Almost Orthogonal Patterns (Large Diagonals)

- Sparsity of the patterns in the feature space, e.g.
  x = [0, 0, 0, 1, 0, 0, 1, 0]^T
  y = [0, 1, 1, 0, 0, 0, 0, 0]^T
  Here x · x and y · y >> x · y (large diagonals in the Gram matrix).
- Some choices of kernel function may result in sparsity in the new feature space, e.g. the string kernel (Watkins 2000, et al.).
- Polynomial kernel k(x_i, x_j) = (x_i · x_j)^d with large order d: if x_i · x_i > x_i · x_j for i ≠ j, then k(x_i, x_i) >> k(x_i, x_j) even for moderately large d, because raising to the d-th power amplifies the gap.
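The amplification by the d-th power can be seen directly. A small sketch using the sparse vectors above plus a second, near-orthogonal pair (the latter is my own illustration):

```python
import numpy as np

x = np.array([0., 0., 0., 1., 0., 0., 1., 0.])
y = np.array([0., 1., 1., 0., 0., 0., 0., 0.])

d = 4                    # moderately large polynomial order
k_xx = (x @ x) ** d      # (2)^4 = 16
k_xy = (x @ y) ** d      # (0)^4 = 0: exactly orthogonal sparse vectors

# Near-orthogonal vectors: the diagonal term still dominates after the power
u = np.array([1.0, 0.1]); v = np.array([0.1, 1.0])
print((u @ u) ** d, (u @ v) ** d)   # ~1.04 vs ~0.0016
```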
Methods to Reduce the Large Diagonals of Gram Matrices

Gram Matrix Transformation (1/2)

For a symmetric, positive definite Gram matrix K (or K_m):
K = U^T D U, where U is a unitary matrix and D is a diagonal matrix.

Define f(K) = U^T f(D) U, with f(D) = diag(f(λ_1), f(λ_2), …, f(λ_m)), i.e., the function f operates on the eigenvalues λ_i of K. f(K) should preserve the positive definiteness of the Gram matrix.

A sample procedure for Gram matrix transformation:
- (Optional) Compute the positive definite matrix A = sqrt(K)
- Suppress the large diagonals of A, obtaining a symmetric A', i.e., transform the eigenvalues of A: [λ_min, λ_max] → [f(λ_min), f(λ_max)]
- Compute the positive definite matrix K' = (A')^2
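A spectral transformation of this kind can be sketched as follows, with the eigenvalue map f(λ) = sqrt(λ) used here to compute A = sqrt(K) (numpy's `eigh` writes the decomposition as K = U diag(λ) U^T):

```python
import numpy as np

def transform_gram(K, f):
    """Apply f to the eigenvalues of a symmetric PSD Gram matrix: f(K) = U f(D) U^T."""
    lam, U = np.linalg.eigh(K)        # columns of U are orthonormal eigenvectors
    lam = np.clip(lam, 0.0, None)     # guard against tiny negative round-off
    return (U * f(lam)) @ U.T         # U diag(f(lam)) U^T

rng = np.random.default_rng(2)
X = rng.normal(size=(5, 50))          # high-dimensional patterns
K = X @ X.T                           # linear-kernel Gram matrix, large diagonal

A = transform_gram(K, np.sqrt)        # A = sqrt(K)

# f preserves positive semidefiniteness, and A^2 recovers K
assert np.min(np.linalg.eigvalsh(A)) >= -1e-9
assert np.allclose(A @ A, K, atol=1e-6)
```

In the full procedure one would additionally flatten the largest eigenvalues of A before squaring, which is what suppresses the diagonal.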
Gram Matrix Transformation (2/2)

Effect of the matrix transformation (an implicit transformation of the kernel):

Φ(x) → K, with k(x_i, x_j) = Φ(x_i) · Φ(x_j)
K → K' = f(K), implicitly defining Φ'(x) with k'(x_i, x_j) = Φ'(x_i) · Φ'(x_j)

The explicit form of the new kernel function k' is not available, yet k' is required when the trained SVM is used to predict the testing data. A solution: include all test data in K before the matrix transformation K → K', i.e., the testing data has to be known at training time.
The prediction then becomes:

f(x_i) = w'^T Φ'(x_i) + b'
       = a'^T [Φ'(x_1), Φ'(x_2), …, Φ'(x_{m+n})]^T Φ'(x_i) + b'
       = a'^T K'_i + b',   i = 1, 2, …, m+n

where K' = f(K), m is the number of training data, and n is the number of testing data. The optimal a' and b' are obtained from the portion of K' corresponding to the training data. If x_i has been used in calculating K', the prediction on x_i can simply use the column K'_i.
An Approximate Approach Based on Statistics

With the test data included, the empirical kernel map Φ_{m+n}(x) should be used to calculate the Gram matrix. Assuming the dataset size r is large:

Φ_r(x) · Φ_r(x') = [k(x, x_1), k(x, x_2), …, k(x, x_r)] · [k(x', x_1), k(x', x_2), …, k(x', x_r)]
                 = Σ_{i=1}^r k(x, x_i) k(x', x_i)
                 ≈ r ∫_X k(x, x'') k(x', x'') dP(x'')

so that

(1/m) Φ_m(x) · Φ_m(x') ≈ (1/(m+n)) Φ_{m+n}(x) · Φ_{m+n}(x')

Therefore, the SVM can simply be trained with the empirical kernel map on the training set, Φ_m(x), instead of Φ_{m+n}(x).
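This approximation can be sanity-checked numerically with synthetic data; a sketch using an RBF kernel and a Gaussian sample standing in for P (both my choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)

def k(x, y):
    return np.exp(-np.sum((x - y) ** 2))   # RBF kernel, values in (0, 1]

data = rng.normal(size=(600, 2))           # i.i.d. sample standing in for P
m, n = 400, 200
x, xp = rng.normal(size=2), rng.normal(size=2)

def emp_map(x, pts):
    """Empirical kernel map: Φ_r(x) = [k(x, x_1), ..., k(x, x_r)]."""
    return np.array([k(x, p) for p in pts])

# Both sides estimate the same integral ∫ k(x, x'') k(x', x'') dP(x'')
lhs = emp_map(x, data[:m]) @ emp_map(xp, data[:m]) / m
rhs = emp_map(x, data) @ emp_map(xp, data) / (m + n)
print(lhs, rhs)
```

The two normalized inner products agree up to Monte Carlo error, which is why training on Φ_m alone is a reasonable substitute for Φ_{m+n}.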
Experiment Results

Artificial Data (1/3)

String classification:
- String kernel function (Watkins 2000, et al.)
- Sub-polynomial kernel k(x, y) = [Φ(x) · Φ(y)]^P, 0 < P < 1: for sufficiently small P, the large diagonals of K can be suppressed
- 50 strings (25 for training and 25 for testing), 20 trials
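The sub-polynomial rescaling can be sketched directly on a Gram matrix (illustrative sparse binary data, not the paper's strings):

```python
import numpy as np

rng = np.random.default_rng(4)
X = (rng.random((6, 200)) < 0.05).astype(float)   # sparse binary patterns
K = X @ X.T                                       # large diagonal, small off-diagonal

P = 0.3                                           # sub-polynomial exponent, 0 < P < 1
K_sub = K ** P                                    # k'(x, y) = [k(x, y)]^P, K >= 0 here

def dominance(M):
    """Rough measure: mean diagonal over mean off-diagonal entry."""
    off = M - np.diag(np.diag(M))
    return np.mean(np.diag(M)) / (np.mean(off) + 1e-12)

print(dominance(K), dominance(K_sub))   # diagonal dominance is reduced after rescaling
```

Large kernel values are compressed far more than small ones by the P-th power, so the diagonal shrinks relative to the rest of the matrix.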
Artificial Data (2/3)

Microarray data with noise (Alon et al., 1999):
- 62 instances (22 positive, 44 negative), 2000 features in the original data
- 10,000 noise features were added (each with 1% probability of being non-zero)
- The error rate for SVM without noise addition is 0.18 ± 0.15
Artificial Data (3/3)

Hidden variable problem:
- 10 hidden variables (attributes), plus 10 additional attributes that are nonlinear functions of the 10 hidden variables
- The original kernel is a polynomial kernel of order 4
Real Data (1/3)

Thrombin binding problem:
- 1909 instances, 139,351 binary features; 0.68% of the entries are non-zero
- 8-fold cross-validation
Real Data (2/3)

Lymphoma classification (Alizadeh et al., 2000):
- 96 samples, 4026 features, 10-fold cross-validation
- Improved results observed compared with previous work (Weston, 2001)
Real Data (3/3)

Protein family classification (Murzin et al., 1995):
- Small positive set, large negative set
- Performance measured by the rate of false positives and the receiver operating characteristic (ROC) score: 1 is the best score, 0 the worst
Conclusions
- The problem of degraded SVM performance due to almost orthogonal patterns was identified and analyzed
- The common situation in which sparse vectors lead to large diagonals was identified and discussed
- A method of Gram matrix transformation to suppress the large diagonals was proposed to improve performance in such cases
- Experiment results show improved accuracy on various artificial and real datasets when the large diagonals of the Gram matrices are suppressed
Comments

Strong points:
- The identification of the situations that lead to large diagonals in the Gram matrix, and the proposed Gram matrix transformation method for suppressing them
- The experiments are extensive

Weak points:
- The application of the Gram matrix transformation may be severely restricted in forecasting or other applications in which the testing data is not known at training time
- The proposed Gram matrix transformation method was not tested directly by experiments; instead, transformed kernel functions were used in the experiments
- Almost orthogonal patterns imply that multiple pattern vectors in the same direction rarely exist; therefore, the necessary condition for the statistical approach to the pattern distribution is not satisfied
End!