
Page 1: A Kernel Approach for Learning From Almost Orthogonal Patterns *

A Kernel Approach for Learning From Almost Orthogonal Patterns*

CIS 525 Class Presentation
Professor: Slobodan Vucetic
Presenter: Yilian Qin

* B. Scholkopf et al., Proc. 13th ECML, Aug 19-23, 2002, pp. 511-528.

Page 2: A Kernel Approach for Learning From Almost Orthogonal Patterns *

Presentation Outline

Introduction
- Motivation
- A brief review of SVM for linearly separable patterns
- Kernel approach for SVM
- Empirical kernel map

Problem: almost orthogonal patterns in feature space
- An example
- Situations leading to almost orthogonal patterns

Method to reduce large diagonals of the Gram matrix
- Gram matrix transformation
- An approximate approach based on statistics

Experiments
- Artificial data (string classification, microarray data with noise, hidden variable problem)
- Real data (thrombin binding, lymphoma classification, protein family classification)

Conclusions

Comments

Page 3: A Kernel Approach for Learning From Almost Orthogonal Patterns *

Introduction

Page 4: A Kernel Approach for Learning From Almost Orthogonal Patterns *

Motivation

Support vector machine (SVM):
- A powerful method for classification (or regression), with high accuracy comparable to neural networks
- Exploits kernel functions to separate patterns in a high-dimensional space
- The information in the training data is stored in the Gram matrix (kernel matrix)

The problem: the SVM doesn't perform well if the Gram matrix has large diagonal values.

Page 5: A Kernel Approach for Learning From Almost Orthogonal Patterns *

A Brief Review of SVM

For linearly separable patterns, to maximize the margin:

Minimize: $\|w\|^2$

Constraints: $y_i(w^\top x_i + b) \ge 1$

The margin equals $2/\|w\|$ and depends only on the closest points, those satisfying $y_i(w^\top x_i + b) = 1$.

[Figure: two point classes, $y_i = +1$ and $y_i = -1$, separated by a hyperplane, with the margin marked between the closest points of each class.]
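To make the formulation concrete, here is a minimal scikit-learn sketch (the toy data are invented for illustration, not from the slides); a very large C approximates the hard-margin problem above:

```python
import numpy as np
from sklearn.svm import SVC

# Invented, linearly separable toy data (illustration only).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1, 1, -1, -1])

# A very large C approximates the hard-margin SVM:
# minimize ||w||^2 subject to y_i (w^T x_i + b) >= 1.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]
print("w =", w, " b =", b)
print("margin = 2/||w|| =", 2.0 / np.linalg.norm(w))
print("closest points (support vectors):\n", clf.support_vectors_)
```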

Page 6: A Kernel Approach for Learning From Almost Orthogonal Patterns *

Kernel Approach for SVM (1/3)

For linearly non-separable patterns:
- A nonlinear mapping function $\phi(x) \in H$ maps the patterns to a new feature space $H$ of higher dimension (for example, the XOR problem)
- The SVM in the new feature space:

Minimize: $\|w\|^2$

Constraints: $y_i(w^\top \phi(x_i) + b) \ge 1$

The kernel trick: solving the above minimization problem requires 1) the explicit form of $\phi$, and 2) inner products in the high-dimensional space $H$. Both are avoided by a wise selection of kernel function with the property:

$k(x_i, x_j) = \phi(x_i) \cdot \phi(x_j)$
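The XOR example the slide mentions can be sketched directly; the specific gamma and coef0 values below are illustrative assumptions, not taken from the slides:

```python
import numpy as np
from sklearn.svm import SVC

# The XOR problem: not linearly separable in the input space.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, 1, 1, -1])

# k(x, x') = (x . x' + 1)^2 corresponds to an implicit quadratic map phi;
# that feature space contains an x1*x2 coordinate, so XOR becomes separable.
clf = SVC(kernel="poly", degree=2, gamma=1.0, coef0=1.0, C=1e6).fit(X, y)
print(clf.predict(X))   # expected: [-1  1  1 -1]
```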

Page 7: A Kernel Approach for Learning From Almost Orthogonal Patterns *

Kernel Approach for SVM (2/3)

Transform the problem with the kernel method:
- Expand $w$ in the new feature space: $w = \sum_i a_i \phi(x_i) = [\phi(x)]\, a$, where $[\phi(x)] = [\phi(x_1), \phi(x_2), \ldots, \phi(x_m)]$ and $a = [a_1, a_2, \ldots, a_m]^\top$
- Gram matrix: $K = [K_{ij}]$, where $K_{ij} = \phi(x_i) \cdot \phi(x_j) = k(x_i, x_j)$ (symmetric!)
- The (squared) objective function: $\|w\|^2 = a^\top [\phi(x)]^\top [\phi(x)]\, a = a^\top K a$ (a sufficient condition for existence of an optimal solution: $K$ is positive definite)
- The constraints: $y_i(w^\top \phi(x_i) + b) = y_i(a^\top [\phi(x)]^\top \phi(x_i) + b) = y_i(a^\top K_i + b) \ge 1$, where $K_i$ is the $i$th column of $K$

Minimize: $a^\top K a$

Constraints: $y_i(a^\top K_i + b) \ge 1$
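A small numerical check of the identity $\|w\|^2 = a^\top K a$ (a sketch with a linear kernel, so that $w$ can also be formed explicitly; the data and coefficients are invented):

```python
import numpy as np

def gram_matrix(X, k):
    # K[i, j] = k(x_i, x_j); symmetric for any valid kernel.
    return np.array([[k(xi, xj) for xj in X] for xi in X])

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
K = gram_matrix(X, k=lambda u, v: u @ v)   # linear kernel: phi(x) = x

# Expand w in the feature space: w = sum_i a_i phi(x_i).
a = np.array([0.5, -0.25, 1.0])
w = a @ X                                   # explicit only because phi(x) = x

# ||w||^2 equals a^T K a, so the optimization can run on K alone.
assert np.isclose(w @ w, a @ K @ a)
print(w @ w, a @ K @ a)
```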

Page 8: A Kernel Approach for Learning From Almost Orthogonal Patterns *

Kernel Approach for SVM (3/3)

To predict new data with a trained SVM:

$f(x_{test}) = w^\top \phi(x_{test}) + b = a^\top [\phi(x_1), \phi(x_2), \ldots, \phi(x_m)]^\top \phi(x_{test}) + b = a^\top [k(x_1, x_{test}), k(x_2, x_{test}), \ldots, k(x_m, x_{test})]^\top + b$

where $a$ and $b$ are the optimal solution obtained from the training data, and $m$ is the number of training data.

The explicit form of $k(x_i, x_j)$ is required for the prediction of new data.
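The prediction rule as a sketch (a hypothetical helper; the name and the linear-kernel usage are assumptions, and $a$, $b$ would come from a solved training problem):

```python
import numpy as np

def svm_predict(x_test, X_train, a, b, k):
    # f(x) = w . phi(x) + b = sum_i a_i k(x_i, x) + b: prediction needs only
    # kernel evaluations against the m training points, never phi itself.
    f = sum(a_i * k(x_i, x_test) for a_i, x_i in zip(a, X_train)) + b
    return np.sign(f)

# e.g. with a linear kernel: svm_predict(x_new, X_train, a, b, k=lambda u, v: u @ v)
```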

Page 9: A Kernel Approach for Learning From Almost Orthogonal Patterns *

Empirical Kernel Mapping

Assumption: $m$ (the number of instances) is a sufficiently high dimension for the new feature space, i.e. the patterns will be linearly separable in the $m$-dimensional space $R^m$.

Empirical kernel map: $\phi_m(x_i) = [k(x_i, x_1), k(x_i, x_2), \ldots, k(x_i, x_m)]^\top = K_i$

The new Gram matrix $K^m$ associated with $\phi_m(x)$: $K^m = [K^m_{ij}]$, where $K^m_{ij} = \phi_m(x_i) \cdot \phi_m(x_j) = K_i \cdot K_j = K_i^\top K_j$, i.e. $K^m = K^\top K = K K^\top$.

Advantage of the empirical kernel map: $K^m$ is positive definite, since $K^m = K K^\top = (U^\top D U)(U^\top D U)^\top = U^\top D^2 U$ ($K$ is symmetric, $U$ is a unitary matrix, $D$ is diagonal). This satisfies the sufficient condition of the minimization problem above.

The SVM in $R^m$:

Minimize: $\|w\|^2$

Constraints: $y_i(w^\top \phi_m(x_i) + b) \ge 1$
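A sketch verifying the positive (semi)definiteness claim numerically; the random data and linear kernel are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))
K = X @ X.T                      # linear-kernel Gram matrix (symmetric)

# Empirical kernel map: phi_m(x_i) = [k(x_i, x_1), ..., k(x_i, x_m)]^T = K_i,
# so the matrix whose row i is phi_m(x_i) is K itself.
Km = K @ K.T                     # new Gram matrix K^m = K K^T

print(np.linalg.eigvalsh(Km))    # all >= 0: K^m is positive (semi)definite
```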

Page 10: A Kernel Approach for Learning From Almost Orthogonal Patterns *

The Problem:

Almost Orthogonal Patterns in the Feature Space

Result in Poor Performance

Page 11: A Kernel Approach for Learning From Almost Orthogonal Patterns *

An Example of Almost Orthogonal Patterns

The training dataset with almost orthogonal patterns:

$X = \begin{pmatrix} 9&0&0&0&0&0&0&0&0&0 \\ 0&0&0&8&0&0&0&0&0&0 \\ 0&0&0&0&0&0&0&9&0&0 \\ 0&0&9&0&0&0&0&0&0&1 \\ 0&0&0&0&8&0&0&0&0&1 \\ 0&0&0&0&0&0&9&0&0&1 \end{pmatrix}, \quad Y = \begin{pmatrix} -1 \\ -1 \\ -1 \\ +1 \\ +1 \\ +1 \end{pmatrix}$

The Gram matrix with the linear kernel $k(x_i, x_j) = x_i \cdot x_j$ has large diagonals:

$K = \begin{pmatrix} 81&0&0&0&0&0 \\ 0&64&0&0&0&0 \\ 0&0&81&0&0&0 \\ 0&0&0&82&1&1 \\ 0&0&0&1&65&1 \\ 0&0&0&1&1&82 \end{pmatrix}$

The solution with the standard SVM:

$w^\top \approx (-0.11,\ 0,\ 0.11,\ -0.12,\ 0.12,\ 0,\ 0.11,\ -0.11,\ 0,\ 0.04)$, $b \approx 0.02$

Observation: each large entry in $w$ corresponds to a column of $X$ with only one large entry. $w$ becomes a lookup table, so the SVM won't generalize well.

A better solution:

$w = (0, 0, 0, 0, 0, 0, 0, 0, 0, 2)^\top$, $b = -1$
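The example is easy to reproduce (a sketch; a large C stands in for the hard margin, and the fitted weights are expected to show the lookup-table pattern described above, up to numerical detail):

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([
    [9, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 8, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 9, 0, 0],
    [0, 0, 9, 0, 0, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 8, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 0, 9, 0, 0, 1],
], dtype=float)
y = np.array([-1, -1, -1, 1, 1, 1])

K = X @ X.T                # diagonal (81..82) dwarfs the off-diagonal (0..1)
print(np.diag(K))

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # large C ~ hard margin
print(clf.coef_[0])        # expected: one moderate weight per "private" column
print(clf.intercept_[0])
```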

Page 12: A Kernel Approach for Learning From Almost Orthogonal Patterns *

Situations Leading to Almost Orthogonal Patterns

Sparsity of the patterns in the new feature space, e.g. $x = (0, 0, 0, 1, 0, 0, 1, 0)^\top$ and $x' = (0, 1, 1, 0, 0, 0, 0, 0)^\top$: then $x \cdot x,\ x' \cdot x' \gg x \cdot x'$ (large diagonals in the Gram matrix).

Some selections of kernel function may result in sparsity in the new feature space:
- String kernel (Watkins 2000, et al.)
- Polynomial kernel $k(x_i, x_j) = (x_i \cdot x_j)^d$ with large order $d$: if $x_i \cdot x_i > x_i \cdot x_j$ for $i \ne j$, then $k(x_i, x_i) \gg k(x_i, x_j)$ even for moderately large $d$, because the ratio grows exponentially with $d$ (see the sketch below)
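A sketch of how the power $d$ amplifies the diagonal for sparse patterns (the binary vectors below are invented):

```python
import numpy as np

# Two sparse binary patterns with 10 non-zeros each, overlapping in 2 positions.
x = np.zeros(100); x[0:10] = 1
z = np.zeros(100); z[8:18] = 1

for d in (1, 2, 4, 8):
    # Polynomial kernel k(u, v) = (u . v)^d: the diagonal-to-off-diagonal
    # ratio is (10/2)^d, growing exponentially with the order d.
    print(d, (x @ x) ** d / (x @ z) ** d)
```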

Page 13: A Kernel Approach for Learning From Almost Orthogonal Patterns *

Methods to Reduce the Large Diagonals of Gram Matrices

Page 14: A Kernel Approach for Learning From Almost Orthogonal Patterns *

Gram Matrix Transformation (1/2)

For a symmetric, positive definite Gram matrix $K$ (or $K^m$): $K = U^\top D U$, where $U$ is a unitary matrix and $D$ is a diagonal matrix.

Define $f(K) = U^\top f(D) U$ with $f(D)_{ii} = f(D_{ii})$, i.e. the function $f$ operates on the eigenvalues $\lambda_i$ of $K$:

$f(D) = \mathrm{diag}(f(\lambda_1), f(\lambda_2), \ldots, f(\lambda_m))$

$f(K)$ should preserve the positive definiteness of the Gram matrix.

A sample procedure for Gram matrix transformation (see the sketch below):
- (Optional) Compute the positive definite matrix $A = \mathrm{sqrt}(K)$
- Suppress the large diagonals of $A$ to obtain a symmetric $A'$, i.e. transform the eigenvalues of $A$: $[\lambda_{min}, \lambda_{max}] \to [f(\lambda_{min}), f(\lambda_{max})]$
- Compute the positive definite matrix $K' = (A')^2$
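A minimal numpy sketch of $f(K) = U^\top f(D) U$; the choice $f = \sqrt{\cdot}$ below is just one diagonal-suppressing example, and the random data are invented:

```python
import numpy as np

def transform_gram(K, f):
    # numpy returns K = U diag(eigvals) U^T with eigenvector columns in U;
    # apply f to the eigenvalues only, leaving the eigenvectors unchanged.
    eigvals, U = np.linalg.eigh(K)
    eigvals = np.maximum(eigvals, 0.0)   # clip tiny negative rounding noise
    return (U * f(eigvals)) @ U.T        # = U f(D) U^T

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 50))
K = X @ X.T                              # a positive definite Gram matrix

K_prime = transform_gram(K, np.sqrt)     # f = sqrt shrinks large eigenvalues most
print(np.round(np.diag(K), 1))
print(np.round(np.diag(K_prime), 1))     # noticeably smaller diagonal
```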

Page 15: A Kernel Approach for Learning From Almost Orthogonal Patterns *

Gram Matrix Transformation (2/2)

Effect of the matrix transformation: the explicit form of the new kernel function $k'$ is not available, yet $k'$ is required when the trained SVM is used to predict the testing data.

[Diagram: $\phi(x) \to K$ with $k(x_i, x_j) = \phi(x_i) \cdot \phi(x_j)$; the transformation $K \to K' = f(K)$ implicitly defines $\phi'(x)$ with $k'(x_i, x_j) = \phi'(x_i) \cdot \phi'(x_j)$.]

A solution: include all test data in $K$ before the matrix transformation $K \to K' = f(K)$, i.e. the testing data has to be known at training time. Then

$f(x_i) = w^\top \phi'(x_i) + b' = a'^\top [\phi'(x_1), \phi'(x_2), \ldots, \phi'(x_{m+n})]^\top \phi'(x_i) + b' = a'^\top K'_i + b'$

for $i = 1, 2, \ldots, m+n$, where $m$ is the number of training data and $n$ is the number of testing data. If $x_i$ has been used in calculating $K'$, the prediction on $x_i$ can simply use $K'_i$; $a'$ and $b'$ are obtained from the portion of $K'$ corresponding to the training data.
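The transductive workflow can be sketched as follows (the function name, linear kernel, and $f = \sqrt{\cdot}$ are assumptions for illustration):

```python
import numpy as np

def transformed_full_gram(X_train, X_test, f=np.sqrt):
    # k' has no explicit form, so the test patterns must be available now:
    # K' = f(K) is computed over all m + n patterns at once.
    X_all = np.vstack([X_train, X_test])
    K = X_all @ X_all.T                      # linear kernel for illustration
    eigvals, U = np.linalg.eigh(K)
    eigvals = np.maximum(eigvals, 0.0)       # clip rounding noise
    return (U * f(eigvals)) @ U.T            # K', of size (m+n) x (m+n)

# Train a', b' on the top-left m x m block of K'; to predict pattern x_i
# (training or test), evaluate f(x_i) = a'^T K'_i + b' using column i of K'.
```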

Page 16: A Kernel Approach for Learning From Almost Orthogonal Patterns *

An Approximate Approach Based on Statistics

Strictly, the empirical kernel map $\phi_{m+n}(x)$ over both training and test data should be used to calculate the Gram matrix. Assuming the dataset size $r$ is large:

$\phi_r(x) \cdot \phi_r(x') = [k(x, x_1), k(x, x_2), \ldots, k(x, x_r)] \cdot [k(x', x_1), k(x', x_2), \ldots, k(x', x_r)] = \sum_{i=1}^{r} k(x, x_i)\, k(x', x_i) \approx r \int_X k(x, x'')\, k(x', x'')\, dP(x'')$

i.e. $\frac{1}{r}\, \phi_r(x) \cdot \phi_r(x') \approx \int_X k(x, x'')\, k(x', x'')\, dP(x'')$, which is independent of $r$. Hence

$\frac{1}{m}\, \phi_m(x) \cdot \phi_m(x') \approx \frac{1}{m+n}\, \phi_{m+n}(x) \cdot \phi_{m+n}(x')$

Therefore, the SVM can simply be trained with the empirical map on the training set, $\phi_m(x)$, instead of $\phi_{m+n}(x)$.
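A quick numerical sketch of this approximation (the Gaussian toy distribution and kernel are invented): the normalized inner product of empirical maps barely changes when the map is extended from $m$ to $m+n$ points.

```python
import numpy as np

rng = np.random.default_rng(1)
k = lambda u, v: np.exp(-np.sum((u - v) ** 2))    # a fixed kernel

data = rng.normal(size=(2000, 3))                 # samples from P(x)
x, x2 = rng.normal(size=3), rng.normal(size=3)

def normalized_dot(points):
    # (1/r) phi_r(x) . phi_r(x') = (1/r) sum_i k(x, x_i) k(x', x_i)
    return np.mean([k(x, p) * k(x2, p) for p in points])

print(normalized_dot(data[:500]))     # using m = 500 points
print(normalized_dot(data))           # using m + n = 2000 points: nearly equal
```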

Page 17: A Kernel Approach for Learning From Almost Orthogonal Patterns *

Experiment Results

Page 18: A Kernel Approach for Learning From Almost Orthogonal Patterns *

Artificial Data (1/3)

String classification:
- String kernel function (Watkins 2000, et al.)
- Sub-polynomial kernel $k(x, y) = [\phi(x) \cdot \phi(y)]^P$ with $0 < P < 1$: for sufficiently small $P$, the large diagonals of $K$ can be suppressed (see the sketch below)
- 50 strings (25 for training and 25 for testing), 20 trials
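A sketch of the sub-polynomial suppression on a Gram matrix (an entrywise power of a non-negative base kernel; the sparse toy patterns are invented):

```python
import numpy as np

# Four sparse binary patterns; neighbouring patterns overlap in 2 of 10 non-zeros.
X = np.zeros((4, 50))
for i in range(4):
    X[i, 8 * i : 8 * i + 10] = 1

K = X @ X.T                              # diagonal 10, off-diagonal 0 or 2
for P in (1.0, 0.5, 0.2):
    Kp = K ** P                          # sub-polynomial kernel, 0 < P < 1
    off = Kp[~np.eye(4, dtype=bool)]
    # Diagonal-to-off-diagonal ratio shrinks from 5.0 toward 1 as P decreases.
    print(P, Kp.diagonal().mean() / off[off > 0].mean())
```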

Page 19: A Kernel Approach for Learning From Almost Orthogonal Patterns *

Artificial Data (2/3)

Microarray data with noise (Alon et al., 1999):
- 62 instances (22 positive, 40 negative), 2000 features in the original data
- 10,000 noise features were added (each entry non-zero with probability 1%)
- Error rate for the SVM without added noise: 0.18 ± 0.15

Page 20: A Kernel Approach for Learning From Almost Orthogonal Patterns *

Artificial Data (3/3)

Hidden variable problem:
- 10 hidden variables (attributes), plus 10 additional attributes that are nonlinear functions of the 10 hidden variables
- The original kernel is a polynomial kernel of order 4

Page 21: A Kernel Approach for Learning From Almost Orthogonal Patterns *

Real Data (1/3)

Thrombin binding problem:
- 1909 instances, 139,351 binary features
- 0.68% of the entries are non-zero
- 8-fold cross-validation

Page 22: A Kernel Approach for Learning From Almost Orthogonal Patterns *

Real Data (2/3)

Lymphoma classification (Alizadeh et al., 2000):
- 96 samples, 4026 features
- 10-fold cross-validation
- Improved results observed compared with previous work (Weston, 2001)

Page 23: A Kernel Approach for Learning From Almost Orthogonal Patterns *

Real Data (3/3)

Protein family classification (Murzin et al., 1995):
- Small positive set, large negative set
- Performance measured by the rate of false positives and the receiver operating characteristic (ROC) score (1: best score, 0: worst score)

Page 24: A Kernel Approach for Learning From Almost Orthogonal Patterns *

Conclusions

- The problem of degraded SVM performance due to almost orthogonal patterns was identified and analyzed
- The common situation in which sparse vectors lead to large diagonals was identified and discussed
- A Gram matrix transformation that suppresses the large diagonals was proposed to improve performance in such cases
- Experimental results show improved accuracy on various artificial and real datasets when the large diagonals of the Gram matrices are suppressed

Page 25: A Kernel Approach for Learning From Almost Orthogonal Patterns *

Comments

Strong points:
- The identification of the situations that lead to large diagonals in the Gram matrix, and the proposed Gram matrix transformation method for suppressing them
- The experiments are extensive

Weak points:
- The application of the Gram matrix transformation may be severely restricted in forecasting and other applications in which the testing data is not known at training time
- The proposed Gram matrix transformation method was not tested directly in the experiments; instead, transformed kernel functions were used
- Almost orthogonal patterns imply that multiple pattern vectors rarely lie in the same direction, so the necessary condition for the statistical approach to the pattern distribution is not satisfied

Page 26: A Kernel Approach for Learning From Almost Orthogonal Patterns *

End!