A Kernel Approach for Learning From Almost Orthogonal Patterns*
CIS 525 Class Presentation
Professor: Slobodan Vucetic
Presenter: Yilian Qin
* B. Schölkopf et al., Proc. 13th ECML, Aug 19-23, 2002, pp. 511-528.
Presentation Outline
- Introduction
  - Motivation
  - A brief review of SVM for linearly separable patterns
  - Kernel approach for SVM
  - Empirical kernel map
- Problem: almost orthogonal patterns in the feature space
  - An example
  - Situations leading to almost orthogonal patterns
- Methods to reduce large diagonals of the Gram matrix
  - Gram matrix transformation
  - An approximate approach based on statistics
- Experiments
  - Artificial data (string classification, microarray data with noise, hidden variable problem)
  - Real data (thrombin binding, lymphoma classification, protein family classification)
- Conclusions
- Comments
Introduction

Motivation
- Support vector machine (SVM): a powerful method for classification (or regression), with accuracy comparable to neural networks
- Exploits kernel functions to separate patterns in a high-dimensional feature space
- The information of the training data is stored in the Gram matrix (kernel matrix)

The problem: SVM does not perform well if the Gram matrix has large diagonal values.
A Brief Review of SVM

For linearly separable patterns, to maximize the margin:

Minimize: ||w||^2
Constraints: y_i (w^T x_i + b) >= 1

[Figure: separating hyperplane between the two classes (y_i = +1 and y_i = -1); the margin, 2/||w||, depends only on the closest points.]
For linearly non-separable patterns: nonlinear mapping function Φ(x): X → H
- Maps the patterns to a new feature space H of higher dimension (for example, the XOR problem)
- The SVM is then trained in the new feature space H

The kernel trick: solving the above minimization problem requires
1) the explicit form of Φ, and
2) inner products in the high-dimensional space H.

Simplification by a wise selection of kernel functions with the property:
k(x_i, x_j) = Φ(x_i) · Φ(x_j)
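As a concrete illustration of this property, the degree-2 polynomial kernel on the XOR-style inputs mentioned above equals the inner product under an explicit feature map. A minimal numpy sketch (the data and the particular feature map are my choices for illustration):

```python
import numpy as np

def poly_kernel(X, Y, d=2):
    """Polynomial kernel k(x, y) = (x . y)^d, computed for all pairs."""
    return (X @ Y.T) ** d

X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])  # XOR-style inputs
K = poly_kernel(X, X)

# Explicit degree-2 feature map Φ(x) = (x1^2, x2^2, sqrt(2) x1 x2)
def phi(x):
    return np.array([x[0]**2, x[1]**2, np.sqrt(2) * x[0] * x[1]])

Phi = np.array([phi(x) for x in X])
# k(xi, xj) = Φ(xi) · Φ(xj): the kernel avoids computing Φ explicitly
assert np.allclose(K, Phi @ Phi.T)
```

The kernel evaluates the inner product in H without ever forming Φ(x), which is what makes very high-dimensional feature spaces tractable.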
Kernel Approach for SVM (1/3)

Minimize: ||w||^2
Constraints: y_i (w^T Φ(x_i) + b) >= 1

Transform the problem with the kernel method:
- Expand w in the new feature space: w = Σ_i a_i Φ(x_i) = [Φ(x)] a, where [Φ(x)] = [Φ(x_1), Φ(x_2), …, Φ(x_m)] and a = [a_1, a_2, …, a_m]^T
- Gram matrix: K = [K_ij], where K_ij = Φ(x_i) · Φ(x_j) = k(x_i, x_j) (symmetric!)
- The (squared) objective function: ||w||^2 = a^T [Φ(x)]^T [Φ(x)] a = a^T K a (sufficient condition for the existence of an optimal solution: K is positive definite)
- The constraints: y_i {w^T Φ(x_i) + b} = y_i {a^T [Φ(x)]^T Φ(x_i) + b} = y_i {a^T K_i + b} >= 1, where K_i is the i-th column of K
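The identity ||w||^2 = a^T K a can be checked numerically. A small sketch with made-up feature vectors and an arbitrary coefficient vector a:

```python
import numpy as np

rng = np.random.default_rng(0)
Phi = rng.normal(size=(5, 3))          # rows: Φ(x_1), ..., Φ(x_5) in feature space
a = rng.normal(size=5)                 # expansion coefficients
w = Phi.T @ a                          # w = Σ_i a_i Φ(x_i)
K = Phi @ Phi.T                        # Gram matrix K_ij = Φ(x_i) · Φ(x_j)

assert np.allclose(w @ w, a @ K @ a)   # ||w||^2 = a^T K a
assert np.allclose(K, K.T)             # K is symmetric
```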
Kernel Approach for SVM (2/3)

Minimize: a^T K a
Constraints: y_i (a^T K_i + b) >= 1

To predict new data with a trained SVM, the explicit form of k(x_i, x_j) is required.
Kernel Approach for SVM (3/3)

f(x_test) = w^T Φ(x_test) + b
          = a^T [Φ(x_1), Φ(x_2), …, Φ(x_m)]^T Φ(x_test) + b
          = a^T [k(x_1, x_test), k(x_2, x_test), …, k(x_m, x_test)]^T + b

where a and b are the optimal solution based on the training data, and m is the number of training data.
Empirical Kernel Mapping

Assumption: m (the number of instances) is a sufficiently high dimension for the new feature space, i.e., the patterns will be linearly separable in the m-dimensional space R^m.

Empirical kernel map: Φ_m(x_i) = [k(x_i, x_1), k(x_i, x_2), …, k(x_i, x_m)]^T = K_i

The SVM in R^m:

Minimize: ||w||^2
Constraints: y_i (w^T Φ_m(x_i) + b) >= 1

The new Gram matrix K_m associated with Φ_m(x):
K_m = [K_m,ij], where K_m,ij = Φ_m(x_i) · Φ_m(x_j) = K_i · K_j = K_i^T K_j, i.e., K_m = K^T K = K K^T

Advantage of the empirical kernel map: K_m is positive definite:
K_m = K K^T = (U^T D U)(U^T D U)^T = U^T D^2 U (K is symmetric, U is unitary, D is diagonal),
which satisfies the sufficient condition of the above minimization problem.
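The empirical kernel map and the identity K_m = K K^T can be sketched as follows (synthetic data, linear kernel chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(6, 4))

def k(x, y):
    return float(x @ y)   # linear kernel, for illustration only

m = len(X)
K = np.array([[k(X[i], X[j]) for j in range(m)] for i in range(m)])

# Empirical kernel map: Φ_m(x_i) = [k(x_i, x_1), ..., k(x_i, x_m)]^T = K_i,
# so the matrix whose rows are Φ_m(x_i) is just K itself
Phi_m = K.copy()
Km = Phi_m @ Phi_m.T                              # K_m = K K^T (= K^T K, since K is symmetric)

assert np.allclose(Km, K @ K)
assert np.min(np.linalg.eigvalsh(Km)) >= -1e-9    # K_m is positive semidefinite
```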
The Problem: Almost Orthogonal Patterns in the Feature Space Result in Poor Performance

An Example of Almost Orthogonal Patterns

The training dataset with almost orthogonal patterns:

X =
[ 9 0 0 0 0 0 0 0 0 0 ]
[ 0 0 0 8 0 0 0 0 0 0 ]
[ 0 0 0 0 0 0 0 9 0 0 ]
[ 0 0 9 0 0 0 0 0 0 1 ]
[ 0 0 0 0 8 0 0 0 0 1 ]
[ 0 0 0 0 0 0 9 0 0 1 ]

Y = [ 1, 1, 1, -1, -1, -1 ]^T

The Gram matrix with the linear kernel k(x_i, x_j) = x_i · x_j:

K =
[ 81  0  0  0  0  0 ]
[  0 64  0  0  0  0 ]
[  0  0 81  0  0  0 ]
[  0  0  0 82  1  1 ]
[  0  0  0  1 65  1 ]
[  0  0  0  1  1 82 ]

The solution with a standard SVM:
w^T = (0.11, 0, -0.11, 0.12, -0.12, 0, -0.11, 0.11, 0, -0.04), b = 0.02

Observation: each large entry in w corresponds to a column of X with only one large entry; w becomes a lookup table, and the SVM will not generalize well.

A better solution:
w^T = (0, 0, 0, 0, 0, 0, 0, 0, 0, -2), b = 1
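The example above can be reproduced numerically; a small sketch checking that both weight vectors separate the training data, with X, Y, and the two solutions transcribed from the slide:

```python
import numpy as np

X = np.array([
    [9, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 8, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 9, 0, 0],
    [0, 0, 9, 0, 0, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 8, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 0, 9, 0, 0, 1],
], dtype=float)
Y = np.array([1, 1, 1, -1, -1, -1], dtype=float)

K = X @ X.T                   # linear-kernel Gram matrix: diagonal dominates
print(np.diag(K))             # 81, 64, 81, 82, 65, 82

# "Lookup table" SVM solution vs. the simpler, better-generalizing one
w_svm = np.array([0.11, 0, -0.11, 0.12, -0.12, 0, -0.11, 0.11, 0, -0.04]); b_svm = 0.02
w_good = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, -2.0]); b_good = 1.0

assert np.all(Y * (X @ w_svm + b_svm) > 0)    # both separate the training data...
assert np.all(Y * (X @ w_good + b_good) > 0)  # ...but w_good uses only the shared 10th feature
```

The standard solution spends one weight per large entry of X, while the better one exploits the single feature that actually distinguishes the classes.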
Situations Leading to Almost Orthogonal Patterns (Large Diagonals)

- Sparsity of the patterns in the feature space, e.g.
  x = [0, 0, 0, 1, 0, 0, 1, 0]^T
  y = [0, 1, 1, 0, 0, 0, 0, 0]^T
  Here x · x and y · y >> x · y (large diagonals in the Gram matrix).
- Some choices of kernel function may result in sparsity in the new feature space, e.g. the string kernel (Watkins 2000, et al.).
- Polynomial kernel k(x_i, x_j) = (x_i · x_j)^d with large order d: if x_i · x_i > x_i · x_j for i ≠ j, then k(x_i, x_i) >> k(x_i, x_j) even for moderately large d, because raising to the d-th power amplifies the gap.
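The amplification by the d-th power can be seen directly. A small sketch using the sparse vectors above plus a second, near-orthogonal pair (the latter is my own illustration):

```python
import numpy as np

x = np.array([0., 0., 0., 1., 0., 0., 1., 0.])
y = np.array([0., 1., 1., 0., 0., 0., 0., 0.])

d = 4                    # moderately large polynomial order
k_xx = (x @ x) ** d      # (2)^4 = 16
k_xy = (x @ y) ** d      # (0)^4 = 0: exactly orthogonal sparse vectors

# Near-orthogonal vectors: the diagonal term still dominates after the power
u = np.array([1.0, 0.1]); v = np.array([0.1, 1.0])
print((u @ u) ** d, (u @ v) ** d)   # ~1.04 vs ~0.0016
```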
Methods to Reduce the Large Diagonals of Gram Matrices

Gram Matrix Transformation (1/2)

For a symmetric, positive definite Gram matrix K (or K_m):
K = U^T D U, where U is a unitary matrix and D is a diagonal matrix.

Define f(K) = U^T f(D) U, with f(D) = diag(f(λ_1), f(λ_2), …, f(λ_m)), i.e., the function f operates on the eigenvalues λ_i of K. f(K) should preserve the positive definiteness of the Gram matrix.

A sample procedure for Gram matrix transformation:
- (Optional) Compute the positive definite matrix A = sqrt(K)
- Suppress the large diagonals of A, obtaining a symmetric A', i.e., transform the eigenvalues of A: [λ_min, λ_max] → [f(λ_min), f(λ_max)]
- Compute the positive definite matrix K' = (A')^2
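A spectral transformation of this kind can be sketched as follows, with the eigenvalue map f(λ) = sqrt(λ) used here to compute A = sqrt(K) (numpy's `eigh` writes the decomposition as K = U diag(λ) U^T):

```python
import numpy as np

def transform_gram(K, f):
    """Apply f to the eigenvalues of a symmetric PSD Gram matrix: f(K) = U f(D) U^T."""
    lam, U = np.linalg.eigh(K)        # columns of U are orthonormal eigenvectors
    lam = np.clip(lam, 0.0, None)     # guard against tiny negative round-off
    return (U * f(lam)) @ U.T         # U diag(f(lam)) U^T

rng = np.random.default_rng(2)
X = rng.normal(size=(5, 50))          # high-dimensional patterns
K = X @ X.T                           # linear-kernel Gram matrix, large diagonal

A = transform_gram(K, np.sqrt)        # A = sqrt(K)

# f preserves positive semidefiniteness, and A^2 recovers K
assert np.min(np.linalg.eigvalsh(A)) >= -1e-9
assert np.allclose(A @ A, K, atol=1e-6)
```

In the full procedure one would additionally flatten the largest eigenvalues of A before squaring, which is what suppresses the diagonal.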
Gram Matrix Transformation (2/2)

Effect of the matrix transformation (an implicit transformation of the kernel):

Φ(x) → K, with k(x_i, x_j) = Φ(x_i) · Φ(x_j)
K → K' = f(K), implicitly defining Φ'(x) with k'(x_i, x_j) = Φ'(x_i) · Φ'(x_j)

The explicit form of the new kernel function k' is not available, yet k' is required when the trained SVM is used to predict the testing data. A solution: include all test data in K before the matrix transformation K → K', i.e., the testing data has to be known at training time.
The prediction then becomes:

f(x_i) = w'^T Φ'(x_i) + b'
       = a'^T [Φ'(x_1), Φ'(x_2), …, Φ'(x_{m+n})]^T Φ'(x_i) + b'
       = a'^T K'_i + b',   i = 1, 2, …, m+n

where K' = f(K), m is the number of training data, and n is the number of testing data. The optimal a' and b' are obtained from the portion of K' corresponding to the training data. If x_i has been used in calculating K', the prediction on x_i can simply use the column K'_i.
An Approximate Approach Based on Statistics

With the test data included, the empirical kernel map Φ_{m+n}(x) should be used to calculate the Gram matrix. Assuming the dataset size r is large:

Φ_r(x) · Φ_r(x') = [k(x, x_1), k(x, x_2), …, k(x, x_r)] · [k(x', x_1), k(x', x_2), …, k(x', x_r)]
                 = Σ_{i=1}^r k(x, x_i) k(x', x_i)
                 ≈ r ∫_X k(x, x'') k(x', x'') dP(x'')

so that

(1/m) Φ_m(x) · Φ_m(x') ≈ (1/(m+n)) Φ_{m+n}(x) · Φ_{m+n}(x')

Therefore, the SVM can simply be trained with the empirical kernel map on the training set, Φ_m(x), instead of Φ_{m+n}(x).
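This approximation can be sanity-checked numerically with synthetic data; a sketch using an RBF kernel and a Gaussian sample standing in for P (both my choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)

def k(x, y):
    return np.exp(-np.sum((x - y) ** 2))   # RBF kernel, values in (0, 1]

data = rng.normal(size=(600, 2))           # i.i.d. sample standing in for P
m, n = 400, 200
x, xp = rng.normal(size=2), rng.normal(size=2)

def emp_map(x, pts):
    """Empirical kernel map: Φ_r(x) = [k(x, x_1), ..., k(x, x_r)]."""
    return np.array([k(x, p) for p in pts])

# Both sides estimate the same integral ∫ k(x, x'') k(x', x'') dP(x'')
lhs = emp_map(x, data[:m]) @ emp_map(xp, data[:m]) / m
rhs = emp_map(x, data) @ emp_map(xp, data) / (m + n)
print(lhs, rhs)
```

The two normalized inner products agree up to Monte Carlo error, which is why training on Φ_m alone is a reasonable substitute for Φ_{m+n}.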
Experiment Results

Artificial Data (1/3)

String classification:
- String kernel function (Watkins 2000, et al.)
- Sub-polynomial kernel k(x, y) = [Φ(x) · Φ(y)]^P, 0 < P < 1: for sufficiently small P, the large diagonals of K can be suppressed
- 50 strings (25 for training and 25 for testing), 20 trials
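The sub-polynomial rescaling can be sketched directly on a Gram matrix (illustrative sparse binary data, not the paper's strings):

```python
import numpy as np

rng = np.random.default_rng(4)
X = (rng.random((6, 200)) < 0.05).astype(float)   # sparse binary patterns
K = X @ X.T                                       # large diagonal, small off-diagonal

P = 0.3                                           # sub-polynomial exponent, 0 < P < 1
K_sub = K ** P                                    # k'(x, y) = [k(x, y)]^P, K >= 0 here

def dominance(M):
    """Rough measure: mean diagonal over mean off-diagonal entry."""
    off = M - np.diag(np.diag(M))
    return np.mean(np.diag(M)) / (np.mean(off) + 1e-12)

print(dominance(K), dominance(K_sub))   # diagonal dominance is reduced after rescaling
```

Large kernel values are compressed far more than small ones by the P-th power, so the diagonal shrinks relative to the rest of the matrix.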
Artificial Data (2/3)

Microarray data with noise (Alon et al., 1999):
- 62 instances (22 positive, 44 negative), 2000 features in the original data
- 10,000 noise features were added (each with 1% probability of being non-zero)
- The error rate for SVM without noise addition is 0.18 ± 0.15
Artificial Data (3/3)

Hidden variable problem:
- 10 hidden variables (attributes), plus 10 additional attributes that are nonlinear functions of the 10 hidden variables
- The original kernel is a polynomial kernel of order 4
Real Data (1/3)

Thrombin binding problem:
- 1909 instances, 139,351 binary features; 0.68% of the entries are non-zero
- 8-fold cross-validation
Real Data (2/3)

Lymphoma classification (Alizadeh et al., 2000):
- 96 samples, 4026 features, 10-fold cross-validation
- Improved results observed compared with previous work (Weston, 2001)
Real Data (3/3)

Protein family classification (Murzin et al., 1995):
- Small positive set, large negative set
- Performance measured by the rate of false positives and the receiver operating characteristic (ROC) score: 1 is the best score, 0 the worst
Conclusions
- The problem of degraded SVM performance due to almost orthogonal patterns was identified and analyzed
- The common situation in which sparse vectors lead to large diagonals was identified and discussed
- A method of Gram matrix transformation to suppress the large diagonals was proposed to improve performance in such cases
- Experiment results show improved accuracy on various artificial and real datasets when the large diagonals of the Gram matrices are suppressed
Comments

Strong points:
- The identification of the situations that lead to large diagonals in the Gram matrix, and the proposed Gram matrix transformation method for suppressing them
- The experiments are extensive

Weak points:
- The application of the Gram matrix transformation may be severely restricted in forecasting or other applications in which the testing data is not known at training time
- The proposed Gram matrix transformation method was not tested directly by experiments; instead, transformed kernel functions were used in the experiments
- Almost orthogonal patterns imply that multiple pattern vectors in the same direction rarely exist; therefore, the necessary condition for the statistical approach to the pattern distribution is not satisfied
End!