Basics of Kernel Methods in Statistical Learning Theory
Mohammed Nasser
Professor, Department of Statistics
Rajshahi University
E-mail: [email protected]
Contents
Glimpses of Historical Development
Definition and Examples of Kernels
Some Mathematical Properties of Kernels
Construction of Kernels
Heuristic Presentation of Kernel Methods
Meaning of Kernels
Mercer Theorem and Its Latest Development
Direction of Future Development
Conclusion
Computer Scientists’ Contribution to Statistics: Kernel Methods
(Photos: Jerome H. Friedman, Vladimir Vapnik)
Early History
In 1900 Karl Pearson published his famous article on goodness of fit, judged one of the twelve best scientific articles of the twentieth century.
In 1902 Jacques Hadamard pointed out that mathematical models of physical phenomena should have the following properties:
A solution exists
The solution is unique
The solution depends continuously on the data, in some reasonable topology
(a well-posed problem)
Early History
In 1940 Fréchet, a PhD student of Hadamard, strongly criticized the mean and standard deviation as measures of location and scale respectively. But he expressed his belief in the development of statistics without proposing any alternative.
During the sixties and seventies, Tukey, Huber and Hampel tried to develop robust statistics in order to remove the ill-posedness of classical statistics.
Robustness means insensitivity to minor changes in both model and sample, high tolerance to major changes, and good performance at the model.
The onslaught of data mining and the problems of non-linearity and non-vectorial data have made robust statistics somewhat unattractive.
Let us see what kernel methods present…
Recent History
Support Vector Machines (SVM), introduced at COLT-92 (Conference on Learning Theory), have developed greatly since then.
Result: a class of algorithms for pattern recognition (kernel machines).
Now: a large and diverse community, from machine learning, optimization, statistics, neural networks, functional analysis, etc.
Centralized website: www.kernel-machines.org
First textbook (2000): see www.support-vector.net
Now (2012): at least twenty books of different tastes are available in the international market.
The book “The Elements of Statistical Learning” (2001) by Hastie, Tibshirani and Friedman went into a second edition within seven years.
History: More
David Hilbert used the German word ‘Kern’ in his first paper on integral equations (Hilbert 1904).
The mathematical result underlying the kernel trick, Mercer's theorem, is almost a century old (Mercer 1909). It tells us that any ‘reasonable’ kernel function corresponds to some feature space.
Which kernels can be used to compute distances in feature spaces was worked out by Schoenberg (1938).
The methods for representing kernels in linear spaces were first studied by Kolmogorov (1941) for a countable input domain.
The method for representing kernels in linear spaces in the general case was developed by Aronszajn (1950).
Dunford and Schwartz (1963) showed that Mercer's theorem also holds true for general compact spaces.
History: More
The use of Mercer's theorem for interpreting kernels as inner products in a feature space was introduced into machine learning by Aizerman, Braverman and Rozonoer (1964).
Berg, Christensen and Ressel (1984) published a good monograph on the theory of kernels.
Saitoh (1988) showed the connection between positivity (a ‘positive matrix’ as defined in Aronszajn (1950)) and the positive semi-definiteness of all finite-set kernel matrices.
Reproducing kernels were used extensively in machine learning and neural networks by Poggio and Girosi; see for example Poggio and Girosi (1990), a paper on radial basis function networks.
The theory of kernels was used in approximation and regularization theory, and the first chapter of Spline Models for Observational Data (Wahba 1990) gave a number of theoretical results on kernel functions.
Kernel methods: Heuristic View
What is the common characteristic (structure) among the following statistical methods?
1. Principal components analysis
2. (Ridge) regression
3. Fisher discriminant analysis
4. Canonical correlation analysis
5. Singular value decomposition
6. Independent component analysis
We consider linear combinations of the input vector: $f(x) = w^T x$.
We make use of the concepts of length and dot product available in Euclidean space.
Their kernelized counterparts: KPCA, SVR, KFDA, KCCA, KICA.
Kernel methods: Heuristic View
• Linear learning typically has nice properties
– Unique optimal solutions, fast learning algorithms
– Better statistical analysis
• But one big problem
– Insufficient capacity: in many data sets it fails to detect nonlinear relationships among the variables.
• The other demerit
– Cannot handle non-vectorial data
Data
Vectors: collections of features, e.g. height, weight, blood pressure, age, . . . (categorical variables can be mapped into vectors)
Matrices: images, movies, remote sensing and satellite data (multispectral)
Strings: documents, gene sequences
Structured objects: XML documents, graphs
Kernel methods: Heuristic View
Genome-wide data (gene, protein):
mRNA expression data
protein-protein interaction data
hydrophobicity data
sequence data
Definition of Kernels
Definition: A finitely positive semi-definite function $k: X \times X \to \mathbb{R}$ is a symmetric function of its arguments for which the matrix formed by restriction to any finite subset of points is positive semi-definite: $\alpha^T K \alpha \ge 0$ for all $\alpha$.
It is a generalized dot product.
It is not generally bilinear.
But it obeys the Cauchy–Schwarz inequality.
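To make the finite restriction concrete, here is a minimal sketch (Python with numpy, both illustrative choices not used in the deck) that tests positive semi-definiteness of the kernel matrix formed on a finite set of points:

```python
# A minimal sketch of the finite PSD check in the definition: restrict k to
# a finite set of points, form K = [k(x_i, x_j)], and verify symmetry plus
# non-negativity (up to rounding) of all eigenvalues.
import numpy as np

def is_finitely_psd(k, points, tol=1e-10):
    """Check alpha^T K alpha >= 0 via the eigenvalues of K = [k(x_i, x_j)]."""
    K = np.array([[k(x, y) for y in points] for x in points])
    if not np.allclose(K, K.T):          # a kernel must be symmetric
        return False
    return np.linalg.eigvalsh(K).min() >= -tol

# Example with the ordinary dot product on a few random points in R^3:
rng = np.random.default_rng(0)
pts = list(rng.normal(size=(5, 3)))
print(is_finitely_psd(lambda x, y: x @ y, pts))   # True
```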
Kernel Methods: Basic Ideas (Proper Kernel)
Theorem (Aronszajn, 1950): A function $k: X \times X \to \mathbb{R}$ can be written as $k(x,y) = \langle \phi(x), \phi(y) \rangle$, where $\phi: x \mapsto \phi(x) \in F$ is a feature map, iff $k(x,y)$ satisfies the finite positive semi-definiteness property.
We can now check whether $k(x,y)$ is a proper kernel using only properties of $k(x,y)$ itself, i.e. without needing to know the feature map $\phi$! If the map is needed, we may take the help of Mercer's theorem.
$k(x,y) = \langle \phi(x), \phi(y) \rangle$ is always a kernel. When is the converse true?
Kernel methods consist of two modules:
1) The choice of kernel (this is non-trivial)
2) The algorithm which takes kernels as input
Modularity: any kernel can be used with any kernel algorithm.
Some kernels (coded as a sketch after the list of algorithms):
$$k(x,y) = e^{-\|x-y\|^2 / c} \quad \text{(Gaussian)}$$
$$k(x,y) = \langle x, y \rangle^d \quad \text{(polynomial)}$$
$$k(x,y) = \tanh(\kappa \langle x, y \rangle + \theta) \quad \text{(sigmoid)}$$
$$k(x,y) = \frac{1}{\sqrt{\|x-y\|^2 + c^2}} \quad \text{(inverse multiquadric)}$$
Some kernel algorithms:
- support vector machine
- Fisher discriminant analysis
- kernel regression
- kernel PCA
- kernel CCA
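As a sketch of what these formulas look like in code (Python/numpy assumed; the parameter names c, theta, kappa, d are illustrative defaults, not values from the deck):

```python
# The four kernels listed above, written directly from their formulas.
import numpy as np

def gaussian(x, y, c=1.0):
    return np.exp(-np.linalg.norm(x - y) ** 2 / c)

def polynomial(x, y, d=3):
    return (x @ y) ** d

def sigmoid(x, y, kappa=1.0, theta=0.0):
    return np.tanh(kappa * (x @ y) + theta)      # not PSD for all parameters

def inverse_multiquadric(x, y, c=1.0):
    return 1.0 / np.sqrt(np.linalg.norm(x - y) ** 2 + c ** 2)

x, y = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(gaussian(x, y), polynomial(x, y), sigmoid(x, y), inverse_multiquadric(x, y))
```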
Reproducing Kernel Hilbert Space
Let $X$ be a set. A Hilbert space $H$ consisting of functions on $X$ is called a reproducing kernel Hilbert space (RKHS) if the evaluation functional
$$e_x: H \to \mathbb{R}, \quad f \mapsto f(x)$$
is continuous for each $x \in X$.
A Hilbert space $H$ consisting of functions on $X$ is a RKHS if and only if there exists $k(\cdot, x) \in H$ (a reproducing kernel) such that
$$\langle f, k(\cdot, x) \rangle_H = f(x) \quad \text{for all } f \in H,\ x \in X$$
(by Riesz's lemma).
Reproducing Kernel Hilbert Space II
Theorem (construction of RKHS): If $k: X \times X \to \mathbb{R}$ is positive definite, there uniquely exists a RKHS $H_k$ on $X$ such that
(1) $k(\cdot, x) \in H_k$ for all $x \in X$,
(2) the linear hull of $\{ k(\cdot, x) \mid x \in X \}$ is dense in $H_k$,
(3) $k$ is a reproducing kernel of $H_k$, i.e.,
$$\langle f, k(\cdot, x) \rangle_{H_k} = f(x) \quad \text{for all } f \in H_k,\ x \in X.$$
At this moment we put no structure on $X$. To have better properties of the members of $H_k$ we have to put extra structure on $X$ and assume additional properties of $k$.
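To make the construction tangible, here is a minimal sketch (numpy assumed; the Gaussian kernel, centers and coefficients are illustrative choices) of an element $f = \sum_i a_i k(\cdot, x_i)$ of the linear hull in (2), its evaluation via the reproducing property, and its squared norm $\|f\|_{H_k}^2 = a^T K a$:

```python
# Elements of H_k, concretely: finite combinations f = sum_i a_i k(., x_i).
# The reproducing property gives f(x) = sum_i a_i k(x_i, x), and
# ||f||^2 = a^T K a where K is the Gram matrix of the centers.
import numpy as np

k = lambda x, y: np.exp(-(x - y) ** 2)          # Gaussian kernel on R
centers = np.array([-1.0, 0.0, 2.0])            # the x_i defining f
a = np.array([0.5, -1.0, 0.3])                  # coefficients a_i

f = lambda x: sum(ai * k(xi, x) for ai, xi in zip(a, centers))
K = k(centers[:, None], centers[None, :])       # Gram matrix of the centers
print(f(1.0), a @ K @ a)                        # an evaluation and ||f||^2
```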
Classification
We learn $Y = g(X)$, where $X$ can be anything:
• continuous ($\mathbb{R}$, $\mathbb{R}^d$, …)
• discrete ({0,1}, {1,…,k}, …)
• structured (tree, string, …)
• …
and $Y$ is discrete:
– {0,1}: binary
– {1,…,k}: multi-class
– tree, etc.: structured
Classification
$X$ can be anything:
• continuous ($\mathbb{R}$, $\mathbb{R}^d$, …)
• discrete ({0,1}, {1,…,k}, …)
• structured (tree, string, …)
• …
Algorithms: Perceptron, logistic regression, support vector machine (all admit the kernel trick); decision tree, random forest.
Regression
We learn $Y = g(X)$, where $X$ can be anything:
• continuous ($\mathbb{R}$, $\mathbb{R}^d$, …)
• discrete ({0,1}, {1,…,k}, …)
• structured (tree, string, …)
• …
and $Y$ is continuous ($\mathbb{R}$, $\mathbb{R}^d$), though not always.
Regression
$X$ can be anything:
• continuous ($\mathbb{R}$, $\mathbb{R}^d$, …)
• discrete ({0,1}, {1,…,k}, …)
• structured (tree, string, …)
• …
Algorithms: Perceptron, normal regression, support vector regression (all admit the kernel trick); GLM.
Kernel Methods: Heuristic View
Steps for Kernel Methods:
DATA MATRIX → Kernel matrix $K = [k(x_i, x_j)]$, a positive semi-definite matrix → Algorithm → Pattern function $f(x) = \sum_i \alpha_i k(x_i, x)$.
What $k$? Traditional or non-traditional.
Why p.s.d.? (A sketch of this pipeline follows.)
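The whole pipeline above can be sketched in a few lines (Python/numpy assumed; the Gaussian kernel, the ridge-style solve for $\alpha$, and the toy data are illustrative choices, not the only possible "algorithm" box):

```python
# Generic pipeline: data matrix -> PSD kernel matrix K -> algorithm giving
# coefficients alpha -> pattern function f(x) = sum_i alpha_i k(x_i, x).
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 2))                      # data matrix (n x d)
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=20)   # toy targets

def k(a, b, c=1.0):                               # Gaussian kernel
    return np.exp(-np.sum((a - b) ** 2) / c)

K = np.array([[k(xi, xj) for xj in X] for xi in X])   # kernel matrix
alpha = np.linalg.solve(K + 0.1 * np.eye(len(X)), y)  # one choice of algorithm

f = lambda x: alpha @ np.array([k(xi, x) for xi in X])  # pattern function
print(f(X[0]), y[0])
```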
Kernel Methods: Basic Ideas
The kernel methods approach is to stick with linear functions but work in a high-dimensional feature space:
$$k(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle$$
The expectation is that the feature space has a much higher dimension than the input space, and the feature space carries an inner product as above.
Kernel methods: Heuristic View (Form of functions)
• So kernel methods use linear functions in a feature space: $f(x) = \langle w, \phi(x) \rangle$.
• For regression this could be the function itself; for classification we also require thresholding, e.g. $\operatorname{sign}(f(x))$.
Kernel methods: Heuristic View (Feature spaces)
$$\phi: x \mapsto \phi(x) \in F, \quad x \in \mathbb{R}^d$$
A non-linear mapping to $F$:
1. a high-dimensional space
2. an infinite-dimensional countable space ($\ell_2$)
3. a function space (Hilbert space, $L_2$)
Example: $\phi(x, y) = (x^2, y^2, \sqrt{2}\,xy)$
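A quick numerical check (numpy assumed; the test points are arbitrary) that this explicit map reproduces the degree-2 polynomial kernel $k(x, y) = \langle x, y \rangle^2$ on $\mathbb{R}^2$ without ever visiting the feature space:

```python
# phi(x) = (x1^2, x2^2, sqrt(2) x1 x2) gives <phi(x), phi(y)> = <x, y>^2.
import numpy as np

phi = lambda x: np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

x, y = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(phi(x) @ phi(y))   # inner product in feature space
print((x @ y) ** 2)      # same number, computed directly in input space
```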
Kernel methods: Heuristic View (Example)
• Consider the mapping $\phi(x_1, x_2) = (x_1^2, x_2^2, \sqrt{2}\,x_1 x_2)$.
• Let us consider a linear equation in this feature space:
$$a x_1^2 + b x_2^2 + \sqrt{2}\,c\, x_1 x_2 = 0$$
• We actually have an ellipse, i.e. a non-linear shape in the input space.
Kernel methods: Heuristic View
Ridge Regression (duality)
Problem:
$$\min_w \sum_i (y_i - w^T x_i)^2 + \lambda \|w\|^2$$
(target $y_i$, input $x_i$, regularization $\lambda \|w\|^2$)
Solution:
$$w = (X^T X + \lambda I_d)^{-1} X^T y$$
(a $d \times d$ inverse), or equivalently, in the dual representation,
$$w = X^T (X X^T + \lambda I_n)^{-1} y = X^T \alpha, \quad \alpha = (G + \lambda I_n)^{-1} y$$
(an $n \times n$ inverse), where $G_{ij} = \langle x_i, x_j \rangle$ is the Gram matrix of inner products of the observations.
Dual representation: $w = \sum_i \alpha_i x_i$ is a linear combination of the data, and
$$f(x) = w^T x = \sum_i \alpha_i \langle x_i, x \rangle.$$
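A small numerical sanity check (numpy assumed; the data are synthetic) that the primal and dual solutions above coincide:

```python
# Primal w = (X^T X + lam I_d)^{-1} X^T y equals dual w = X^T (X X^T + lam I_n)^{-1} y,
# since (X^T X + lam I) X^T = X^T (X X^T + lam I).
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 3))                       # n = 50 points in d = 3
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)
lam = 0.5

w_primal = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)   # d x d solve
alpha = np.linalg.solve(X @ X.T + lam * np.eye(50), y)           # n x n solve
w_dual = X.T @ alpha                                             # w = sum_i alpha_i x_i

print(np.allclose(w_primal, w_dual))   # True
```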
Kernel methods: Heuristic View (Kernel trick)
Note: in the dual representation we used the Gram matrix to express the solution.
Kernel trick: replace $x \mapsto \phi(x)$, so that
$$G_{ij} = \langle x_i, x_j \rangle \;\longrightarrow\; G_{ij} = \langle \phi(x_i), \phi(x_j) \rangle = K(x_i, x_j)$$
(the kernel).
If we use algorithms that only depend on the Gram matrix $G$, then we never have to know (compute) the actual features $\phi(x)$.
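Applying the trick to the dual ridge solution gives kernel ridge regression. A sketch (numpy assumed; the Gaussian kernel, regularization value and toy data are illustrative):

```python
# Kernel ridge regression: swap G_ij = <x_i, x_j> for K_ij = k(x_i, x_j);
# nothing else in the dual solution changes.
import numpy as np

rng = np.random.default_rng(3)
X = np.linspace(-3, 3, 30)[:, None]
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=30)

def K_matrix(A, B, c=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)   # pairwise ||a - b||^2
    return np.exp(-d2 / c)                                 # Gaussian kernel

alpha = np.linalg.solve(K_matrix(X, X) + 0.1 * np.eye(30), y)

X_test = np.array([[0.5]])
pred = K_matrix(X_test, X) @ alpha      # f(x) = sum_i alpha_i k(x_i, x)
print(pred, np.sin(0.5))                # prediction close to the true value
```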
Gist of Kernel Methods
Choose a kernel function.
Through the choice of a kernel function we choose a Hilbert space.
We then apply the linear method in this new space, without increasing computational complexity, using the mathematical niceties of this space.
Kernels to Similarity
• Intuition of kernels as similarity measures:
$$k(x, x') = \tfrac{1}{2}\left( \|\phi(x)\|^2 + \|\phi(x')\|^2 - d(\phi(x), \phi(x'))^2 \right)$$
• When the diagonal entries of the kernel Gram matrix are constant, kernels are directly related to similarities. For example, the Gaussian kernel
$$K_G(x, x') = \exp\left( -\frac{\|x - x'\|^2}{2\sigma^2} \right).$$
– In general, it is useful to think of a kernel as a similarity measure.
Kernels to Distance
• Distance between two points $x_1$ and $x_2$ in feature space:
$$d(x_1, x_2) = \|\phi(x_1) - \phi(x_2)\| = \sqrt{ k(x_1, x_1) + k(x_2, x_2) - 2 k(x_1, x_2) }$$
• Distance between a point $x_1$ and (the centroid of) a set $S$ of $n$ points in feature space:
$$d(x_1, S)^2 = k(x_1, x_1) - \frac{2}{n} \sum_{i=1}^{n} k(x_1, x_i) + \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} k(x_i, x_j)$$
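Both formulas use only kernel evaluations, never explicit feature vectors. A brief sketch (numpy assumed; the Gaussian kernel and points are illustrative):

```python
# Distances in feature space computed entirely from kernel evaluations.
import numpy as np

k = lambda x, y: np.exp(-np.linalg.norm(x - y) ** 2)   # illustrative kernel

def dist_point_point(x1, x2):
    return np.sqrt(k(x1, x1) + k(x2, x2) - 2 * k(x1, x2))

def dist_point_set(x1, S):
    """Distance from phi(x1) to the centroid of {phi(x) for x in S}."""
    n = len(S)
    return np.sqrt(k(x1, x1)
                   - (2 / n) * sum(k(x1, xi) for xi in S)
                   + (1 / n ** 2) * sum(k(xi, xj) for xi in S for xj in S))

S = [np.array([0.0, 0.0]), np.array([1.0, 0.0]), np.array([0.0, 1.0])]
x = np.array([2.0, 2.0])
print(dist_point_point(x, S[0]), dist_point_set(x, S))
```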
Kernel methods: Heuristic View
Genome-wide data (gene, protein):
mRNA expression data
protein-protein interaction data
hydrophobicity data
sequence data
Similarity to Kernels
How can we make it positive semi-definite if it is not positive semi-definite?
$$K = \begin{pmatrix} 1 & 0.6 & 0.3 \\ 0.6 & 1 & 5.0 \\ 0.3 & 0.5 & 1 \end{pmatrix} \quad (3 \times 3)$$
From Similarity Scores to Kernels
Removal of negative eigenvalues: form the similarity matrix $S$, where the $(i,j)$-th entry of $S$ denotes the similarity between the $i$-th and $j$-th data points. $S$ is symmetric, but in general not positive semi-definite, i.e., $S$ has negative eigenvalues.
Write $S = U \Sigma U^T$, where $\Sigma = \operatorname{diag}(\lambda_1, \lambda_2, \dots, \lambda_n)$ and $\lambda_1 \ge \dots \ge \lambda_r > 0 > \lambda_{r+1} \ge \dots \ge \lambda_n$.
Then set
$$K = U \operatorname{diag}(\lambda_1, \dots, \lambda_r, 0, \dots, 0)\, U^T.$$
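The clipping step is a few lines of numpy (a sketch; the example matrix is an illustrative non-PSD similarity matrix, not one from the deck):

```python
# Eigenvalue clipping: decompose the symmetric similarity matrix S, zero out
# its negative eigenvalues, and rebuild a valid (PSD) kernel matrix K.
import numpy as np

def clip_to_psd(S):
    S = (S + S.T) / 2                       # symmetrize first if needed
    lam, U = np.linalg.eigh(S)              # S = U diag(lam) U^T
    return U @ np.diag(np.maximum(lam, 0)) @ U.T

S = np.array([[ 1.0, 0.9, -0.4],
              [ 0.9, 1.0,  0.8],
              [-0.4, 0.8,  1.0]])           # symmetric but not PSD
K = clip_to_psd(S)
print(np.linalg.eigvalsh(S))                # has a negative eigenvalue
print(np.linalg.eigvalsh(K))                # all >= 0 (up to rounding)
```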
From Similarity Scores to Kernels
(Table: a matrix of similarity scores $s_{ij}$ with rows $x_1, \dots, x_n$ and columns $t_1, \dots, t_n$.)
Kernels as Measures of Function Regularity
Problems of empirical risk minimization:
The expected risk functional is
$$R_{L,P}(g) = \int_{X \times Y} L(x, y, g(x)) \, dP(x, y) = \int_X \int_Y L(x, y, g(x)) \, dP(y \mid x) \, dP_X,$$
and the empirical risk functional is
$$R_{L,P_n}(g) = \frac{1}{n} \sum_{i=1}^{n} L(x_i, y_i, g(x_i)).$$
What Can We Do?
We can restrict the set of functions over which we minimize the empirical risk functional (structural risk minimization),
or modify the criterion to be minimized, e.g. adding a penalty for ‘complicated’ functions (regularization).
We can also combine the two.
Best approximation
• Assume $H$ is finite dimensional with basis $\{k_1, \dots, k_m\}$. The best approximation $\hat{f}$ of $f \in H$ in $\operatorname{span}\{k_1, \dots, k_m\}$ is
$$\hat{f} = a_1 k_1 + \dots + a_m k_m.$$
Orthogonality of $f - \hat{f}$ to the basis gives $m$ conditions ($i = 1, \dots, m$):
$$\langle k_i, f - (a_1 k_1 + \dots + a_m k_m) \rangle = 0,$$
i.e.
$$\langle k_i, f \rangle - a_1 \langle k_i, k_1 \rangle - \dots - a_m \langle k_i, k_m \rangle = 0.$$
RKHS approximation
When the $k_i = k(\cdot, x_i)$ are kernel sections, the reproducing property $\langle k_i, f \rangle = f(x_i) = y_i$ turns the $m$ conditions into
$$y_i - a_1 \langle k_i, k_1 \rangle - \dots - a_m \langle k_i, k_m \rangle = 0,$$
so we can estimate the parameters using
$$a = K^{-1} y.$$
In practice $K$ can be ill-conditioned, so we minimize the regularized criterion
$$\min_{f \in H} \left[ \sum_{i=1}^{m} (f(x_i) - y_i)^2 + \lambda \|f\|_H^2 \right],$$
which gives
$$a = (K + \lambda I)^{-1} y.$$
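A short numerical illustration (numpy assumed; the Gaussian Gram matrix on closely spaced points is an illustrative near-singular case) of why the regularized solve $a = (K + \lambda I)^{-1} y$ is preferred to $a = K^{-1} y$:

```python
# For closely spaced points a Gaussian Gram matrix K is nearly singular, so
# a = K^{-1} y is numerically fragile; adding lam*I keeps the solve stable.
import numpy as np

X = np.linspace(0, 1, 10)
K = np.exp(-(X[:, None] - X[None, :]) ** 2)     # close points -> near-singular K
y = np.sin(2 * np.pi * X)

print(np.linalg.cond(K))                        # huge condition number
print(np.linalg.cond(K + 1e-3 * np.eye(10)))    # far smaller after regularizing

a = np.linalg.solve(K + 1e-3 * np.eye(10), y)   # the a = (K + lam I)^{-1} y step
print(np.abs(K @ a - y).max())                  # modest residuals at the data
```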
Approximation vs estimation
(Diagram: the target space contains the true function; the hypothesis space contains the best possible estimate and the actual estimate.)
How to choose kernels?
• There is no absolute rule for choosing the right kernel, adapted to a particular problem.
• The kernel should capture the desired similarity.
– Kernels for vectors: polynomial and Gaussian kernels
– String kernel (text documents)
– Diffusion kernel (graphs)
– Sequence kernel (protein, DNA, RNA)
Kernel Selection
• Ideally, select the optimal kernel based on prior knowledge of the problem domain.
• In practice, consider a family of kernels defined in a way that again reflects our prior expectations.
• Simple way: require only a limited amount of additional information from the training data.
• Elaborate way: combine label information.
Future Development
Mathematics:
Generalization of Mercer's theorem for pseudo-metric spaces
Development of mathematical tools for multivariate regression
Statistics:
Application of kernels in multivariate data depth
Application of ideas of robust statistics
Application of these methods to circular data
They can be used to study nonlinear time series
• http://www.kernel-machines.org/ : papers, software, workshops, conferences, etc.
Acknowledgement
Jieping Ye
Department of Computer Science and Engineering
Arizona State University
http://www.public.asu.edu/~jye02