
Basics of Kernel Methods in Statistical Learning Theory

Mohammed Nasser

Professor, Department of Statistics

Rajshahi University

E-mail: [email protected]

Contents

Glimpses of Historical Development

Definition and Examples of Kernel

Some Mathematical Properties of Kernels

Construction of Kernels

Heuristic Presentation of Kernel Methods

Meaning of Kernels

Mercer Theorem and Its Latest Development

Direction of Future Development

Conclusion

 

Jerome H. Friedman Vladimir Vapnik

 

Computer Scientists’ Contribution to Statistics: Kernel Methods


Early History

In 1900 Karl Pearson published his famous article on goodness of fit, judged to be one of the twelve most important scientific articles of the twentieth century.

In 1902 Jacques Hadamard pointed out that mathematical models of physical phenomena should have the properties that

A solution exists

The solution is unique

The solution depends continuously on the data, in some reasonable topology

(a Well-Posed Problem)

Early History

In 1940 Fréchet, a PhD student of Hadamard, sharply criticized the mean and standard deviation as measures of location and scale respectively, but he expressed his belief in the further development of statistics without proposing any alternative.

During the sixties and seventies Tukey, Huber and Hampel tried to develop Robust Statistics in order to remove the ill-posedness of classical statistics.

Robustness means insensitivity to minor changes in both model and sample, high tolerance to major changes, and good performance at the assumed model.

The onslaught of Data Mining and the problems of non-linearity and non-vectorial data have made robust statistics somewhat less attractive.

Let us see what kernel methods (KM) present…

Recent History

Support Vector Machines (SVM), introduced at COLT-92 (Conference on Learning Theory), have been greatly developed since then.

Result: a class of algorithms for Pattern Recognition (Kernel Machines).

Now: a large and diverse community, from machine learning, optimization, statistics, neural networks, functional analysis, etc.

Centralized website: www.kernel-machines.org

First textbook (2000): see www.support-vector.net

Now (2012): at least twenty books of different tastes are available in the international market.

The book "The Elements of Statistical Learning" (2001) by Hastie, Tibshirani and Friedman went into a second edition within seven years.

More History

David Hilbert used the German word 'Kern' in his first paper on integral equations (Hilbert 1904).

The mathematical result underlying the kernel trick, Mercer's theorem, is almost a century old (Mercer 1909). It tells us that any 'reasonable' kernel function corresponds to some feature space.

Which kernels can be used to compute distances in feature spaces was worked out by Schoenberg (1938).

The methods for representing kernels in linear spaces were first studied by Kolmogorov (1941) for a countable input domain.

The method for representing kernels in linear spaces in the general case was developed by Aronszajn (1950).

Dunford and Schwartz (1963) showed that Mercer's theorem also holds true for general compact spaces.

More History

The use of Mercer's theorem for interpreting kernels as inner products in a feature space was introduced into machine learning by Aizerman, Braverman and Rozonoer (1964).

Berg, Christensen and Ressel (1984) published a good monograph on the theory of kernels.

Saitoh (1988) showed the connection between positivity (a 'positive matrix' as defined in Aronszajn (1950)) and the positive semi-definiteness of all finite-set kernel matrices.

Reproducing kernels were used extensively in machine learning and neural networks by Poggio and Girosi; see for example Poggio and Girosi (1990), a paper on radial basis function networks.

The theory of kernels was used in approximation and regularization theory, and the first chapter of Spline Models for Observational Data (Wahba 1990) gave a number of theoretical results on kernel functions.

What is the common characteristic (structure) among the following statistical methods?

1. Principal Components Analysis
2. (Ridge) Regression
3. Fisher Discriminant Analysis
4. Canonical Correlation Analysis
5. Singular Value Decomposition
6. Independent Component Analysis

Kernel methods: Heuristic View

We consider linear combinations of the input variables: f(x) = wᵀx.

We make use of the concepts of length and dot product available in Euclidean space.

Kernelized counterparts: KPCA, SVR, KFDA, KCCA, KICA.


• Linear learning typically has nice properties
– Unique optimal solutions, fast learning algorithms
– Better statistical analysis

• But one big problem
– Insufficient capacity

That means that on many data sets it fails to detect nonlinear relationships among the variables.

• The other demerit
– Can't handle non-vectorial data

Kernel methods: Heuristic View


Data

Vectors: collections of features, e.g. height, weight, blood pressure, age, …; categorical variables can also be mapped into vectors.

Matrices: images, movies, remote-sensing and satellite data (multispectral).

Strings: documents, gene sequences.

Structured objects: XML documents, graphs.

Kernel methods: Heuristic View

Genome-wide data (gene, protein): mRNA expression data, protein-protein interaction data, hydrophobicity data, sequence data.

Kernel methods: Heuristic View

[Figure: mapping from the Original Space to the Feature Space]

Definition of Kernels

Definition: A finitely positive semi-definite function is a symmetric function of its arguments for which the matrices formed by restriction to any finite subset of points are positive semi-definite.

k : X × X → ℝ

αᵀKα ≥ 0 for every vector α

It is a generalized dot product

It is not generally bilinear

But it obeys the Cauchy–Schwarz inequality
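To make the definition concrete, here is a minimal numpy sketch (my own illustration, not part of the slides): it restricts a candidate kernel to one finite subset of points, forms K = [k(xᵢ, xⱼ)], and checks symmetry and positive semi-definiteness through the eigenvalues. The Gaussian kernel, the random points and the tolerance are assumptions made only for the example.

```python
import numpy as np

def gaussian_kernel(x, y, c=1.0):
    """Candidate kernel k(x, y) = exp(-||x - y||^2 / c)."""
    return np.exp(-np.sum((x - y) ** 2) / c)

def is_finitely_psd(k, points, tol=1e-10):
    """Check the definition on one finite subset: form K = [k(x_i, x_j)]
    and test symmetry and non-negativity of the eigenvalues (alpha^T K alpha >= 0)."""
    n = len(points)
    K = np.array([[k(points[i], points[j]) for j in range(n)] for i in range(n)])
    symmetric = np.allclose(K, K.T)
    eigvals = np.linalg.eigvalsh(K)        # real eigenvalues of the symmetric matrix K
    return symmetric and np.all(eigvals >= -tol)

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))                # a finite subset of 8 points in R^3
print(is_finitely_psd(gaussian_kernel, X)) # True for the Gaussian kernel
```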


Kernel Methods: Basic Ideas

Proper Kernel

k(x, y) = ⟨Φ(x), Φ(y)⟩ is always a kernel. When is the converse true?

Theorem (Aronszajn, 1950): A function k : X × X → ℝ can be written as k(x, y) = ⟨Φ(x), Φ(y)⟩, where Φ : x ↦ Φ(x) ∈ F is a feature map, if and only if k(x, y) satisfies the finitely positive semi-definiteness property.

We can now check whether k(x, y) is a proper kernel using only properties of k(x, y) itself, i.e. without the need to know the feature map Φ! If the map is needed we may take the help of MERCER'S THEOREM.


Kernel methods consist of two modules:

1) The choice of kernel (this is non-trivial)
2) The algorithm which takes kernels as input

Modularity: Any kernel can be used with any kernel-algorithm.

Some kernels:

k(x, y) = exp(−‖x − y‖² / c)

k(x, y) = ⟨x, y⟩ᵈ

k(x, y) = tanh(θ⟨x, y⟩)

k(x, y) = 1 / (‖x − y‖ + c)

Some kernel algorithms:
– support vector machine
– Fisher discriminant analysis
– kernel regression
– kernel PCA
– kernel CCA
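The kernels listed above can be written down directly; the following numpy sketch is one possible implementation for vector inputs, with parameter names (c, d, theta) chosen only to mirror the formulas. Because of the modularity just mentioned, any of these functions could be plugged into any of the kernel algorithms.

```python
import numpy as np

def gaussian(x, y, c=1.0):
    # k(x, y) = exp(-||x - y||^2 / c)
    return np.exp(-np.linalg.norm(x - y) ** 2 / c)

def polynomial(x, y, d=2):
    # k(x, y) = <x, y>^d
    return np.dot(x, y) ** d

def sigmoid(x, y, theta=1.0):
    # k(x, y) = tanh(theta * <x, y>)  (not positive semi-definite for every theta)
    return np.tanh(theta * np.dot(x, y))

def inverse_distance(x, y, c=1.0):
    # k(x, y) = 1 / (||x - y|| + c)
    return 1.0 / (np.linalg.norm(x - y) + c)

x, y = np.array([1.0, 2.0]), np.array([0.5, -1.0])
for k in (gaussian, polynomial, sigmoid, inverse_distance):
    print(k.__name__, k(x, y))
```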


Kernel Construction

The set of kernels forms a closed convex cone.
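The cone property can be illustrated numerically (a sketch, not a proof): for any two kernels k1, k2 and any non-negative weights a, b, the Gram matrix of a·k1 + b·k2 remains positive semi-definite. The particular kernels, weights and data below are arbitrary choices for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 4))

def gram(kernel, X):
    """Gram matrix K_ij = kernel(x_i, x_j)."""
    return np.array([[kernel(a, b) for b in X] for a in X])

k1 = lambda x, y: np.exp(-np.sum((x - y) ** 2))   # Gaussian kernel
k2 = lambda x, y: (np.dot(x, y) + 1.0) ** 2       # polynomial kernel
min_eig = lambda K: np.linalg.eigvalsh(K).min()

K1, K2 = gram(k1, X), gram(k2, X)
a, b = 0.7, 2.5                                   # any non-negative weights
print(min_eig(K1) >= -1e-10)                      # k1 gives a PSD matrix
print(min_eig(K2) >= -1e-10)                      # k2 gives a PSD matrix
print(min_eig(a * K1 + b * K2) >= -1e-10)         # a*k1 + b*k2 stays PSD (cone property)
```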


Reproducing Kernel Hilbert Space

• Reproducing kernel Hilbert space (RKHS)

Let X be a set. A Hilbert space H consisting of functions on X is called a reproducing kernel Hilbert space (RKHS) if the evaluation functional

e_x : H → ℝ,  f ↦ f(x)

is continuous for each x ∈ X.

– A Hilbert space H consisting of functions on X is a RKHS if and only if there exists k(·, x) ∈ H (a reproducing kernel) such that

⟨k(·, x), f⟩_H = f(x)  for all f ∈ H, x ∈ X

(by Riesz's lemma).


Reproducing Kernel Hilbert Space II

Theorem (construction of RKHS): If k : X × X → ℝ is positive definite, there uniquely exists a RKHS H_k on X such that

(1) k(·, x) ∈ H_k for all x ∈ X,

(2) the linear hull of {k(·, x) | x ∈ X} is dense in H_k,

(3) k(·, x) is a reproducing kernel of H_k, i.e., ⟨k(·, x), f⟩_{H_k} = f(x) for all f ∈ H_k, x ∈ X.

At this moment we put no structure on X. To obtain better properties of the members f of H_k we have to put extra structure on X and assume additional properties of k.
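To make points (2) and (3) concrete, here is a small numerical sketch (my own, with an arbitrary Gaussian kernel and random points): functions of the form f = Σᵢ αᵢ k(·, xᵢ) are stored through their coefficients, inner products are computed through the Gram matrix, and the reproducing property ⟨k(·, x), f⟩ = f(x) is verified at one point.

```python
import numpy as np

def k(x, y, c=1.0):
    # a positive definite kernel on X = R
    return np.exp(-(x - y) ** 2 / c)

rng = np.random.default_rng(2)
centers = rng.normal(size=5)          # points x_1, ..., x_m in X
alpha = rng.normal(size=5)            # coefficients of f = sum_i alpha_i k(., x_i)

def f(x):
    """Evaluate f(x) = sum_i alpha_i k(x_i, x)."""
    return sum(a * k(c_, x) for a, c_ in zip(alpha, centers))

def inner(alpha1, pts1, alpha2, pts2):
    """<f, g> in H_k for f = sum_i alpha1_i k(., pts1_i), g = sum_j alpha2_j k(., pts2_j)."""
    G = np.array([[k(p, q) for q in pts2] for p in pts1])
    return alpha1 @ G @ alpha2

x0 = 0.3
lhs = inner(np.array([1.0]), np.array([x0]), alpha, centers)  # <k(., x0), f>
print(np.isclose(lhs, f(x0)))         # True: the reproducing property holds
```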


Classification

Y = g(X), with g : X → Y.

X can be anything:
• continuous (ℝ, ℝᵈ, …)
• discrete ({0,1}, {1,…,k}, …)
• structured (tree, string, …)
• …

Y is discrete:
– {0,1}: binary
– {1,…,k}: multi-class
– tree, etc.: structured


Classification

X can be anything:
• continuous (ℝ, ℝᵈ, …)
• discrete ({0,1}, {1,…,k}, …)
• structured (tree, string, …)
• …

Algorithms: Perceptron, Logistic Regression, Support Vector Machine, Decision Tree, Random Forest.

Kernel trick


Regression

Y = g(X), with g : X → Y.

X can be anything:
• continuous (ℝ, ℝᵈ, …)
• discrete ({0,1}, {1,…,k}, …)
• structured (tree, string, …)
• …

Y is continuous (not always):
– ℝ, ℝᵈ


Regression

X can be anything:
• continuous (ℝ, ℝᵈ, …)
• discrete ({0,1}, {1,…,k}, …)
• structured (tree, string, …)
• …

Algorithms: Perceptron, Normal Regression, Support Vector Regression, GLM.

Kernel trick


Kernel Methods: Heuristic View

Steps for Kernel Methods

Data matrix → Kernel matrix K = [k(xᵢ, xⱼ)], a positive semi-definite matrix → Algorithm → Pattern function f(x) = Σᵢ αᵢ k(xᵢ, x)

What K? Traditional or non-traditional? Why p.s.d.?
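To walk through these steps end to end, here is a sketch of one possible algorithm that takes only the kernel matrix as input, a kernel perceptron: the data matrix is turned into K = [k(xᵢ, xⱼ)], the algorithm works only through K, and it returns the coefficients αᵢ of the pattern function f(x) = Σᵢ αᵢ k(xᵢ, x). The Gaussian kernel, the toy data and the number of passes are illustrative assumptions, not something prescribed by the slides.

```python
import numpy as np

def gaussian_K(X, Z, c=1.0):
    """Kernel matrix K_ij = exp(-||x_i - z_j||^2 / c)."""
    sq = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / c)

def kernel_perceptron(K, y, passes=20):
    """Perceptron in feature space, expressed only through the kernel matrix K."""
    alpha = np.zeros(len(y))
    for _ in range(passes):
        for i in range(len(y)):
            if np.sign(alpha @ K[:, i]) != y[i]:   # mistake on example i
                alpha[i] += y[i]                   # update its dual coefficient
    return alpha

rng = np.random.default_rng(5)
X = rng.normal(size=(60, 2))
y = np.where(np.linalg.norm(X, axis=1) > 1.0, 1, -1)    # a non-linear (circular) concept

K = gaussian_K(X, X)                     # step 1: data matrix -> kernel matrix (p.s.d.)
alpha = kernel_perceptron(K, y)          # step 2: an algorithm that uses only K
f = lambda x: alpha @ gaussian_K(X, x[None, :])[:, 0]   # step 3: pattern function
print(np.mean(np.sign([f(x) for x in X]) == y))         # training accuracy
```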


Kernel methods: Heuristic View

[Figure: mapping from the Original Space to the Feature Space]

Kernel Methods: Basic Ideas

The kernel methods approach is to stick with linear functions but work in a high-dimensional feature space.

The expectation is that the feature space has a much higher dimension than the input space. The feature space carries an inner product, with

k(xᵢ, xⱼ) = ⟨Φ(xᵢ), Φ(xⱼ)⟩

Kernel methods: Heuristic View

Form of functions

• So kernel methods use linear functions in a feature space: f(x) = ⟨w, Φ(x)⟩.

• For regression this could be the function itself.

• For classification we require thresholding, e.g. sign(f(x)).

Kernel methods: Heuristic View

Feature spaces

Φ : x ↦ Φ(x),  ℝᵈ → F

A non-linear mapping to F, which may be
1. a high-dimensional space,
2. an infinite-dimensional countable space: ℓ₂,
3. a function space (Hilbert space).

Example: Φ(x, y) = (x², y², √2·xy)

Kernel methods: Heuristic View

Example

• Consider the mapping Φ(x) = (x₁², x₂², √2·x₁x₂).

• Let us consider a linear equation in this feature space:

a·x₁² + b·x₂² + c·x₁x₂ + d = 0

• We actually have an ellipse – i.e. a non-linear shape in the input space.
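A quick numerical check (a sketch of my own) that the explicit map Φ(x) = (x₁², x₂², √2·x₁x₂) turns the squared dot product of the inputs into an ordinary dot product in the feature space, so that a linear decision function there is quadratic in the original coordinates. The particular vectors and weights are arbitrary.

```python
import numpy as np

def phi(x):
    # explicit feature map into R^3
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

x = np.array([1.5, -0.7])
z = np.array([0.4, 2.0])

# inner product in feature space equals the squared inner product in input space
print(np.isclose(np.dot(phi(x), phi(z)), np.dot(x, z) ** 2))   # True

# a linear equation <w, phi(x)> + b = 0 in feature space is a conic in the input space
w, b = np.array([1.0, 2.0, -0.5]), -3.0
print(np.dot(w, phi(x)) + b)   # evaluating the quadratic boundary at x
```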

Kernel methods: Heuristic View

Ridge Regression (duality)

Problem:  min_w Σᵢ (yᵢ − wᵀxᵢ)² + λ‖w‖²   (yᵢ: target, xᵢ: input, λ‖w‖²: regularization)

Solution:

w = (XᵀX + λI_d)⁻¹ Xᵀy   (a d×d inverse)
  = Xᵀ (XXᵀ + λI_n)⁻¹ y   (an n×n inverse)
  = Xᵀ (G + λI_n)⁻¹ y,   where G_ij = ⟨xᵢ, xⱼ⟩ is the matrix of inner products of the observations.

Dual representation: w is a linear combination of the data, w = Σᵢ αᵢ xᵢ with α = (G + λI)⁻¹ y, so

f(x) = wᵀx = Σᵢ αᵢ ⟨xᵢ, x⟩
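The identity can be checked directly; the numpy sketch below (with made-up data) computes the primal solution w = (XᵀX + λI)⁻¹Xᵀy and the dual solution through α = (G + λI)⁻¹y with G = XXᵀ, and verifies that both give the same prediction f(x) = wᵀx = Σᵢ αᵢ⟨xᵢ, x⟩.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, lam = 50, 5, 0.1
X = rng.normal(size=(n, d))                 # rows are the observations x_i
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

# primal solution: a d x d inverse
w = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# dual solution: an n x n inverse, with G_ij = <x_i, x_j>
G = X @ X.T
alpha = np.linalg.solve(G + lam * np.eye(n), y)

x_new = rng.normal(size=d)
f_primal = w @ x_new                        # f(x) = w^T x
f_dual = alpha @ (X @ x_new)                # f(x) = sum_i alpha_i <x_i, x>
print(np.isclose(f_primal, f_dual))         # True: the two representations agree
```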


Kernel methods: Heuristic View

Kernel trick

Note: in the dual representation we used the Gram matrix to express the solution.

Kernel Trick: replace x → Φ(x), so that

G_ij = ⟨xᵢ, xⱼ⟩   →   G_ij = ⟨Φ(xᵢ), Φ(xⱼ)⟩ = k(xᵢ, xⱼ) = K_ij   (the kernel matrix)

If we use algorithms that only depend on the Gram matrix G, then we never have to know (compute) the actual features Φ(x).
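Replacing the inner products ⟨xᵢ, xⱼ⟩ in the dual ridge solution by kernel values k(xᵢ, xⱼ) gives kernel ridge regression. Below is a minimal sketch on a one-dimensional toy problem; the Gaussian kernel, its bandwidth and λ are arbitrary choices for the illustration, and the features Φ(x) are never computed.

```python
import numpy as np

def k(x, z, c=0.5):
    # Gaussian kernel, applied elementwise
    return np.exp(-(x - z) ** 2 / c)

rng = np.random.default_rng(4)
x_train = np.sort(rng.uniform(-3, 3, size=40))
y_train = np.sin(x_train) + 0.1 * rng.normal(size=40)

lam = 0.1
K = k(x_train[:, None], x_train[None, :])          # kernel matrix K_ij = k(x_i, x_j)
alpha = np.linalg.solve(K + lam * np.eye(len(x_train)), y_train)

def f(x):
    """Pattern function f(x) = sum_i alpha_i k(x_i, x):
    linear in feature space, non-linear in the input space."""
    return alpha @ k(x_train, x)

print(f(0.0), np.sin(0.0))    # the prediction at 0 is close to the true value
```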


Gist of Kernel methods

Choice of a Kernel Function.

Through choice of a kernel function we choose a Hilbert space.

We then apply the linear method in this new space, without increasing the computational complexity, using the mathematical niceties of this space.


Kernels to Similarity

• Intuition of kernels as similarity measures:

k(x, x′) = ( ‖Φ(x)‖² + ‖Φ(x′)‖² − d(Φ(x), Φ(x′))² ) / 2

• When the diagonal entries of the kernel Gram matrix are constant, kernels are directly related to similarities.
– For example the Gaussian kernel: K_G(x, x′) = exp( −‖x − x′‖² / (2σ²) )
– In general, it is useful to think of a kernel as a similarity measure.


Kernels to Distance

• Distance between two points x₁ and x₂ in feature space:

d(x₁, x₂) = ‖Φ(x₁) − Φ(x₂)‖ = √( k(x₁, x₁) + k(x₂, x₂) − 2 k(x₁, x₂) )

• Distance between a point x₁ and a set S = {x₁, …, x_n} in feature space (the distance to the mean of the mapped set):

d²(x₁, S) = k(x₁, x₁) − (2/n) Σᵢ k(x₁, xᵢ) + (1/n²) Σᵢ Σⱼ k(xᵢ, xⱼ)


Kernel methods: Heuristic View

Genome-wide data (gene, protein): mRNA expression data, protein-protein interaction data, hydrophobicity data, sequence data.

Similarity to Kernels

How can we make a similarity matrix positive semi-definite if it is not?

    1     0.6   0.3
    0.6   1     5.0
    0.3   0.5   1

From Similarity Scores to Kernels

Removal of negative eigenvalues: form the similarity matrix S, where the (i, j)-th entry of S denotes the similarity between the i-th and j-th data points. S is symmetric, but in general it is not positive semi-definite, i.e., S has negative eigenvalues.

Write the eigendecomposition S = U Σ Uᵀ, where Σ = diag(λ₁, λ₂, …, λ_n) with λ₁ ≥ … ≥ λ_r ≥ 0 > λ_{r+1} ≥ … ≥ λ_n.

Then take K = U Σ̃ Uᵀ, where Σ̃ = diag(λ₁, …, λ_r, 0, …, 0).

From Similarity Scores to Kernels

[Table: similarity scores s_ij between the data points x₁, …, x_n and the objects t₁, …, t_n]

Problems of empirical risk minimization

Kernels as Measures of Function Regularity

Risk functional:

R_{L,P}(g) = ∫_{X×Y} L(x, y, g(x)) dP(x, y) = ∫_X ∫_Y L(x, y, g(x)) dP(y|x) dP_X

Empirical risk functional:

R_n(g) = (1/n) Σᵢ₌₁ⁿ L(xᵢ, yᵢ, g(xᵢ))


What Can We Do?

We can restrict the set of functions over which we minimize the empirical risk functional (Structural Risk Minimization), or modify the criterion to be minimized, e.g. by adding a penalty for 'complicated' functions (Regularization). We can also combine the two.


Best Approximation


Best approximation

• Assume H₀ = span{k₁, …, k_m} is finite dimensional with basis {k₁, …, k_m}, and let f̂ ∈ H₀ be the best approximation to f ∈ H, i.e., f̂ = a₁k₁ + … + a_m k_m.

• Orthogonality of f − f̂ to each basis element gives m conditions (i = 1, …, m):

⟨kᵢ, f − (a₁k₁ + … + a_m k_m)⟩ = 0,

i.e., ⟨kᵢ, f⟩ − a₁⟨kᵢ, k₁⟩ − … − a_m⟨kᵢ, k_m⟩ = 0.


RKHS approximation

In a RKHS with kᵢ = k(·, xᵢ), the reproducing property gives ⟨kᵢ, f⟩ = f(xᵢ), so the m conditions become

yᵢ − a₁⟨kᵢ, k₁⟩ − … − a_m⟨kᵢ, k_m⟩ = 0,   i = 1, …, m.

We can then estimate the parameters using a = K⁻¹y.

In practice K can be ill-conditioned, so we minimise

min_{f ∈ H} Σᵢ ( f(xᵢ) − yᵢ )² + λ‖f‖²_H ,

which gives a = (K + λI)⁻¹ y.
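A small numerical illustration (toy data and a Gaussian kernel chosen by me) of why the regularised solution is preferred: when two inputs nearly coincide, K is almost singular, the plain solution a = K⁻¹y has huge coefficients, while a = (K + λI)⁻¹y stays stable.

```python
import numpy as np

rng = np.random.default_rng(7)
x = np.sort(rng.uniform(0, 1, size=30))
x[1] = x[0] + 1e-6                     # two almost identical inputs -> K nearly singular
y = np.sin(2 * np.pi * x) + 0.05 * rng.normal(size=30)

K = np.exp(-(x[:, None] - x[None, :]) ** 2 / 0.1)    # Gaussian kernel matrix
print(np.linalg.cond(K))               # very large condition number: ill-conditioned

a_plain = np.linalg.solve(K, y)                      # a = K^{-1} y
a_reg = np.linalg.solve(K + 1e-3 * np.eye(30), y)    # a = (K + lambda I)^{-1} y

print(np.abs(a_plain).max())           # huge, unstable coefficients
print(np.abs(a_reg).max())             # moderate, stable coefficients
```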

Approximation vs estimation

[Figure: the target space and the hypothesis space, showing the true function, the best possible estimate within the hypothesis space, and the actual estimate]

How to choose kernels?

• There is no absolute rule for choosing the right kernel, adapted to a particular problem.

• The kernel should capture the desired similarity.
– Kernels for vectors: polynomial and Gaussian kernels
– String kernel (text documents)
– Diffusion kernel (graphs)
– Sequence kernel (protein, DNA, RNA)


Kernel Selection

• Ideally, we would select the optimal kernel based on our prior knowledge of the problem domain.

• In practice, we consider a family of kernels defined in a way that again reflects our prior expectations.

• Simple way: require only a limited amount of additional information from the training data.

• Elaborate way: combine label information.


Future Development

Mathematics:

Generalization of the Mercer Theorem to pseudo-metric spaces

Development of mathematical tools for multivariate regression

Statistics:

Application of kernels in multivariate data depth

Application of ideas of robust statistics

Application of these methods in circular data

They can be used to study nonlinear time series.

• http://www.kernel-machines.org/
– Papers, software, workshops, conferences, etc.

Acknowledgement

Jieping Ye

Department of Computer Science and Engineering

Arizona State University
http://www.public.asu.edu/~jye02


Thank You
