Data Mining and Statistical Learning - 2008
Kernel methods - overview
Kernel smoothers
Local regression
Kernel density estimation
Radial basis functions
Introduction
Kernel methods are regression techniques used to estimate a response function
from noisy data
Properties:
• Different models are fitted at each query point, and only those observations close to that point are used to fit the model
• The resulting function is smooth
• The models require only a minimum of training
$$y = f(X) + \varepsilon, \qquad X \in \mathbb{R}^d$$
A simple one-dimensional kernel smoother
$$\hat{f}(x_0) = \frac{\sum_{i=1}^{N} K(x_0, x_i)\, y_i}{\sum_{i=1}^{N} K(x_0, x_i)}$$

where

$$K(x_0, x) = \begin{cases} 1, & \text{if } |x - x_0| \le \lambda \\ 0, & \text{otherwise} \end{cases}$$
[Figure: observed values (y roughly between 4.9 and 6) and the fitted kernel-smoothed curve plotted against x from 0 to 25; legend: Observed, Fitted]
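A minimal NumPy sketch of the metric-window smoother above; the toy data and the window width λ = 1 are illustrative, not taken from the slide:

```python
import numpy as np

def box_kernel_smoother(x, y, x0, lam):
    """Average of the y_i whose x_i lie within lam of the query point x0."""
    weights = (np.abs(x - x0) <= lam).astype(float)  # K(x0, x_i): 1 inside the window, 0 outside
    if weights.sum() == 0:
        return np.nan                                # no observations fall in the window
    return np.sum(weights * y) / np.sum(weights)

# toy data: noisy observations of a smooth trend
rng = np.random.default_rng(0)
x = np.linspace(0, 25, 100)
y = 5.5 + 0.3 * np.sin(x / 4) + rng.normal(0, 0.1, size=x.size)

fitted = np.array([box_kernel_smoother(x, y, x0, lam=1.0) for x0 in x])
```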
Kernel methods, splines and ordinary least squares regression (OLS)
• OLS: A single model is fitted to all data
• Splines: Different models are fitted to different subintervals (cuboids) of the input domain
• Kernel methods: Different models are fitted at each query point
Kernel-weighted averages and moving averages
The Nadaraya-Watson kernel-weighted average

$$\hat{f}(x_0) = \frac{\sum_{i=1}^{N} K_\lambda(x_0, x_i)\, y_i}{\sum_{i=1}^{N} K_\lambda(x_0, x_i)}, \qquad K_\lambda(x_0, x) = D\!\left(\frac{|x - x_0|}{\lambda}\right)$$

where λ indicates the window size and the function D shows how the weights change with distance within this window.

The estimated function is smooth!

K-nearest neighbours

$$\hat{f}(x) = \mathrm{Ave}\left(y_i \mid x_i \in N_k(x)\right)$$

The estimated function is piecewise constant!
Examples of one-dimensional kernel smoothers
• Epanechnikov kernel

$$D(t) = \begin{cases} \tfrac{3}{4}\,(1 - t^2), & \text{if } |t| \le 1 \\ 0, & \text{otherwise} \end{cases}$$

• Tri-cube kernel

$$D(t) = \begin{cases} \left(1 - |t|^3\right)^3, & \text{if } |t| \le 1 \\ 0, & \text{otherwise} \end{cases}$$
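These two kernels, and the Nadaraya-Watson average built from them, are short to write down; a minimal NumPy sketch (the function names are my own):

```python
import numpy as np

def epanechnikov(t):
    """D(t) = 3/4 (1 - t^2) for |t| <= 1, else 0."""
    return np.where(np.abs(t) <= 1, 0.75 * (1 - t**2), 0.0)

def tricube(t):
    """D(t) = (1 - |t|^3)^3 for |t| <= 1, else 0."""
    return np.where(np.abs(t) <= 1, (1 - np.abs(t)**3)**3, 0.0)

def nadaraya_watson(x, y, x0, lam, D=epanechnikov):
    """Kernel-weighted average at the query point x0 with window width lam."""
    w = D(np.abs(x - x0) / lam)      # K_lambda(x0, x_i) = D(|x_i - x0| / lam)
    return np.sum(w * y) / np.sum(w)  # undefined (nan) if no points fall in the window
```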
Issues in kernel smoothing
• The smoothing parameter λ has to be defined
• When there are ties at x_i: compute an average y value and introduce weights representing the number of points
• Boundary issues
• Varying density of observations:
– the bias is constant
– the variance is inversely proportional to the density
Boundary effects of one-dimensional kernel smoothers
Locally-weighted averages can be badly biased at the boundaries if the response function has a significant slope; the remedy is to apply local linear regression.
Local linear regression
Find the intercept and slope parameters α(x₀) and β(x₀) by solving

$$\min_{\alpha(x_0),\,\beta(x_0)} \sum_{i=1}^{N} K_\lambda(x_0, x_i)\,\left[y_i - \alpha(x_0) - \beta(x_0)\,x_i\right]^2$$

The solution is a linear combination of the y_i:

$$\hat{f}(x_0) = \sum_{i=1}^{N} l_i(x_0)\, y_i$$
Kernel smoothing vs local linear regression
Kernel smoothing: solve the minimization problem

$$\min_{\alpha(x_0)} \sum_{i=1}^{N} K_\lambda(x_0, x_i)\,\left[y_i - \alpha(x_0)\right]^2$$

Local linear regression: solve the minimization problem

$$\min_{\alpha(x_0),\,\beta(x_0)} \sum_{i=1}^{N} K_\lambda(x_0, x_i)\,\left[y_i - \alpha(x_0) - \beta(x_0)\,x_i\right]^2$$
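To make the contrast concrete, here is a minimal NumPy sketch of both minimizations at a single query point; the Epanechnikov kernel and the function names are illustrative choices:

```python
import numpy as np

def epanechnikov(t):
    return np.where(np.abs(t) <= 1, 0.75 * (1 - t**2), 0.0)

def kernel_smooth_at(x, y, x0, lam):
    """Constant fit: the minimizer alpha(x0) is the kernel-weighted mean of y."""
    w = epanechnikov(np.abs(x - x0) / lam)
    return np.sum(w * y) / np.sum(w)

def local_linear_at(x, y, x0, lam):
    """Linear fit: minimize sum_i K(x0, x_i) [y_i - alpha - beta x_i]^2."""
    w = epanechnikov(np.abs(x - x0) / lam)
    B = np.column_stack([np.ones_like(x), x])   # design matrix (intercept, slope)
    sw = np.sqrt(w)                             # weighted least squares via sqrt-weights
    (alpha, beta), *_ = np.linalg.lstsq(B * sw[:, None], y * sw, rcond=None)
    return alpha + beta * x0                    # evaluate the local line at x0
```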
Properties of local linear regression
• Automatically modifies the kernel weights to correct for bias
• Bias depends only on the terms of order higher than one in the expansion of f.
Local polynomial regression
• Fitting polynomials instead of straight lines
Behavior of the estimated response function
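A minimal sketch of a degree-d local polynomial fit at one query point (NumPy; the tri-cube kernel and the function name are illustrative):

```python
import numpy as np

def local_poly_fit(x, y, x0, lam, degree=2):
    """Weighted least-squares fit of a degree-`degree` polynomial around x0."""
    t = np.abs(x - x0) / lam
    w = np.where(t <= 1, (1 - t**3)**3, 0.0)             # tri-cube kernel weights
    B = np.vander(x - x0, degree + 1, increasing=True)   # columns 1, (x-x0), (x-x0)^2, ...
    sw = np.sqrt(w)
    coef, *_ = np.linalg.lstsq(B * sw[:, None], y * sw, rcond=None)
    return coef[0]   # centered at x0, the fitted polynomial at x0 is its intercept
```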
Polynomial vs local linear regression
Advantages:
• Reduces the ”Trimming of hills and filling of valleys”
Disadvantages:
• Higher variance (tails are more wiggly)
Selecting the width of the kernel
Bias-Variance tradeoff:
Selecting a narrow window leads to high variance and low bias, whilst selecting a wide window leads to high bias and low variance.
Selecting the width of the kernel
1. Automatic selection (cross-validation)
2. Fixing the degrees of freedom

$$\hat{\mathbf{f}} = \mathbf{S}_\lambda \mathbf{y}, \qquad S_{ij} = l_j(x_i)$$

$$df = \mathrm{trace}(\mathbf{S}_\lambda)$$
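A sketch of how the smoother matrix and its trace can be computed for local linear regression (NumPy; the Epanechnikov kernel and helper names are illustrative):

```python
import numpy as np

def epanechnikov(t):
    return np.where(np.abs(t) <= 1, 0.75 * (1 - t**2), 0.0)

def smoother_matrix(x, lam):
    """S_lambda for local linear regression: row i holds the weights l_j(x_i)."""
    B = np.column_stack([np.ones_like(x), x])
    S = np.zeros((x.size, x.size))
    for i, x0 in enumerate(x):
        w = epanechnikov(np.abs(x - x0) / lam)
        WB = B * w[:, None]                              # W B, with W = diag(w)
        # l(x0)^T = b(x0)^T (B^T W B)^{-1} B^T W  (assumes enough points in every window)
        S[i] = np.array([1.0, x0]) @ np.linalg.solve(B.T @ WB, WB.T)
    return S

# effective degrees of freedom: adjust lam until trace(S_lambda) matches the desired df
# df = np.trace(smoother_matrix(x, lam=1.0))
```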
Local regression in R^p
The one-dimensional approach is easily extended to p dimensions by
• Using the Euclidean norm as a measure of distance in the kernel
• Modifying the polynomial, e.g. in two dimensions

$$b(X) = \left(1,\ X_1,\ X_2,\ X_1^2,\ X_2^2,\ X_1 X_2\right)$$
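A sketch of local regression in R^2 along these lines, using the Euclidean norm inside a tri-cube kernel and the quadratic basis b(X) above (NumPy; the function names are illustrative):

```python
import numpy as np

def quadratic_basis(X):
    """b(X) = (1, X1, X2, X1^2, X2^2, X1*X2) for rows of a two-column input matrix."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([np.ones(len(X)), x1, x2, x1**2, x2**2, x1 * x2])

def local_quadratic_fit(X, y, x0, lam):
    """Local quadratic regression at x0 in R^2; Euclidean distance inside a tri-cube kernel."""
    t = np.linalg.norm(X - x0, axis=1) / lam          # ||x_i - x0|| / lambda
    w = np.where(t <= 1, (1 - t**3)**3, 0.0)          # tri-cube weights
    B = quadratic_basis(X)
    sw = np.sqrt(w)
    coef, *_ = np.linalg.lstsq(B * sw[:, None], y * sw, rcond=None)
    return (quadratic_basis(x0.reshape(1, -1)) @ coef)[0]
```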
Local regression in R^p
”The curse of dimensionality”
• The fraction of points close to the boundary of the input domain increases with its dimension
• Observed data do not cover the whole input domain
Structured local regression models
Structured kernels (standardize each variable)

$$K_{\lambda, A}(x_0, x) = D\!\left(\frac{(x - x_0)^T A\,(x - x_0)}{\lambda}\right)$$

Note: A is positive semidefinite
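A small sketch of such a structured kernel; taking A as the diagonal matrix of inverse variances standardizes each variable (NumPy; the tri-cube profile D is my own choice):

```python
import numpy as np

def structured_kernel(X, x0, A, lam):
    """K_{lambda,A}(x0, x_i) = D((x_i - x0)^T A (x_i - x0) / lambda), tri-cube profile D."""
    diff = X - x0
    t = np.einsum('ij,jk,ik->i', diff, A, diff) / lam   # quadratic form, one value per row
    return np.where(t <= 1, (1 - t**3)**3, 0.0)

# standardizing each variable corresponds to a diagonal (positive semidefinite) A:
# A = np.diag(1.0 / X.var(axis=0))
```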
Structured local regression models
Structured regression functions
• ANOVA decompositions (e.g., additive models)
Backfitting algorithms can be used
• Varying coefficient models (partition X)
$$f(X) = \alpha(Z) + \beta_1(Z)\,X_1 + \cdots + \beta_q(Z)\,X_q$$

where X is partitioned into (X₁, …, X_q) and the remaining variables Z (formula 6.17 in Hastie et al.)
Structured local regression models
Varying coefficient models (example)
Local methods
• Assumption: the model is locally linear -> maximize the log-likelihood locally at x₀:

$$l(\beta(x_0)) = \sum_{i=1}^{N} K_\lambda(x_0, x_i)\, l\!\left(y_i,\ x_i^T \beta(x_0)\right)$$

• Autoregressive time series: y_t = β₀ + β₁ y_{t-1} + … + β_k y_{t-k} + e_t, i.e. y_t = z_tᵀβ + e_t with z_t = (1, y_{t-1}, …, y_{t-k}). Fit by local least squares with kernel K(z₀, z_t).
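A minimal sketch of the autoregressive case; local least squares is from the slide, while the Gaussian kernel in lag space, the function name and the bandwidth are my own assumptions:

```python
import numpy as np

def local_ar_forecast(y, k, z0, lam):
    """Fit y_t = z_t^T beta locally around the lag vector z0 and return the prediction.

    z_t = (1, y_{t-1}, ..., y_{t-k}); observations are weighted by a Gaussian
    kernel in lag space, K(z0, z_t) = exp(-||z_t - z0||^2 / (2 lam^2)).
    """
    n = len(y)
    Z = np.column_stack([np.ones(n - k)] + [y[k - j - 1:n - j - 1] for j in range(k)])
    target = y[k:]
    w = np.exp(-np.sum((Z - z0) ** 2, axis=1) / (2 * lam**2))
    sw = np.sqrt(w)
    beta, *_ = np.linalg.lstsq(Z * sw[:, None], target * sw, rcond=None)
    return z0 @ beta

# usage: to forecast the next value, take z0 = np.array([1, y[-1], ..., y[-k]])
```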
Kernel density estimation
• Straightforward estimates of the density are bumpy
• Instead, Parzen's smooth estimate is preferred:

$$\hat{f}_X(x_0) = \frac{1}{N\lambda} \sum_{i=1}^{N} K_\lambda(x_0, x_i)$$

Normally, Gaussian kernels are used.
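With a Gaussian kernel, Parzen's estimate is a one-liner; a minimal NumPy sketch (the function name is illustrative):

```python
import numpy as np

def parzen_density(x, x0, lam):
    """f_hat(x0) = (1 / (N * lam)) * sum_i phi((x0 - x_i) / lam), phi the standard normal pdf."""
    z = (x0 - x) / lam
    return np.mean(np.exp(-0.5 * z**2) / np.sqrt(2.0 * np.pi)) / lam
```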
Radial basis functions and kernels
Using the idea of basis expansions, we treat kernel functions as basis functions:

$$f(x) = \sum_{j=1}^{M} K_{\lambda_j}(\xi_j, x)\,\beta_j = \sum_{j=1}^{M} D\!\left(\frac{\|x - \xi_j\|}{\lambda_j}\right)\beta_j$$

where ξ_j is a prototype (location) parameter and λ_j a scale parameter.
Radial basis functions and kernels
Choosing the parameters:
• Optimize the sum of squares with respect to all of {β_j, λ_j, ξ_j} jointly (a difficult, non-convex problem)
• Estimate {λ_j, ξ_j} separately from β_j (often by using the distribution of X alone) and then solve least squares for β_j
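A sketch of the second approach: fix Gaussian radial basis functions with a common scale λ at prototypes chosen from X alone, then solve least squares for the β_j (NumPy; the Gaussian choice, the shared scale and all names are illustrative assumptions):

```python
import numpy as np

def fit_rbf(X, y, centers, lam):
    """Solve least squares for beta_j with Gaussian RBFs at fixed prototypes and scale."""
    # basis matrix: H[i, j] = exp(-||x_i - xi_j||^2 / (2 lam^2))
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    H = np.column_stack([np.ones(len(X)), np.exp(-d2 / (2.0 * lam**2))])  # intercept + RBFs
    beta, *_ = np.linalg.lstsq(H, y, rcond=None)
    return beta

# prototypes xi_j chosen from the distribution of X alone, e.g. a random subsample
# or k-means centroids; here a random subsample for illustration:
# rng = np.random.default_rng(0)
# centers = X[rng.choice(len(X), size=10, replace=False)]
```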