Data Mining and Statistical Learning - 2008
Kernel methods - overview
Kernel smoothers
Local regression
Kernel density estimation
Radial basis functions
Introduction
Kernel methods are regression techniques used to estimate a response function
from noisy data
Properties:
• Different models are fitted at each query point, and only those observations close to that point are used to fit the model
• The resulting function is smooth
• The models require only a minimum of training
$$y = f(X) + \varepsilon, \qquad X \in \mathbb{R}^d$$
A simple one-dimensional kernel smoother
$$\hat{f}(x_0) = \frac{\sum_{i=1}^{N} K(x_0, x_i)\, y_i}{\sum_{i=1}^{N} K(x_0, x_i)}$$

where

$$K(x_0, x) = \begin{cases} 1, & \text{if } |x - x_0| \le \lambda \\ 0, & \text{otherwise} \end{cases}$$
[Figure: observed values (y roughly between 4.9 and 6) and the fitted kernel-smoothed curve plotted against x from 0 to 25; legend: Observed, Fitted]
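A minimal NumPy sketch of the metric-window smoother above; the toy data and the window width λ = 1 are illustrative, not taken from the slide:

```python
import numpy as np

def box_kernel_smoother(x, y, x0, lam):
    """Average of the y_i whose x_i lie within lam of the query point x0."""
    weights = (np.abs(x - x0) <= lam).astype(float)  # K(x0, x_i): 1 inside the window, 0 outside
    if weights.sum() == 0:
        return np.nan                                # no observations fall in the window
    return np.sum(weights * y) / np.sum(weights)

# toy data: noisy observations of a smooth trend
rng = np.random.default_rng(0)
x = np.linspace(0, 25, 100)
y = 5.5 + 0.3 * np.sin(x / 4) + rng.normal(0, 0.1, size=x.size)

fitted = np.array([box_kernel_smoother(x, y, x0, lam=1.0) for x0 in x])
```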
Kernel methods, splines and ordinary least squares regression (OLS)
• OLS: A single model is fitted to all data
• Splines: Different models are fitted to different subintervals (cuboids) of the input domain
• Kernel methods: Different models are fitted at each query point
Kernel-weighted averages and moving averages
The Nadaraya-Watson kernel-weighted average

$$\hat{f}(x_0) = \frac{\sum_{i=1}^{N} K_\lambda(x_0, x_i)\, y_i}{\sum_{i=1}^{N} K_\lambda(x_0, x_i)}, \qquad K_\lambda(x_0, x) = D\!\left(\frac{|x - x_0|}{\lambda}\right)$$

where λ indicates the window size and the function D shows how the weights change with distance within this window.

The estimated function is smooth!

K-nearest neighbours

$$\hat{f}(x) = \mathrm{Ave}\left(y_i \mid x_i \in N_k(x)\right)$$

The estimated function is piecewise constant!
Examples of one-dimensional kernel smoothers
• Epanechnikov kernel

$$D(t) = \begin{cases} \tfrac{3}{4}\,(1 - t^2), & \text{if } |t| \le 1 \\ 0, & \text{otherwise} \end{cases}$$

• Tri-cube kernel

$$D(t) = \begin{cases} \left(1 - |t|^3\right)^3, & \text{if } |t| \le 1 \\ 0, & \text{otherwise} \end{cases}$$
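These two kernels, and the Nadaraya-Watson average built from them, are short to write down; a minimal NumPy sketch (the function names are my own):

```python
import numpy as np

def epanechnikov(t):
    """D(t) = 3/4 (1 - t^2) for |t| <= 1, else 0."""
    return np.where(np.abs(t) <= 1, 0.75 * (1 - t**2), 0.0)

def tricube(t):
    """D(t) = (1 - |t|^3)^3 for |t| <= 1, else 0."""
    return np.where(np.abs(t) <= 1, (1 - np.abs(t)**3)**3, 0.0)

def nadaraya_watson(x, y, x0, lam, D=epanechnikov):
    """Kernel-weighted average at the query point x0 with window width lam."""
    w = D(np.abs(x - x0) / lam)      # K_lambda(x0, x_i) = D(|x_i - x0| / lam)
    return np.sum(w * y) / np.sum(w)  # undefined (nan) if no points fall in the window
```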
Issues in kernel smoothing
• The smoothing parameter λ has to be defined
• When there are ties at x_i: compute an average y value and introduce weights representing the number of points
• Boundary issues
• Varying density of observations:
– the bias is constant
– the variance is inversely proportional to the density
Boundary effects of one-dimensional kernel smoothers
Locally-weighted averages can be badly biased at the boundaries if the response function has a significant slope; the remedy is to apply local linear regression.
Local linear regression
Find the intercept and slope parameters α(x₀) and β(x₀) by solving

$$\min_{\alpha(x_0),\,\beta(x_0)} \sum_{i=1}^{N} K_\lambda(x_0, x_i)\,\left[y_i - \alpha(x_0) - \beta(x_0)\,x_i\right]^2$$

The solution is a linear combination of the y_i:

$$\hat{f}(x_0) = \sum_{i=1}^{N} l_i(x_0)\, y_i$$
Kernel smoothing vs local linear regression
Kernel smoothing: solve the minimization problem

$$\min_{\alpha(x_0)} \sum_{i=1}^{N} K_\lambda(x_0, x_i)\,\left[y_i - \alpha(x_0)\right]^2$$

Local linear regression: solve the minimization problem

$$\min_{\alpha(x_0),\,\beta(x_0)} \sum_{i=1}^{N} K_\lambda(x_0, x_i)\,\left[y_i - \alpha(x_0) - \beta(x_0)\,x_i\right]^2$$
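To make the contrast concrete, here is a minimal NumPy sketch of both minimizations at a single query point; the Epanechnikov kernel and the function names are illustrative choices:

```python
import numpy as np

def epanechnikov(t):
    return np.where(np.abs(t) <= 1, 0.75 * (1 - t**2), 0.0)

def kernel_smooth_at(x, y, x0, lam):
    """Constant fit: the minimizer alpha(x0) is the kernel-weighted mean of y."""
    w = epanechnikov(np.abs(x - x0) / lam)
    return np.sum(w * y) / np.sum(w)

def local_linear_at(x, y, x0, lam):
    """Linear fit: minimize sum_i K(x0, x_i) [y_i - alpha - beta x_i]^2."""
    w = epanechnikov(np.abs(x - x0) / lam)
    B = np.column_stack([np.ones_like(x), x])   # design matrix (intercept, slope)
    sw = np.sqrt(w)                             # weighted least squares via sqrt-weights
    (alpha, beta), *_ = np.linalg.lstsq(B * sw[:, None], y * sw, rcond=None)
    return alpha + beta * x0                    # evaluate the local line at x0
```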
Properties of local linear regression
• Automatically modifies the kernel weights to correct for bias
• Bias depends only on the terms of order higher than one in the expansion of f.
Local polynomial regression
• Fitting polynomials instead of straight lines
Behavior of the estimated response function
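A minimal sketch of a degree-d local polynomial fit at one query point (NumPy; the tri-cube kernel and the function name are illustrative):

```python
import numpy as np

def local_poly_fit(x, y, x0, lam, degree=2):
    """Weighted least-squares fit of a degree-`degree` polynomial around x0."""
    t = np.abs(x - x0) / lam
    w = np.where(t <= 1, (1 - t**3)**3, 0.0)             # tri-cube kernel weights
    B = np.vander(x - x0, degree + 1, increasing=True)   # columns 1, (x-x0), (x-x0)^2, ...
    sw = np.sqrt(w)
    coef, *_ = np.linalg.lstsq(B * sw[:, None], y * sw, rcond=None)
    return coef[0]   # centered at x0, the fitted polynomial at x0 is its intercept
```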
Polynomial vs local linear regression
Advantages:
• Reduces the ”Trimming of hills and filling of valleys”
Disadvantages:
• Higher variance (tails are more wiggly)
Selecting the width of the kernel
Bias-Variance tradeoff:
Selecting a narrow window leads to high variance and low bias, whilst selecting a wide window leads to high bias and low variance.
Selecting the width of the kernel
1. Automatic selection (cross-validation)
2. Fixing the degrees of freedom

$$\hat{\mathbf{f}} = \mathbf{S}_\lambda \mathbf{y}, \qquad S_{ij} = l_j(x_i)$$

$$df = \mathrm{trace}(\mathbf{S}_\lambda)$$
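A sketch of how the smoother matrix and its trace can be computed for local linear regression (NumPy; the Epanechnikov kernel and helper names are illustrative):

```python
import numpy as np

def epanechnikov(t):
    return np.where(np.abs(t) <= 1, 0.75 * (1 - t**2), 0.0)

def smoother_matrix(x, lam):
    """S_lambda for local linear regression: row i holds the weights l_j(x_i)."""
    B = np.column_stack([np.ones_like(x), x])
    S = np.zeros((x.size, x.size))
    for i, x0 in enumerate(x):
        w = epanechnikov(np.abs(x - x0) / lam)
        WB = B * w[:, None]                              # W B, with W = diag(w)
        # l(x0)^T = b(x0)^T (B^T W B)^{-1} B^T W  (assumes enough points in every window)
        S[i] = np.array([1.0, x0]) @ np.linalg.solve(B.T @ WB, WB.T)
    return S

# effective degrees of freedom: adjust lam until trace(S_lambda) matches the desired df
# df = np.trace(smoother_matrix(x, lam=1.0))
```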
Local regression in R^p
The one-dimensional approach is easily extended to p dimensions by
• Using the Euclidean norm as a measure of distance in the kernel
• Modifying the polynomial, e.g. in two dimensions

$$b(X) = \left(1,\ X_1,\ X_2,\ X_1^2,\ X_2^2,\ X_1 X_2\right)$$
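A sketch of local regression in R^2 along these lines, using the Euclidean norm inside a tri-cube kernel and the quadratic basis b(X) above (NumPy; the function names are illustrative):

```python
import numpy as np

def quadratic_basis(X):
    """b(X) = (1, X1, X2, X1^2, X2^2, X1*X2) for rows of a two-column input matrix."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([np.ones(len(X)), x1, x2, x1**2, x2**2, x1 * x2])

def local_quadratic_fit(X, y, x0, lam):
    """Local quadratic regression at x0 in R^2; Euclidean distance inside a tri-cube kernel."""
    t = np.linalg.norm(X - x0, axis=1) / lam          # ||x_i - x0|| / lambda
    w = np.where(t <= 1, (1 - t**3)**3, 0.0)          # tri-cube weights
    B = quadratic_basis(X)
    sw = np.sqrt(w)
    coef, *_ = np.linalg.lstsq(B * sw[:, None], y * sw, rcond=None)
    return (quadratic_basis(x0.reshape(1, -1)) @ coef)[0]
```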
Local regression in R^p
”The curse of dimensionality”
• The fraction of points close to the boundary of the input domain increases with its dimension
• Observed data do not cover the whole input domain
Structured local regression models
Structured kernels (standardize each variable)

$$K_{\lambda, A}(x_0, x) = D\!\left(\frac{(x - x_0)^T A\,(x - x_0)}{\lambda}\right)$$

Note: A is positive semidefinite
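A small sketch of such a structured kernel; taking A as the diagonal matrix of inverse variances standardizes each variable (NumPy; the tri-cube profile D is my own choice):

```python
import numpy as np

def structured_kernel(X, x0, A, lam):
    """K_{lambda,A}(x0, x_i) = D((x_i - x0)^T A (x_i - x0) / lambda), tri-cube profile D."""
    diff = X - x0
    t = np.einsum('ij,jk,ik->i', diff, A, diff) / lam   # quadratic form, one value per row
    return np.where(t <= 1, (1 - t**3)**3, 0.0)

# standardizing each variable corresponds to a diagonal (positive semidefinite) A:
# A = np.diag(1.0 / X.var(axis=0))
```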
Structured local regression models
Structured regression functions
• ANOVA decompositions (e.g., additive models)
Backfitting algorithms can be used
• Varying coefficient models (partition X)
$$f(X) = \alpha(Z) + \beta_1(Z)\,X_1 + \cdots + \beta_q(Z)\,X_q$$

where X is partitioned into (X₁, …, X_q) and the remaining variables Z (formula 6.17 in Hastie et al.)
Structured local regression models
Varying coefficient models (example)
Local methods
• Assumption: the model is locally linear -> maximize the log-likelihood locally at x₀:

$$l(\beta(x_0)) = \sum_{i=1}^{N} K_\lambda(x_0, x_i)\, l\!\left(y_i,\ x_i^T \beta(x_0)\right)$$

• Autoregressive time series: y_t = β₀ + β₁ y_{t-1} + … + β_k y_{t-k} + e_t, i.e. y_t = z_tᵀβ + e_t with z_t = (1, y_{t-1}, …, y_{t-k}). Fit by local least squares with kernel K(z₀, z_t).
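A minimal sketch of the autoregressive case; local least squares is from the slide, while the Gaussian kernel in lag space, the function name and the bandwidth are my own assumptions:

```python
import numpy as np

def local_ar_forecast(y, k, z0, lam):
    """Fit y_t = z_t^T beta locally around the lag vector z0 and return the prediction.

    z_t = (1, y_{t-1}, ..., y_{t-k}); observations are weighted by a Gaussian
    kernel in lag space, K(z0, z_t) = exp(-||z_t - z0||^2 / (2 lam^2)).
    """
    n = len(y)
    Z = np.column_stack([np.ones(n - k)] + [y[k - j - 1:n - j - 1] for j in range(k)])
    target = y[k:]
    w = np.exp(-np.sum((Z - z0) ** 2, axis=1) / (2 * lam**2))
    sw = np.sqrt(w)
    beta, *_ = np.linalg.lstsq(Z * sw[:, None], target * sw, rcond=None)
    return z0 @ beta

# usage: to forecast the next value, take z0 = np.array([1, y[-1], ..., y[-k]])
```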
Kernel density estimation
• Straightforward estimates of the density are bumpy
• Instead, Parzen's smooth estimate is preferred:

$$\hat{f}_X(x_0) = \frac{1}{N\lambda} \sum_{i=1}^{N} K_\lambda(x_0, x_i)$$

Normally, Gaussian kernels are used.
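With a Gaussian kernel, Parzen's estimate is a one-liner; a minimal NumPy sketch (the function name is illustrative):

```python
import numpy as np

def parzen_density(x, x0, lam):
    """f_hat(x0) = (1 / (N * lam)) * sum_i phi((x0 - x_i) / lam), phi the standard normal pdf."""
    z = (x0 - x) / lam
    return np.mean(np.exp(-0.5 * z**2) / np.sqrt(2.0 * np.pi)) / lam
```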
Radial basis functions and kernels
Using the idea of basis expansions, we treat kernel functions as basis functions:

$$f(x) = \sum_{j=1}^{M} K_{\lambda_j}(\xi_j, x)\,\beta_j = \sum_{j=1}^{M} D\!\left(\frac{\|x - \xi_j\|}{\lambda_j}\right)\beta_j$$

where ξ_j is a prototype (location) parameter and λ_j a scale parameter.
Radial basis functions and kernels
Choosing the parameters:
• Optimize the sum of squares with respect to all of {β_j, λ_j, ξ_j} jointly (a difficult, non-convex problem)
• Estimate {λ_j, ξ_j} separately from β_j (often by using the distribution of X alone) and then solve least squares for β_j
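A sketch of the second approach: fix Gaussian radial basis functions with a common scale λ at prototypes chosen from X alone, then solve least squares for the β_j (NumPy; the Gaussian choice, the shared scale and all names are illustrative assumptions):

```python
import numpy as np

def fit_rbf(X, y, centers, lam):
    """Solve least squares for beta_j with Gaussian RBFs at fixed prototypes and scale."""
    # basis matrix: H[i, j] = exp(-||x_i - xi_j||^2 / (2 lam^2))
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    H = np.column_stack([np.ones(len(X)), np.exp(-d2 / (2.0 * lam**2))])  # intercept + RBFs
    beta, *_ = np.linalg.lstsq(H, y, rcond=None)
    return beta

# prototypes xi_j chosen from the distribution of X alone, e.g. a random subsample
# or k-means centroids; here a random subsample for illustration:
# rng = np.random.default_rng(0)
# centers = X[rng.choice(len(X), size=10, replace=False)]
```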