Christopher M. Bishop, Pattern Recognition and Machine Learning.
Transcript of Christopher M. Bishop, Pattern Recognition and Machine Learning.
Christopher M. Bishop,Pattern Recognition and Machine Learning
![Page 2: Christopher M. Bishop, Pattern Recognition and Machine Learning.](https://reader035.fdocuments.us/reader035/viewer/2022081504/56649f355503460f94c538cc/html5/thumbnails/2.jpg)
Outline
- Introduction to kernel methods
- Support vector machines (SVM)
- Relevance vector machines (RVM)
- Applications
- Conclusions
Supervised Learning
In machine learning, applications in which the training data comprise examples of the input vectors along with their corresponding target vectors are called supervised learning.

[Figure: a learned function y(x) maps an input x to an output; example training pairs (x, t): (1, 60, pass), (2, 53, fail), (3, 77, pass), (4, 34, fail)]
Classification

[Figure: a decision boundary y(x) = 0 in the (x1, x2) plane separates the region y > 0 (labelled t = +1) from the region y < 0 (labelled t = -1)]
Regression
[Figure: regression data t plotted against x on [0, 1]; the fitted curve is used to predict t at a new input x]
Linear Models

Linear models for regression and classification:

y(x, w) = w_0 + w_1 x_1 + ... + w_D x_D, where x = (x_1, ..., x_D)

If we apply feature extraction:

y(x, w) = \sum_{j=0}^{M-1} w_j \phi_j(x) = w^T \phi(x)

where x is the input and w are the model parameters.
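The linear basis-function model above can be fitted by least squares once the basis functions are chosen. A minimal sketch, assuming a simple polynomial basis phi_j(x) = x^j (the slides do not fix a particular basis):

```python
import numpy as np

def design_matrix(x, M):
    # Columns are the basis functions phi_j(x) = x**j, j = 0..M-1
    return np.vstack([x**j for j in range(M)]).T  # shape (N, M)

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
t = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(x.size)  # noisy targets

Phi = design_matrix(x, M=4)
w, *_ = np.linalg.lstsq(Phi, t, rcond=None)  # least-squares fit of y(x) = w^T phi(x)

y = Phi @ w  # model predictions at the training inputs
```

The model stays linear in the parameters w even though it is nonlinear in x, which is exactly what makes the kernel reformulation in the following slides possible.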
Problems with Feature Space

Why feature extraction? Working in high-dimensional feature spaces solves the problem of expressing complex functions.

Problems:
- computational cost (working with very large vectors)
- the curse of dimensionality
Kernel Methods (1)
Kernel function: an inner product in some feature space, i.e. a nonlinear similarity measure

k(x, x') = \phi(x)^T \phi(x')

Examples:
- polynomial: k(x, x') = (x^T x' + c)^d
- Gaussian: k(x, x') = \exp(-\|x - x'\|^2 / 2\sigma^2)
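The two example kernels can be written directly from their definitions. A small sketch in numpy (the parameter defaults c, d, and sigma are illustrative choices, not values from the slides):

```python
import numpy as np

def polynomial_kernel(x, z, c=1.0, d=2):
    # k(x, z) = (x^T z + c)^d
    return (np.dot(x, z) + c) ** d

def gaussian_kernel(x, z, sigma=1.0):
    # k(x, z) = exp(-||x - z||^2 / (2 sigma^2))
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

x = np.array([1.0, 2.0])
z = np.array([3.0, 0.5])
print(polynomial_kernel(x, z))  # (1*3 + 2*0.5 + 1)^2 = 25.0
print(gaussian_kernel(x, x))    # 1.0: a point is maximally similar to itself
```

Note that both functions only ever touch x and z through inner products or distances in the original input space; no feature vector is ever built.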
Kernel Methods (2)
Many linear models can be reformulated using a “dual representation” in which the kernel functions arise naturally; they only require inner products between data (inputs).

Example, for the kernel k(x, z) = (x^T z)^2 in two dimensions:

k(x, z) = (x^T z)^2 = (x_1 z_1 + x_2 z_2)^2
        = x_1^2 z_1^2 + 2 x_1 z_1 x_2 z_2 + x_2^2 z_2^2
        = (x_1^2, \sqrt{2} x_1 x_2, x_2^2)(z_1^2, \sqrt{2} z_1 z_2, z_2^2)^T
        = \phi(x)^T \phi(z)
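The identity above can be checked numerically: squaring the inner product in the 2D input space gives the same value as an explicit inner product in the 3D feature space.

```python
import numpy as np

def phi(v):
    # Explicit feature map for the 2D quadratic kernel k(x, z) = (x^T z)^2
    return np.array([v[0]**2, np.sqrt(2) * v[0] * v[1], v[1]**2])

x = np.array([1.0, 2.0])
z = np.array([3.0, 0.5])

k_direct = np.dot(x, z) ** 2       # computed in the 2D input space
k_mapped = np.dot(phi(x), phi(z))  # computed in the 3D feature space

print(k_direct, k_mapped)  # both equal 16.0
```

The direct route costs one 2D inner product and a square; the mapped route requires building two 3D feature vectors first. For higher-degree kernels in higher dimensions, the gap between the two costs grows rapidly, which is the computational point of the kernel trick.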
Kernel Methods (3)
We can benefit from the kernel trick:
- choosing a kernel function is equivalent to choosing a feature map φ, so there is no need to specify which features are being used
- we can save computation by not explicitly mapping the data to feature space, but instead working out the inner product directly in the data space
Kernel Methods (4)
Kernel methods exploit information about the inner products between data items.

We can construct kernels indirectly by choosing a feature space mapping φ, or directly by choosing a valid kernel function.

If a bad kernel function is chosen, it will map to a space with many irrelevant features, so we need some prior knowledge of the target.
Kernel Methods (5)
Two basic modules for kernel methods:
- a general-purpose learning model
- a problem-specific kernel function
Kernel Methods (6)
Limitation: the kernel function k(x_n, x_m) must be evaluated for all possible pairs x_n and x_m of training points when making predictions for new data points.

A sparse kernel machine makes predictions using only a subset of the training data points.
Outline
- Introduction to kernel methods
- Support vector machines (SVM)
- Relevance vector machines (RVM)
- Applications
- Conclusions
Support Vector Machines (1)

Support vector machines are a system for efficiently training linear machines in kernel-induced feature spaces, while respecting the insights provided by generalization theory and exploiting optimization theory.

Generalization theory describes how to control learning machines to prevent them from overfitting.
Support Vector Machines (2)

To avoid overfitting, SVMs modify the error function to a “regularized form”

E(w) = E_D(w) + \lambda E_W(w)

where the hyperparameter λ balances the trade-off. The aim of E_W is to limit the estimated functions to smooth functions. As a side effect, the SVM obtains a sparse model.
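The effect of λ is easiest to see in a setting where the regularized error has a closed-form minimizer. A sketch using regularized least squares (ridge regression) as a stand-in: the SVM uses a different data term E_D, but the role of λ in shrinking w toward smoother functions is the same.

```python
import numpy as np

# Illustrative regularized least squares: E(w) = ||Phi w - t||^2 + lam * ||w||^2
rng = np.random.default_rng(1)
x = np.linspace(0, 1, 30)
t = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(x.size)
Phi = np.vstack([x**j for j in range(10)]).T  # 9th-degree polynomial basis

def ridge_fit(Phi, t, lam):
    # Closed-form minimizer: w = (Phi^T Phi + lam I)^{-1} Phi^T t
    M = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(M), Phi.T @ t)

w_loose = ridge_fit(Phi, t, lam=1e-12)  # almost unregularized: large weights
w_tight = ridge_fit(Phi, t, lam=1.0)    # strong regularization: small weights
print(np.linalg.norm(w_loose) > np.linalg.norm(w_tight))  # True
```

Larger λ trades training error for smoothness; the nearly unregularized fit drives the weights to large values to chase the noise.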
Support Vector Machines (3)

Fig. 1 Architecture of SVM
SVM for Classification (1)

The mechanism to prevent overfitting in classification is the “maximum margin classifier”.

The SVM is fundamentally a two-class classifier.
Maximum Margin Classifiers (1)

The aim of classification is to find a (D-1)-dimensional hyperplane that separates the data in a D-dimensional space.

2D example:
Maximum Margin Classifiers (2)
[Figure: a maximum-margin decision boundary; the margin is bounded by the support vectors on either side]
Maximum Margin Classifiers (3)
[Figure: comparison of a small-margin and a large-margin separating boundary]
Maximum Margin Classifiers (4)

Intuitively it is a “robust” solution: if we have made a small error in the location of the boundary, this gives us the least chance of causing a misclassification.

The concept of the maximum margin is usually justified using Vapnik's statistical learning theory.

Empirically it works well.
SVM for Classification (2)

After the optimization process, we obtain the prediction model

y(x) = \sum_{n=1}^{N} a_n t_n k(x, x_n) + b

where (x_n, t_n) are the N training data. We find that a_n is zero except for the support vectors → sparse.
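The sparsity of the dual coefficients a_n can be observed directly in a fitted classifier. A sketch using scikit-learn's `SVC` (an assumption: the slides name no particular library, and the blob data below is made up for illustration):

```python
import numpy as np
from sklearn.svm import SVC

# Two Gaussian blobs with labels t in {-1, +1}
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
t = np.array([-1] * 50 + [+1] * 50)

clf = SVC(kernel="rbf", C=1.0).fit(X, t)

# Predictions y(x) = sum_n a_n t_n k(x, x_n) + b involve only the support
# vectors: for every other training point, a_n is exactly zero.
print(len(clf.support_), "support vectors out of", len(X), "training points")
```

At prediction time only the stored support vectors are kernel-evaluated against the new input, which is what makes the trained SVM a sparse kernel machine.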
SVM for Classification (3)

Fig. 2 Data from two classes in two dimensions, showing contours of constant y(x) obtained from an SVM with a Gaussian kernel function
SVM for Classification (4)

For overlapping class distributions, the SVM allows some of the training points to be misclassified at a penalty → soft margin.
SVM for Classification (5)

For multiclass problems, there are methods to combine multiple two-class SVMs:
- one versus the rest
- one versus one (more training time)
Fig. 3 Problems in multiclass classification using multiple SVMs
SVM for Regression (1)
For regression problems, the mechanism to prevent overfitting is “ε-insensitive error function”
[Figure: the quadratic error function compared with the ε-insensitive error function]
SVM for Regression (2)

Fig. 4 The ε-tube: points inside the tube incur no error; points outside incur error |y(x) - t| - ε
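The ε-insensitive error is a one-line function: residuals inside the tube cost nothing, residuals outside cost their distance to the tube. A minimal numpy sketch (the residual values are illustrative):

```python
import numpy as np

def eps_insensitive(y, t, eps=0.1):
    # Zero error inside the eps-tube, |y - t| - eps outside it
    return np.maximum(0.0, np.abs(y - t) - eps)

t = 0.0
y = np.array([-0.3, -0.05, 0.0, 0.08, 0.25])
print(eps_insensitive(y, t))  # the three points inside the tube incur zero error
```

Because a whole band of residuals maps to zero error, points that land inside the tube exert no pull on the solution; only the tube-boundary and outside points become support vectors.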
SVM for Regression (3)

After the optimization process, we obtain the prediction model

y(x) = \sum_{n=1}^{N} (a_n - \hat{a}_n) k(x, x_n) + b

We find that the coefficients are zero except for the support vectors → sparse.
SVM for Regression (4)
Fig. 5 Regression results. Support vectors lie on the boundary of the tube or outside it
Disadvantages

- Not sparse enough: the number of support vectors required typically grows linearly with the size of the training set
- Predictions are not probabilistic
- Estimating the error/margin trade-off parameters requires cross-validation, which wastes computation
- Kernel functions are limited (they must be valid kernels)
- Multiclass classification is handled awkwardly
Outline
- Introduction to kernel methods
- Support vector machines (SVM)
- Relevance vector machines (RVM)
- Applications
- Conclusions
Relevance Vector Machines (1)

The relevance vector machine (RVM) is a Bayesian sparse kernel technique that shares many of the characteristics of the SVM whilst avoiding its principal limitations.

The RVM is based on a Bayesian formulation and provides posterior probabilistic outputs, as well as having much sparser solutions than the SVM.
Relevance Vector Machines (2)

The RVM mirrors the structure of the SVM and uses a Bayesian treatment to remove the limitations of the SVM:

y(x) = \sum_{n=1}^{N} w_n k(x, x_n) + b

Here the kernel functions are simply treated as basis functions, rather than as dot products in some space.
Bayesian Inference

Bayesian inference allows one to model uncertainty about the world and outcomes of interest by combining common-sense knowledge and observational evidence.
Relevance Vector Machines (3)

In the Bayesian framework, we use a prior distribution over w to avoid overfitting:

p(w \mid \alpha) = \prod_{m=1}^{N} \left(\frac{\alpha}{2\pi}\right)^{1/2} \exp\left(-\frac{\alpha}{2} w_m^2\right)

where α is a hyperparameter which controls the model parameters w.
Relevance Vector Machines (4)

Goal: find the most probable α* and β* to compute the predictive distribution over t_new for a new input x_new, i.e.

p(t_new | x_new, X, t, α*, β*)

Maximize the likelihood function to obtain α* and β*:

p(t | X, α, β)

where X and t are the training data and their target values.
Relevance Vector Machines (5)

The RVM uses “automatic relevance determination” to achieve sparsity:

p(w \mid \alpha) = \prod_{m=1}^{N} \left(\frac{\alpha_m}{2\pi}\right)^{1/2} \exp\left(-\frac{\alpha_m}{2} w_m^2\right)

where α_m represents the precision of w_m. In the procedure of finding α_m*, some α_m become infinite, which drives the corresponding w_m to zero → only the relevance vectors remain!
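The pruning consequence of ARD can be sketched numerically. Everything below is hypothetical: the α values stand in for the outcome of the RVM hyperparameter optimization, with "infinite" precisions represented by a large threshold.

```python
import numpy as np

# Hypothetical result of RVM training: most precisions alpha_m have diverged
# (modelled here as exceeding a large threshold), which forces the
# corresponding posterior weights w_m to zero.
alpha = np.array([1e12, 2.5, 1e12, 1e12, 0.8, 1e12])
w     = np.array([0.0,  1.3, 0.0,  0.0, -0.7, 0.0])

relevant = alpha < 1e6           # basis functions that survive pruning
print(np.flatnonzero(relevant))  # indices of the relevance vectors
print(w[relevant])               # only these weights enter y(x)
```

Only the surviving basis functions, the relevance vectors, contribute to the prediction sum, which is why the RVM ends up far sparser than the SVM.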
Comparisons - Regression
[Figure: RVM fit, showing the one-standard-deviation band of the predictive distribution, compared with the SVM fit]
Comparisons - Regression
Comparison - Classification
[Figure: decision boundaries learned by the RVM and the SVM]
Comparison - Classification
Comparisons
RVMs are much sparser and make probabilistic predictions.

The RVM gives better generalization in regression; the SVM gives better generalization in classification.

The RVM is computationally demanding during learning.
Outline
- Introduction to kernel methods
- Support vector machines (SVM)
- Relevance vector machines (RVM)
- Applications
- Conclusions
Applications (1)
SVM for face detection
Applications (2)
Marti Hearst, “Support Vector Machines”, 1998
Applications (3)
In feature-matching-based object tracking, SVMs are used to detect false feature matches.

Weiyu Zhu et al., “Tracking of Object with SVM Regression”, 2001
Applications (4)
Recovering 3D human poses by RVM
A. Agarwal and B. Triggs, “3D Human Pose from Silhouettes by Relevance Vector Regression”, 2004
Outline
- Introduction to kernel methods
- Support vector machines (SVM)
- Relevance vector machines (RVM)
- Applications
- Conclusions
Conclusions
The SVM is a learning machine based on kernel methods and generalization theory, which can perform binary classification and real-valued function approximation tasks.

The RVM has the same functional form as the SVM but provides probabilistic predictions and sparser solutions.
References
- www.support-vector.net
- N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-based Learning Methods, Cambridge University Press, 2000
- M. E. Tipping, “Sparse Bayesian Learning and the Relevance Vector Machine,” Journal of Machine Learning Research, 2001
Underfitting and Overfitting
[Figure: an underfitted model (too simple) and an overfitted model (too complex) evaluated on new data]

Adapted from http://www.dtreg.com/svm.htm