Transcript of Multiple Kernel Learning based Approach to Representation and Feature Selection for Image Classification using Support Vector Machines
1
Multiple Kernel Learning based Approach to Representation and Feature Selection for
Image Classification using Support Vector Machines
Dr. C. Chandra Sekhar
Speech and Vision Laboratory, Dept. of Computer Science and Engineering, Indian Institute of Technology Madras, Chennai-600036
Presented at ICAC-09, Tiruchirappalli, on August 8, 2009
2
• Kernel methods for pattern analysis
• Support vector machines for pattern classification
• Multiple kernel learning
• Representation and feature selection
• Image classification
Outline of the Talk
3
Residential Interiors
Mountains
Military Vehicles
Sacred Places
Sunsets & Sunrises
Image Classification
4
• Pattern: Any regularity, relation or structure in data or source of data
• Pattern analysis: Automatic detection and characterization of relations in data
• Statistical and machine learning methods for pattern analysis assume that the data is in vectorial form
• Relations among data are expressed as
– Classification rules
– Regression functions
– Cluster structures
Pattern Analysis
5
• Stage I (1960's):
– Learning methods to detect linear relations among data
– Perceptron models
• Stage II (mid 1980's):
– Learning methods to detect nonlinear patterns
– Gradient descent methods: local minima problem
– Multilayer feedforward neural networks
• Stage III (mid 1990's):
– Kernel methods to detect nonlinear relations
– Problem of local minima does not exist
– Methods are applicable to non-vectorial data also: sets, texts, strings, biosequences, graphs, trees
– Support vector machines
Evolution of Pattern Analysis Methods
6
Artificial Neural Networks
7
Neuron with Threshold Logic Activation Function
• McCulloch-Pitts neuron
• Inputs x_1, ..., x_M with weights w_1, ..., w_M
• Activation value: a = ∑_{i=1}^{M} w_i x_i − θ
• Output signal: s = f(a), where f is the threshold logic activation function
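As a concrete illustration (not from the slides), the computation above can be sketched in a few lines of Python; the weights and threshold here are hand-picked values so that the neuron realizes a logical AND:

```python
def threshold_neuron(x, w, theta):
    """McCulloch-Pitts neuron: weighted sum of inputs minus threshold,
    passed through a threshold logic activation function f."""
    a = sum(wi * xi for wi, xi in zip(w, x)) - theta  # activation value
    return 1 if a >= 0 else 0                         # output signal s = f(a)

# Hand-picked weights and threshold: the neuron fires only when
# both binary inputs are 1 (logical AND).
w, theta = [1.0, 1.0], 1.5
print(threshold_neuron([1, 1], w, theta))  # 1
print(threshold_neuron([1, 0], w, theta))  # 0
```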
8
Perceptron
[Figure: two linearly separable class regions in the (x1, x2) plane]
• Linearly separable classes: Regions of two classes are separable by a linear surface (line, plane or hyperplane)
9
Hard Problems
• Nonlinearly separable classes
[Figure: nonlinearly separable class regions in the (a1, a2) plane]
10
Neuron with Continuous Activation Function
• Inputs x_1, ..., x_M with weights w_1, ..., w_M
• Activation value: a = ∑_{i=1}^{M} w_i x_i − θ
• Output signal: s = f(a), where f is the sigmoidal activation function
11
Multilayer Feedforward Neural Network
• Architecture:
– Input layer: linear neurons
– One or more hidden layers: nonlinear neurons
– Output layer: linear or nonlinear neurons
[Figure: fully connected feedforward network with input layer (x_1, ..., x_I), hidden layer (units 1, ..., J) and output layer (s_1, ..., s_K)]
12
Gradient Descent Method
[Figure: error plotted against a weight, showing a global minimum and two local minima]
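A minimal sketch (not from the slides) of gradient descent on a hand-picked one-dimensional error function with both a local and a global minimum; which minimum the method reaches depends on the starting weight:

```python
def error(w):
    # Hand-picked error surface with a global minimum near w = -1.30
    # and a local minimum near w = 1.13
    return w**4 - 3 * w**2 + w

def gradient(w):
    return 4 * w**3 - 6 * w + 1

def gradient_descent(w0, lr=0.01, steps=2000):
    """Plain gradient descent on the error surface, started at w0."""
    w = w0
    for _ in range(steps):
        w -= lr * gradient(w)
    return w

# Different starting points converge to different minima,
# illustrating the local-minima problem.
print(gradient_descent(-2.0))  # converges near the global minimum
print(gradient_descent(2.0))   # trapped near the local minimum
```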
13
Artificial Neural Networks: Summary
• Perceptrons, with the threshold logic function as activation function, are suitable for linearly separable classes
• Multilayer feedforward neural networks, with the sigmoidal function as activation function, are suitable for nonlinearly separable classes
– Complexity of the model depends on
• Dimension of the input pattern vector
• Number of classes
• Shapes of the decision surfaces to be formed
– Architecture of the model is determined empirically
– A large number of training examples is required when the complexity of the model is high
– Local minima problem
– Suitable only for vectorial data
14
Kernel Methods
15
Key Aspects of Kernel Methods
• Kernel methods involve
– Nonlinear transformation of data to a higher dimensional feature space induced by a Mercer kernel
– Detection of optimal linear solutions in the kernel feature space
• Transformation to a higher dimensional space is expected to be helpful in converting nonlinear relations into linear relations (Cover's theorem)
– Nonlinearly separable patterns to linearly separable patterns
– Nonlinear regression to linear regression
– Nonlinear separation of clusters to linear separation of clusters
• Pattern analysis methods are implemented in such a way that the kernel feature space representation is not explicitly required. They involve computation of pair-wise inner-products only.
• The pair-wise inner-products are computed efficiently directly from the original representation of data using a kernel function (Kernel trick)
16
Φ : (x, y) → (x², y², √2 xy)
Illustration of Transformation
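The kernel trick for this transformation can be verified numerically (a sketch, not from the slides): the inner product of the transformed vectors equals the degree-2 polynomial kernel computed directly in the original space:

```python
import math

def phi(u):
    """Explicit feature map (x, y) -> (x^2, y^2, sqrt(2)*x*y)."""
    x, y = u
    return (x * x, y * y, math.sqrt(2) * x * y)

def poly2_kernel(u, v):
    """Degree-2 polynomial kernel, computed in the original 2-D space."""
    return (u[0] * v[0] + u[1] * v[1]) ** 2

u, v = (1.0, 2.0), (3.0, -1.0)
inner_in_feature_space = sum(a * b for a, b in zip(phi(u), phi(v)))
print(inner_in_feature_space)  # same value as below, up to floating point
print(poly2_kernel(u, v))
```

The kernel function thus gives the pair-wise inner product in the 3-D feature space without ever constructing the feature vectors.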
17
Support Vector Machines for Pattern Classification
18
Optimal Separating Hyperplane
[Figure: candidate separating hyperplanes for two classes in the (z1, z2) feature space]
19
Maximum Margin Hyperplane
20
Maximum Margin Hyperplane
Margin = 1 / ||w||
Separating hyperplane: wᵀφ(x) + b = 0
Supporting hyperplanes: wᵀφ(x) + b = +1 and wᵀφ(x) + b = −1
31
Kernel Functions
• Mercer kernels: Kernel functions that satisfy Mercer's theorem
• Kernel Gram matrix: The matrix containing the values of the kernel function on all pairs of data points in the training data set
• The kernel Gram matrix should be positive semi-definite, i.e., its eigenvalues should be non-negative, for convergence of the iterative method used for solving the constrained optimization problem
• Kernels for vectorial data:
– Linear kernel
– Polynomial kernel
– Gaussian kernel
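To make the Mercer condition concrete, the following sketch (not from the talk) builds the Gram matrix of each of the three kernels above on a few made-up points and checks that its eigenvalues are non-negative:

```python
import numpy as np

def linear_kernel(x, y):
    return x @ y

def polynomial_kernel(x, y, degree=2, c=1.0):
    return (x @ y + c) ** degree

def gaussian_kernel(x, y, gamma=0.5):
    return np.exp(-gamma * np.sum((x - y) ** 2))

# Made-up 2-D data points.
X = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
n = len(X)

for kernel in (linear_kernel, polynomial_kernel, gaussian_kernel):
    K = np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])
    # Mercer condition: the Gram matrix is positive semi-definite,
    # i.e. all eigenvalues are non-negative (up to numerical error).
    eigvals = np.linalg.eigvalsh(K)
    print(kernel.__name__, eigvals.min() >= -1e-9)
```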
32
Kernel Functions
• Kernels for non-vectorial data:
– Kernels on graphs
– Kernels on sets
– Kernels for texts
• Vector space kernels
• Semantic kernels
• Latent semantic kernels
– Kernels on strings
• String kernels
• Trie-based kernels
– Kernels on trees
– Kernels from generative models
• Fisher kernels
– Kernels on images
• Hausdorff kernels
• Histogram intersection kernels
• Kernels learnt from data
• Kernels optimized for data
36
Applications of Kernel Methods
• Genomics and computational biology
• Knowledge discovery in clinical microarray data analysis
• Recognition of white blood cells of Leukaemia
• Classification of interleaved human brain tasks in fMRI
• Image classification and retrieval
• Hyperspectral image classification
• Perceptual representations for image coding
• Complex SVM approach to OFDM coherent demodulation
• Smart antenna array processing
• Speaker verification
• Kernel CCA for learning the semantics of text
G. Camps-Valls, J. L. Rojo-Alvarez and M. Martinez-Ramon, Kernel Methods in Bioengineering, Signal and Image Processing, Idea Group Publishing, 2007
37
Work on Kernel Methods at IIT Madras
Focus: Design of suitable kernels and development of kernel methods for speech, image and video processing tasks
Tasks:
• Speech recognition
• Speech segmentation
• Speaker segmentation
• Voice activity detection
• Handwritten character recognition
• Face detection
• Image classification
• Video shot boundary detection
• Time series data processing
38
Kernel Methods: Summary
• Kernel methods involve
– Nonlinear transformation of data to a higher dimensional feature space induced by a Mercer kernel
– Construction of optimal linear solutions in the kernel feature space
• Complexity of the model does not depend on the dimension of the data
• Models can be trained with small datasets
• Kernel methods can be used for non-vectorial data also
• Performance of kernel methods depends on the choice of the kernel. Design of a suitable kernel is a research problem.
• Approaches to designing kernels by learning them from data, or by constructing optimized kernels for a given data set, are being explored
39
Multiple Kernel Learning based Approach to Representation and Feature Selection for
Image Classification using Support Vector Machines
Dileep A. D. and C. Chandra Sekhar, "Representation and feature selection using multiple kernel learning", in Proceedings of International Joint Conference on Neural Networks, vol. 1, pp. 717-722, Atlanta, USA, June 2009.
40
Representation and Feature Selection
• Early fusion for combining multiple representations
• Feature selection
– Compute a class separability measuring criterion function
– Rank features in decreasing order of the criterion function
– Features corresponding to the best values of the criterion function are then selected
• The multiple kernel learning (MKL) approach provides an integrated framework to perform both representation selection and feature selection
[Figure: early fusion — representations 1, 2, ..., m are concatenated and given to a single classifier]
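The feature-selection steps above (score, rank, select) can be sketched in Python. The slides do not name a specific criterion, so the Fisher score is used here as one common choice, on made-up toy data:

```python
def fisher_score(values_c1, values_c2):
    """Class separability criterion for one feature (two-class case):
    ratio of between-class separation to within-class spread."""
    m1 = sum(values_c1) / len(values_c1)
    m2 = sum(values_c2) / len(values_c2)
    v1 = sum((v - m1) ** 2 for v in values_c1) / len(values_c1)
    v2 = sum((v - m2) ** 2 for v in values_c2) / len(values_c2)
    return (m1 - m2) ** 2 / (v1 + v2 + 1e-12)

# Toy data: rows are examples, columns are two features (made-up numbers);
# feature 0 separates the classes, feature 1 carries no class information.
class1 = [[5.0, 0.1], [6.0, 0.3], [5.5, 0.2]]
class2 = [[1.0, 0.2], [2.0, 0.1], [1.5, 0.3]]

scores = [fisher_score([x[f] for x in class1], [x[f] for x in class2])
          for f in range(2)]

# Rank features in decreasing order of the criterion and keep the best.
ranking = sorted(range(2), key=lambda f: -scores[f])
print(ranking)  # [0, 1]
```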
41
Multiple Kernel Learning (MKL)
• A new kernel can be obtained as a linear combination of many base kernels:
k(x_i, x_j) = ∑_l γ_l k_l(x_i, x_j)
– Each of the base kernels is a Mercer kernel
– The non-negative coefficient γ_l represents the weight of the l-th base kernel
• Issues
– Choice of the base kernels
– Technique for determining the values of the weights γ_l
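As an illustration (not part of the talk), the weighted combination can be checked numerically: with made-up Gaussian base kernels of different widths and hand-picked non-negative weights, the combined Gram matrix remains symmetric and positive semi-definite, i.e. the combined kernel is again a Mercer kernel:

```python
import numpy as np

def gaussian(x, y, width):
    """A Gaussian base kernel with the given width parameter."""
    return np.exp(-width * np.sum((x - y) ** 2))

def combined_kernel(x, y, widths_and_weights):
    """k(x, y) = sum_l gamma_l * k_l(x, y): non-negative linear
    combination of Mercer base kernels."""
    return sum(weight * gaussian(x, y, width)
               for width, weight in widths_and_weights)

# Base kernels and weights are illustrative values; in MKL the
# weights gamma_l would be learnt from the data.
bases = [(0.1, 0.7), (1.0, 0.2), (10.0, 0.1)]  # (width, gamma_l) pairs

X = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.5]])
K = np.array([[combined_kernel(a, b, bases) for b in X] for a in X])
print(np.allclose(K, K.T))                    # symmetric
print(np.linalg.eigvalsh(K).min() >= -1e-9)   # positive semi-definite
```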
42
MKL framework for SVM
• In the SVM framework, the MKL task is considered as a way of optimizing the kernel weights while training the SVM
• The optimization problem of the SVM (dual problem) is given as:
max_α  ∑_{i=1}^{n} α_i − (1/2) ∑_{i=1}^{n} ∑_{j=1}^{n} α_i α_j y_i y_j ∑_{l=1}^{m} γ_l k_l(x_i, x_j)
s.t.  ∑_{i=1}^{n} α_i y_i = 0
      0 ≤ α_i ≤ C,  i = 1, ..., n
      γ_l ≥ 0,  l = 1, ..., m
• Both the Lagrange coefficients αi and the base kernel weights γl are determined by solving the optimization problem
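The following sketch (illustrative only, with made-up data and hand-picked values) evaluates the dual objective above for given α and γ; an actual MKL solver would jointly optimize both under the stated constraints:

```python
import numpy as np

def mkl_dual_objective(alpha, y, base_kernels, gamma):
    """Value of the MKL dual objective
    sum_i alpha_i - 0.5 * sum_ij alpha_i alpha_j y_i y_j sum_l gamma_l k_l(x_i, x_j)
    for given alpha and gamma (this only evaluates the objective;
    it does not solve the optimization problem)."""
    K = sum(g * Kl for g, Kl in zip(gamma, base_kernels))  # combined Gram matrix
    v = alpha * y
    return alpha.sum() - 0.5 * (v @ K @ v)

# Toy problem: 4 one-dimensional points, 2 base kernels.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([-1.0, -1.0, 1.0, 1.0])
K1 = X @ X.T           # linear base kernel Gram matrix
K2 = (X @ X.T) ** 2    # degree-2 polynomial base kernel Gram matrix

alpha = np.array([0.1, 0.2, 0.2, 0.1])  # satisfies sum_i alpha_i y_i = 0
gamma = np.array([0.5, 0.5])            # non-negative base-kernel weights

# Check the dual constraints before evaluating the objective.
assert abs((alpha * y).sum()) < 1e-12 and (alpha >= 0).all() and (gamma >= 0).all()
print(mkl_dual_objective(alpha, y, [K1, K2], gamma))
```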
43
Representation and Feature Selection using MKL
• A base kernel is computed for each representation of data
• MKL method is used to determine the weights for the base kernels
• Base kernels with non-zero weights indicate the relevant representations
• Similarly, a base kernel is computed for each feature in a representation
• MKL method is used to select the relevant features with non-zero weights of the corresponding base kernels
44
Studies on Image Categorization
• Dataset:
– Data of 5 classes in the Corel image database
– Number of images per class: 100 (Training: 90, Test: 10, 10-fold evaluation)
– Class 1: African specialty animals
– Class 2: Alaskan wildlife
– Class 3: Arabian horses
– Class 4: North American wildlife
– Class 5: Wildlife of Galapagos
45
Representations
• Information / representation / tiling / dimension:
– Colour: DCT coefficients of average colour in 20 x 20 grid (global, dim 12)
– Colour: CIE L*a*b* colour coordinates of two dominant colour clusters (global, dim 6)
– Shape: Magnitude of the 16 x 16 FFT of Sobel edge image (global, dim 128)
– Colour and structure: Haar transform of quantized HSV colour histogram (global, dim 256)
– Shape: MPEG-7 edge histogram (4 x 4 tiling, dim 80)
– Colour: Three first central moments of L*a*b* CIE colour distribution (5 tiles, dim 45)
– Colour: Average CIE L*a*b* colour (5 tiles, dim 15)
– Shape: Co-occurrence matrix of four Sobel edge directions (5 tiles, dim 80)
– Shape: Histogram of four Sobel edge directions (5 tiles, dim 20)
– Texture: Texture feature based on histogram of relative brightness of neighbouring pixels (5 tiles, dim 40)
46
Classification Performance for Different Representations
• Classification accuracy (in %) of SVM based classifier
– DCT coefficients of average colour in 20 x 20 grid: linear kernel 60.8, Gaussian kernel 69.2
– CIE L*a*b* colour coordinates of two dominant colour clusters: linear 60.8, Gaussian 70.2
– Magnitude of the 16 x 16 FFT of Sobel edge image: linear 46.6, Gaussian 57.6
– Haar transform of quantized HSV colour histogram: linear 84.2, Gaussian 86.8
– MPEG-7 edge histogram: linear 53.8, Gaussian 53.4
– Three first central moments of L*a*b* CIE colour distribution: linear 83.8, Gaussian 85.6
– Average CIE L*a*b* colour: linear 68.2, Gaussian 78.6
– Co-occurrence matrix of four Sobel edge directions: linear 44.0, Gaussian 49.8
– Histogram of four Sobel edge directions: linear 36.8, Gaussian 49.2
– Texture feature based on histogram of relative brightness of neighbouring pixels: linear 46.0, Gaussian 51.6
47
Class-wise Performance • Gaussian kernel SVM based classifier
• Classification accuracy (in %) per class (Class 1 / Class 2 / Class 3 / Class 4 / Class 5):
– DCT coefficients of average colour in 20 x 20 grid: 77 / 40 / 97 / 74 / 73
– CIE L*a*b* colour coordinates of two dominant colour clusters: 78 / 39 / 96 / 65 / 80
– Magnitude of the 16 x 16 FFT of Sobel edge image: 53 / 29 / 83 / 61 / 72
– Haar transform of quantized HSV colour histogram: 86 / 80 / 99 / 90 / 92
– MPEG-7 edge histogram: 38 / 31 / 82 / 44 / 56
– Three first central moments of L*a*b* CIE colour distribution: 86 / 79 / 100 / 86 / 80
– Average CIE L*a*b* colour: 75 / 52 / 96 / 81 / 84
– Co-occurrence matrix of four Sobel edge directions: 59 / 30 / 77 / 41 / 52
– Histogram of four Sobel edge directions: 49 / 22 / 71 / 24 / 55
– Texture feature based on histogram of relative brightness of neighbouring pixels: 40 / 33 / 89 / 43 / 52
48
Performance for Combination of All Representations
• Supervector obtained using all the representations: dimension 682, classification accuracy 90.40% (linear kernel), 91.20% (Gaussian kernel)
• Supervector obtained using all representations gives a better performance than any one representation
• Dimension of the combined representation is high
49
Representations Selected using MKL
• Linear kernel as base kernel
• Representations with non-zero weights:
– Haar transform of quantized HSV colour histogram (dim 256): weight 0.5627
– Three first central moments of L*a*b* CIE colour distribution (dim 45): weight 0.1615
– MPEG-7 edge histogram (dim 80): weight 0.1225
– Magnitude of the 16 x 16 FFT of Sobel edge image (dim 128): weight 0.1170
– Co-occurrence matrix of four Sobel edge directions (dim 80): weight 0.0261
– Texture feature based on histogram of relative brightness of neighbouring pixels (dim 40): weight 0.0101
• Representations with zero weight: DCT coefficients of average colour in 20 x 20 grid; CIE L*a*b* colour coordinates of two dominant colour clusters; Average CIE L*a*b* colour; Histogram of four Sobel edge directions
50
Representations Selected using MKL
• Gaussian kernel as base kernel
• Representations with non-zero weights:
– Haar transform of quantized HSV colour histogram (dim 256): weight 0.6045
– Average CIE L*a*b* colour (dim 15): weight 0.2326
– MPEG-7 edge histogram (dim 80): weight 0.1535
– CIE L*a*b* colour coordinates of two dominant colour clusters (dim 6): weight 0.0064
– Magnitude of the 16 x 16 FFT of Sobel edge image (dim 128): weight 0.0030
• Representations with zero weight: DCT coefficients of average colour in 20 x 20 grid; Three first central moments of L*a*b* CIE colour distribution; Co-occurrence matrix of four Sobel edge directions; Histogram of four Sobel edge directions; Texture feature based on histogram of relative brightness of neighbouring pixels
51
Performance for Different Methods of Combining Representations
• Classification accuracy (in %)
– Supervector of all the representations: dimension 682, linear 90.4; dimension 682, Gaussian 91.2
– MKL approach: dimension 682, linear 92.0; dimension 682, Gaussian 92.4
– Supervector of representations with non-zero weights determined using the MKL method: dimension 629, linear 91.8; dimension 485, Gaussian 93.0
• Supervector obtained using representations with non-zero weights gives the same performance as MKL approach, with a significantly reduced dimension of the combined representation
52
Feature Selection using MKL
• Representations and features selected using MKL with the Gaussian kernel as base kernel
• Dimension (actual features → selected features) and classification accuracy in % (actual features → weighted features):
– CIE L*a*b* colour coordinates of two dominant colour clusters: 6 → 6; 70.2 → 70.6
– Magnitude of the 16 x 16 FFT of Sobel edge image: 128 → 62; 57.6 → 60.4
– Haar transform of quantized HSV colour histogram: 256 → 63; 86.8 → 88.0
– MPEG-7 edge histogram: 80 → 53; 53.4 → 53.8
– Average CIE L*a*b* colour: 15 → 14; 78.6 → 79.0
53
Performance of Different Methods for Feature Selection
– Supervector of representations with non-zero weight values determined using the MKL method: dimension 485, classification accuracy 93.0%
– Supervector of features with non-zero weight values determined using the MKL method: dimension 198, classification accuracy 93.0%
54
Multiple Kernel Learning: Summary
• The multiple kernel learning (MKL) approach is useful for combining representations as well as selecting features, with the weights of the representations/features determined automatically
• Selection of representations and features using the MKL approach is studied for an image categorization task
• Combining representations using the MKL approach performs marginally better than the combination of all representations
• The dimensionality of the feature vector obtained using the MKL approach is significantly smaller
55
1. Christopher M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006.
2. B. Yegnanarayana, Artificial Neural Networks, Prentice-Hall of India, 1999.
3. Satish Kumar, Neural Networks - A Class Room Approach, Tata McGraw-Hill, 2004.
4. Simon Haykin, Neural Networks - A Comprehensive Foundation, 2nd Edition, Prentice Hall, 1999.
5. John Shawe-Taylor and Nello Cristianini, Kernel Methods for Pattern Analysis, Cambridge University Press, 2004.
6. B. Scholkopf and A. J. Smola, Learning with Kernels - Support Vector Machines, Regularization, Optimization, and Beyond, MIT Press, 2002.
7. R. Herbrich, Learning Kernel Classifiers - Theory and Algorithms, MIT Press, 2002.
8. B. Scholkopf, K. Tsuda and J.-P. Vert, Kernel Methods in Computational Biology, MIT Press, 2004.
9. G. Camps-Valls, J. L. Rojo-Alvarez and M. Martinez-Ramon, Kernel Methods in Bioengineering, Signal and Image Processing, Idea Group Publishing, 2007.
10. F. Camastra and A. Vinciarelli, Machine Learning for Audio, Image and Video Analysis - Theory and Applications, Springer, 2008.
Web resource: www.kernelmachines.org
Text Books
56
Thank You