Kernel Clustering
Transcript of Kernel Clustering
![Page 1: Kernel Clustering](https://reader031.fdocuments.us/reader031/viewer/2022020622/61ed1c0c0a26af302b245278/html5/thumbnails/1.jpg)
Kernel Clustering
CSE 902
Radha Chitta
![Page 2: Kernel Clustering](https://reader031.fdocuments.us/reader031/viewer/2022020622/61ed1c0c0a26af302b245278/html5/thumbnails/2.jpg)
Outline
● Data analysis
● Clustering
● Kernel Clustering
● Kernel K-means and Spectral Clustering
● Challenges and Solutions
![Page 3: Kernel Clustering](https://reader031.fdocuments.us/reader031/viewer/2022020622/61ed1c0c0a26af302b245278/html5/thumbnails/3.jpg)
Data Analysis
![Page 4: Kernel Clustering](https://reader031.fdocuments.us/reader031/viewer/2022020622/61ed1c0c0a26af302b245278/html5/thumbnails/4.jpg)
Learning Problems
![Page 5: Kernel Clustering](https://reader031.fdocuments.us/reader031/viewer/2022020622/61ed1c0c0a26af302b245278/html5/thumbnails/5.jpg)
Learning Problems
➢ Supervised Learning (Classification)
➢ Semi-supervised learning
This is a cat. This is a dog. What is this?
This is a cat. This is a dog.
![Page 6: Kernel Clustering](https://reader031.fdocuments.us/reader031/viewer/2022020622/61ed1c0c0a26af302b245278/html5/thumbnails/6.jpg)
Learning Problems
➢ Clustering into “known” number of clusters
➢ Completely unsupervised learning
There are two objects in this set of images. Separate them into “two” groups.
No additional information. Just images
![Page 7: Kernel Clustering](https://reader031.fdocuments.us/reader031/viewer/2022020622/61ed1c0c0a26af302b245278/html5/thumbnails/7.jpg)
Clustering Algorithms

- Hierarchical
  - Agglomerative: Complete Link, Single Link
  - Divisive
- Partitional
  - Distribution/density: Mixture models, DBScan
  - Squared-error: K-means, Kernel K-means
  - Graph theoretic: Spectral, MST
  - Nearest neighbor: 1 to k-NN
![Page 9: Kernel Clustering](https://reader031.fdocuments.us/reader031/viewer/2022020622/61ed1c0c0a26af302b245278/html5/thumbnails/9.jpg)
The K-means problem
Given a data set $D = \{x_1, x_2, \ldots, x_n\}$, find $C$ representative points (cluster centers) that minimize the sum of squared error:

$$\min_{U} \sum_{k=1}^{C} \sum_{i=1}^{n} U_{ki}\,\|c_k - x_i\|_2^2$$

where $\|\cdot\|_2$ is the Euclidean distance, $c_k$ are the cluster centers, and $U_{ki} = 1$ if $x_i$ belongs to cluster $k$ and $0$ otherwise.

This is a 0-1 integer programming problem: NP-Complete.
![Page 10: Kernel Clustering](https://reader031.fdocuments.us/reader031/viewer/2022020622/61ed1c0c0a26af302b245278/html5/thumbnails/10.jpg)
Lloyd's K-means algorithm
● Initialize the cluster labels randomly.
● Repeat until convergence:
– Find the centers (means) of the clusters.
– Assign each point to the closest center.

O(nCd) running time per iteration
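To make the two alternating steps concrete, here is a minimal NumPy sketch of Lloyd's algorithm (my own illustration, not code from the slides; initialization, empty-cluster handling, and the convergence test are arbitrary choices):

```python
import numpy as np

def lloyd_kmeans(X, C, max_iter=100, seed=0):
    """Minimal Lloyd's K-means: X is an (n, d) array, C the number of clusters."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    labels = rng.integers(0, C, size=n)           # initialize labels randomly
    for _ in range(max_iter):
        # Find the centers of the clusters (mean of the points assigned to each one).
        centers = np.vstack([
            X[labels == k].mean(axis=0) if np.any(labels == k)
            else X[rng.integers(n)]               # re-seed an empty cluster
            for k in range(C)
        ])
        # Assign every point to the closest center: this step costs O(nCd).
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):    # converged: labels unchanged
            break
        labels = new_labels
    return labels, centers

# Example: two well-separated Gaussian blobs.
X = np.vstack([np.random.randn(100, 2), np.random.randn(100, 2) + 5.0])
labels, centers = lloyd_kmeans(X, C=2)
```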
![Page 11: Kernel Clustering](https://reader031.fdocuments.us/reader031/viewer/2022020622/61ed1c0c0a26af302b245278/html5/thumbnails/11.jpg)
K-means doesn't always work!
K-means with the Euclidean distance implicitly assumes spherical clusters (unit covariance) that are linearly separable.
![Page 12: Kernel Clustering](https://reader031.fdocuments.us/reader031/viewer/2022020622/61ed1c0c0a26af302b245278/html5/thumbnails/12.jpg)
Non-linear separation
Any finite data set is linearly separable in a sufficiently high-dimensional space.

Map the data into a feature space $\mathcal{H}$ via $\phi : \mathcal{X} \to \mathcal{H}$, e.g. the polynomial map

$$\phi(x) = \left(x_1^2,\; \sqrt{2}\,x_1 x_2,\; x_2^2\right),$$

which takes the input coordinates $(x_1, x_2)$ to the feature coordinates $(z_1, z_2, z_3)$.
![Page 13: Kernel Clustering](https://reader031.fdocuments.us/reader031/viewer/2022020622/61ed1c0c0a26af302b245278/html5/thumbnails/13.jpg)
Dot product kernels
The Euclidean distance in $\mathcal{H}$ only requires dot products:

$$\|\phi(x) - \phi(y)\|_2^2 = \phi(x)^T\phi(x) - 2\,\phi(x)^T\phi(y) + \phi(y)^T\phi(y)$$

For the polynomial map $\phi([x_1, x_2]) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)$, the dot product never has to be computed in $\mathcal{H}$:

$$\phi(x)^T\phi(y) = \left(x_1^2 \;\; \sqrt{2}\,x_1 x_2 \;\; x_2^2\right)\begin{pmatrix} y_1^2 \\ \sqrt{2}\,y_1 y_2 \\ y_2^2\end{pmatrix} = (x^T y)^2$$
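A quick numeric check of the identity above (a small sketch; the vectors x and y are arbitrary test values):

```python
import numpy as np

def phi(x):
    # Explicit degree-2 polynomial feature map for 2-D inputs.
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

x, y = np.array([1.0, 2.0]), np.array([3.0, -1.0])
lhs = phi(x) @ phi(y)          # dot product computed in feature space H
rhs = (x @ y) ** 2             # kernel evaluated directly in input space
print(lhs, rhs)                # both equal (x^T y)^2 = 1.0
```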
![Page 15: Kernel Clustering](https://reader031.fdocuments.us/reader031/viewer/2022020622/61ed1c0c0a26af302b245278/html5/thumbnails/15.jpg)
Dot product kernels
The quantity $\kappa(x, y) = \phi(x)^T\phi(y) = (x^T y)^2$ is called the kernel function.
![Page 16: Kernel Clustering](https://reader031.fdocuments.us/reader031/viewer/2022020622/61ed1c0c0a26af302b245278/html5/thumbnails/16.jpg)
Mercer kernels
The squared distance in the feature space can be written purely in terms of the kernel:

$$d_{\mathcal{H}}^2(x, y) = \kappa(x, x) - 2\,\kappa(x, y) + \kappa(y, y)$$

Mercer's condition: there exists a mapping $\phi(x)$ and an expansion $\kappa(x, y) = \phi(x)^T\phi(y)$ iff, for any $g(x)$ such that $\int g(x)^2\,dx$ is finite,

$$\iint \kappa(x, y)\, g(x)\, g(y)\, dx\, dy \;\geq\; 0.$$
![Page 17: Kernel Clustering](https://reader031.fdocuments.us/reader031/viewer/2022020622/61ed1c0c0a26af302b245278/html5/thumbnails/17.jpg)
Kernels
● Polynomial (linear when $p = 1$): $\kappa(x, y) = (x^T y)^p$
● Gaussian: $\kappa(x, y) = \exp\left(-\lambda\,\|x - y\|_2^2\right)$
● Chi-square: $\kappa(x, y) = 1 - \sum_{i=1}^{d} \dfrac{(x_i - y_i)^2}{0.5\,(x_i + y_i)}$
● Histogram intersection: $\kappa(x, y) = \sum_{i=1}^{m} \min\left(\mathrm{hist}(x)_i,\, \mathrm{hist}(y)_i\right)$
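The kernels above map directly to short functions of a pair of vectors; a NumPy sketch (the bandwidth λ and degree p are arbitrary choices, and the chi-square kernel assumes strictly positive entries):

```python
import numpy as np

def polynomial_kernel(x, y, p=2):
    return (x @ y) ** p

def gaussian_kernel(x, y, lam=0.5):
    return np.exp(-lam * np.sum((x - y) ** 2))

def chi_square_kernel(x, y):
    # Assumes strictly positive entries (e.g. normalized histograms).
    return 1.0 - np.sum((x - y) ** 2 / (0.5 * (x + y)))

def histogram_intersection_kernel(hx, hy):
    # hx, hy are the histograms of the two objects.
    return np.sum(np.minimum(hx, hy))

# Gram matrix of a small data set under the Gaussian kernel.
X = np.random.rand(5, 3)
K = np.array([[gaussian_kernel(a, b) for b in X] for a in X])
print(np.all(np.linalg.eigvalsh(K) >= -1e-10))  # Mercer: K is positive semidefinite
```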
![Page 18: Kernel Clustering](https://reader031.fdocuments.us/reader031/viewer/2022020622/61ed1c0c0a26af302b245278/html5/thumbnails/18.jpg)
Kernel K-means
Find $C$ clusters in the Hilbert (feature) space $\mathcal{H}$ that minimize the sum of squared error:

$$\min_{U} \sum_{k=1}^{C} \sum_{i=1}^{n} U_{ki}\,\|c_k - \phi(x_i)\|_{\mathcal{H}}^2$$

Calculate the kernel (Gram) matrix $K = [\kappa(x_i, x_j)]_{n \times n},\; x_i, x_j \in D$, and run K-means in $\mathcal{H}$:

$$\min_{U} \sum_{i=1}^{n} \sum_{k=1}^{C} U_{ki}\left[\kappa(x_i, x_i) - \frac{2}{n_k}\sum_{j=1}^{n} U_{kj}\,\kappa(x_i, x_j) + \frac{1}{n_k^2}\sum_{j=1}^{n}\sum_{l=1}^{n} U_{kj}U_{kl}\,\kappa(x_j, x_l)\right]$$

Equivalently: $\max\; \operatorname{trace}(U K U')$ for a normalized cluster indicator matrix $U$.
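A minimal kernel K-means sketch that works directly from a precomputed Gram matrix, using the expanded point-to-center distance above (my own illustration; the Gaussian-kernel usage example and its bandwidth are arbitrary, and the result depends on initialization):

```python
import numpy as np

def kernel_kmeans(K, C, max_iter=100, seed=0):
    """K: (n, n) kernel (Gram) matrix; returns a label vector of length n."""
    rng = np.random.default_rng(seed)
    n = K.shape[0]
    labels = rng.integers(0, C, size=n)
    diag = np.diag(K)
    for _ in range(max_iter):
        dist = np.empty((n, C))
        for k in range(C):
            idx = np.flatnonzero(labels == k)
            if idx.size == 0:
                dist[:, k] = np.inf
                continue
            nk = idx.size
            # ||phi(x_i) - c_k||^2 = K_ii - (2/n_k) sum_j K_ij + (1/n_k^2) sum_{j,l} K_jl
            dist[:, k] = (diag
                          - 2.0 / nk * K[:, idx].sum(axis=1)
                          + K[np.ix_(idx, idx)].sum() / nk**2)
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels

# Usage with a Gaussian kernel on two concentric rings (not linearly separable;
# whether the rings are recovered depends on the bandwidth and the initialization).
t = np.linspace(0, 2 * np.pi, 200)
X = np.vstack([np.c_[np.cos(t), np.sin(t)], 3 * np.c_[np.cos(t), np.sin(t)]])
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
labels = kernel_kmeans(np.exp(-2.0 * sq), C=2)
```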
![Page 19: Kernel Clustering](https://reader031.fdocuments.us/reader031/viewer/2022020622/61ed1c0c0a26af302b245278/html5/thumbnails/19.jpg)
K-means vs. Kernel K-means

Kernel K-means is able to find "complex" (non-linearly separable) clusters.

[Figure: the same data set clustered by K-means and by kernel K-means.]
![Page 20: Kernel Clustering](https://reader031.fdocuments.us/reader031/viewer/2022020622/61ed1c0c0a26af302b245278/html5/thumbnails/20.jpg)
Spectral Clustering
Equivalent to weighted kernel K-means.

Represent the data using the top $C$ eigenvectors of the normalized kernel matrix and obtain the clusters in that eigenspace:

$$\max\; \operatorname{trace}\!\left(U D^{-1/2} K D^{-1/2} U'\right)$$
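A minimal sketch of the normalized spectral clustering procedure implied by this trace objective, assuming K is a symmetric nonnegative affinity matrix and D its degree matrix; row normalization of the eigenvectors and the use of scikit-learn's KMeans for the final step are my own implementation choices:

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_clustering(K, C):
    """K: (n, n) symmetric nonnegative affinity matrix; C: number of clusters."""
    d = K.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    M = D_inv_sqrt @ K @ D_inv_sqrt                 # normalized affinity D^{-1/2} K D^{-1/2}
    eigvals, eigvecs = np.linalg.eigh(M)
    V = eigvecs[:, -C:]                             # top-C eigenvectors (columns)
    V = V / (np.linalg.norm(V, axis=1, keepdims=True) + 1e-12)  # row-normalize
    labels = KMeans(n_clusters=C, n_init=10, random_state=0).fit_predict(V)
    return labels

# Usage: build an affinity matrix (here a Gaussian kernel) and cluster.
X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 4.0])
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
labels = spectral_clustering(np.exp(-0.5 * sq), C=2)
```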
![Page 21: Kernel Clustering](https://reader031.fdocuments.us/reader031/viewer/2022020622/61ed1c0c0a26af302b245278/html5/thumbnails/21.jpg)
Kernel K-means: Challenges
➢ Scalability
  ➢ O(n²) complexity to calculate the kernel matrix
  ➢ More expensive than K-means
➢ Out-of-sample clustering
  ➢ No explicit representation for the centers
  ➢ Expensive to assign a new point to a cluster
➢ Kernel selection
  ➢ The best kernel is application- and data-dependent
  ➢ The wrong kernel can lead to results worse than K-means
![Page 22: Kernel Clustering](https://reader031.fdocuments.us/reader031/viewer/2022020622/61ed1c0c0a26af302b245278/html5/thumbnails/22.jpg)
Scalability
Number of operations (and runtime*) to cluster n objects into C = 10 clusters.

d = 100:

| No. of objects (n) | K-means, O(nCd) | Kernel K-means, O(n²C) |
|---|---|---|
| 1M | 10^11 (3.2 s) | 10^15 |
| 10M | 10^12 (34.9 s) | 10^17 |
| 100M | 10^13 (5508.5 s) | 10^19 |
| 1B | 10^14 | 10^21 |

d = 10,000:

| No. of objects (n) | K-means, O(nCd) | Kernel K-means, O(n²C) |
|---|---|---|
| 1M | 10^13 (6412.9 s) | 10^16 |
| 10M | 10^14 | 10^18 |
| 100M | 10^15 | 10^20 |
| 1B | 10^16 | 10^22 |

* Runtime in seconds on an Intel Xeon 2.8 GHz processor with 12 GB memory.

A petascale supercomputer (IBM Sequoia, June 2012) with ~1 exabyte of memory would be needed to kernel-cluster billions of data points!
![Page 23: Kernel Clustering](https://reader031.fdocuments.us/reader031/viewer/2022020622/61ed1c0c0a26af302b245278/html5/thumbnails/23.jpg)
Scalability
➢ Reordered clustering: Reordered kernel K-means
➢ Distributed clustering: Parallel kernel K-means
➢ Sampling-based approximations: Kernel matrix approximation, Non-linear random projections
![Page 24: Kernel Clustering](https://reader031.fdocuments.us/reader031/viewer/2022020622/61ed1c0c0a26af302b245278/html5/thumbnails/24.jpg)
Reordered kernel K-means
Reorder the clustering process so that only a small portion of the kernel matrix is required at a time: store the full kernel matrix on disk and load part of it into memory.

The distance to a center is

$$d^2(x_i, c_k) = \kappa(x_i, x_i) - \frac{2}{n_k}\sum_{j=1}^{n} U_{kj}\,\kappa(x_i, x_j) + \frac{1}{n_k^2}\sum_{j=1}^{n}\sum_{l=1}^{n} U_{kj}U_{kl}\,\kappa(x_j, x_l)$$

A Large Scale Clustering Scheme for Kernel K-means, Zhang and Rudnicky, ICPR 2002
![Page 25: Kernel Clustering](https://reader031.fdocuments.us/reader031/viewer/2022020622/61ed1c0c0a26af302b245278/html5/thumbnails/25.jpg)
Reordered kernel K-means
Update the terms $f$ and $g$ incrementally using the portion of the kernel matrix available in memory, and combine them to obtain the final partition:

$$f(x_i, c_k) = -\frac{2}{n_k}\sum_{j=1}^{n} U_{kj}\,\kappa(x_i, x_j) \qquad\qquad g(c_k) = \frac{1}{n_k^2}\sum_{j=1}^{n}\sum_{l=1}^{n} U_{kj}U_{kl}\,\kappa(x_j, x_l)$$
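To make the reordering idea concrete, here is a sketch (my own illustration, not Zhang and Rudnicky's code) that streams the kernel matrix from disk in row blocks via a NumPy memmap and accumulates the f(x_i, c_k) term one block at a time; the file layout and block size are assumptions:

```python
import numpy as np

def accumulate_f(kernel_path, n, U, n_k, block_rows=1000):
    """Compute f(x_i, c_k) = -(2/n_k) * sum_j U_kj K_ij without holding all of K in memory.

    kernel_path: file holding the n x n kernel matrix (float64, row-major).
    U: (C, n) 0/1 assignment matrix; n_k: (C,) cluster sizes (assumed non-empty).
    """
    K = np.memmap(kernel_path, dtype=np.float64, mode='r', shape=(n, n))
    C = U.shape[0]
    f = np.zeros((n, C))
    for start in range(0, n, block_rows):
        stop = min(start + block_rows, n)
        block = np.asarray(K[start:stop, :])            # load one row block into memory
        f[start:stop, :] = -2.0 * (block @ U.T) / n_k   # broadcast divide by cluster sizes
    return f
```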
![Page 26: Kernel Clustering](https://reader031.fdocuments.us/reader031/viewer/2022020622/61ed1c0c0a26af302b245278/html5/thumbnails/26.jpg)
Parallel Kernel K-means
Scaling Up Machine Learning: Parallel and Distributed Approaches, Bekkerman et al., 2012
![Page 27: Kernel Clustering](https://reader031.fdocuments.us/reader031/viewer/2022020622/61ed1c0c0a26af302b245278/html5/thumbnails/27.jpg)
Parallel Kernel K-means

| Number of processors | Speedup |
|---|---|
| 2 | 1.9 |
| 3 | 2.5 |
| 4 | 3.8 |
| 5 | 4.9 |
| 6 | 5.2 |
| 7 | 5.5 |
| 8 | 6.3 |
| 9 | 5.5 |
| 10 | 5.1 |

Network communication cost increases with the number of processors.

Setup: K-means on 1 billion 2-D points with 2 clusters; 2.3 GHz quad-core Intel Xeon processors with 8 GB memory in the HPCC intel07 cluster.
![Page 28: Kernel Clustering](https://reader031.fdocuments.us/reader031/viewer/2022020622/61ed1c0c0a26af302b245278/html5/thumbnails/28.jpg)
Approx. Kernel K-means
➔ Reordering reduces memory usage but not runtime complexity; parallelization does not reduce the asymptotic runtime or memory complexity.

➔ Kernel approximation: use sampling to avoid calculating the full kernel matrix.

➔ Express the cluster centers in terms of the sampled points.
Approximate Kernel k-means: Solution to Large Scale Kernel Clustering, KDD 2011
![Page 29: Kernel Clustering](https://reader031.fdocuments.us/reader031/viewer/2022020622/61ed1c0c0a26af302b245278/html5/thumbnails/29.jpg)
Approx. Kernel K-means

Randomly sample $m$ points $\{y_1, y_2, \ldots, y_m\},\; m \ll n$, and compute the kernel similarity matrices $K_A$ ($m \times m$) and $K_B$ ($n \times m$):

$$(K_A)_{ij} = \phi(y_i)^T\phi(y_j) \qquad\qquad (K_B)_{ij} = \phi(x_i)^T\phi(y_j)$$
![Page 30: Kernel Clustering](https://reader031.fdocuments.us/reader031/viewer/2022020622/61ed1c0c0a26af302b245278/html5/thumbnails/30.jpg)
Approx. Kernel K-means

Iteratively optimize for the cluster centers $c_k = \sum_{j=1}^{m}\alpha_{jk}\,\phi(y_j)$:

$$\min_{U}\max_{\alpha}\; \sum_{k=1}^{C}\sum_{i=1}^{n} U_{ik}\,\Big\|\phi(x_i) - \sum_{j=1}^{m}\alpha_{jk}\,\phi(y_j)\Big\|^2$$
![Page 34: Kernel Clustering](https://reader031.fdocuments.us/reader031/viewer/2022020622/61ed1c0c0a26af302b245278/html5/thumbnails/34.jpg)
Approx. Kernel K-means

Obtain the final cluster labels after convergence.

Properties
➔ Only an O(nm)-sized matrix has to be calculated – almost linear runtime and memory complexity.
➔ Bounded O(1/m) approximation error.
➔ Equivalent to running K-means on the Nystrom approximation $K_B K_A^{-1} K_B^T$.
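A sketch of this idea: given U, the centers restricted to the span of the m sampled points have a closed-form coefficient matrix, so the algorithm alternates between solving for α and reassigning labels. This is my own illustration of the scheme (including the pseudo-inverse regularization), not the reference implementation from the KDD 2011 paper:

```python
import numpy as np

def approx_kernel_kmeans(K_B, K_A, C, max_iter=100, seed=0):
    """Approximate kernel K-means with centers restricted to the span of m sampled points.

    K_B: (n, m) kernel values between all points and the m samples.
    K_A: (m, m) kernel values among the m samples.
    """
    rng = np.random.default_rng(seed)
    n, m = K_B.shape
    labels = rng.integers(0, C, size=n)
    K_A_inv = np.linalg.pinv(K_A + 1e-8 * np.eye(m))     # regularize for stability
    for _ in range(max_iter):
        # Given the assignment, the best alpha_k solves K_A alpha_k = mean of K_B rows in cluster k.
        alphas = np.zeros((m, C))
        for k in range(C):
            idx = labels == k
            if idx.any():
                alphas[:, k] = K_A_inv @ K_B[idx].mean(axis=0)
        # Distance of every point to every center, up to the constant kappa(x_i, x_i).
        dist = -2.0 * K_B @ alphas + np.einsum('jk,jl,lk->k', alphas, K_A, alphas)
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels
```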
![Page 35: Kernel Clustering](https://reader031.fdocuments.us/reader031/viewer/2022020622/61ed1c0c0a26af302b245278/html5/thumbnails/35.jpg)
Clustering 80M Tiny Images

Average clustering time into 100 clusters (2.4 GHz, 150 GB memory):

| Method | Time |
|---|---|
| Approximate kernel K-means (m = 1,000) | 8.5 hours |
| K-means | 6 hours |

[Figure: example clusters.]
![Page 36: Kernel Clustering](https://reader031.fdocuments.us/reader031/viewer/2022020622/61ed1c0c0a26af302b245278/html5/thumbnails/36.jpg)
Clustering 80M Tiny Images
Clustering accuracy on CIFAR-10:

| Method | Accuracy (%) |
|---|---|
| Kernel K-means | 29.94 |
| Approximate kernel K-means (m = 5,000) | 29.76 |
| Nystrom-approximation-based spectral clustering** | 27.09 |
| K-means | 26.70 |

Best supervised classification accuracy on CIFAR-10 using GIST features: 54.7*

* Ranzato et al., Modeling pixel means and covariances using factorized third-order Boltzmann machines, CVPR 2010
** Fowlkes et al., Spectral grouping using the Nystrom method, PAMI 2004
![Page 37: Kernel Clustering](https://reader031.fdocuments.us/reader031/viewer/2022020622/61ed1c0c0a26af302b245278/html5/thumbnails/37.jpg)
Matrix approximations
➢ Nystrom
  ➢ Randomly select columns: $K \approx K_B K_A^{-1} K_B^T$
➢ CUR
  ➢ A special form of Nystrom that applies to any matrix K
  ➢ Select the most important columns C and rows R
  ➢ Find U that minimizes the approximation error $\|K - CUR\|_F$
➢ Sparse approximation
  ➢ Find a matrix N with $E[N_{ij}] = 0$ and small $\mathrm{var}(N_{ij})$ such that $\hat{K} = K + N$ is sparse.
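A small sketch of the Nystrom column-sampling approximation and its Frobenius-norm error (illustrative only; the Gaussian kernel, bandwidth, data size, and number of sampled columns are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 5))
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-0.5 * sq)                              # full n x n Gaussian kernel matrix

m = 50
cols = rng.choice(len(X), size=m, replace=False)   # randomly select m columns
K_B = K[:, cols]                                   # n x m
K_A = K[np.ix_(cols, cols)]                        # m x m
K_hat = K_B @ np.linalg.pinv(K_A) @ K_B.T          # Nystrom reconstruction

rel_err = np.linalg.norm(K - K_hat) / np.linalg.norm(K)
print(f"relative Frobenius error with m={m}: {rel_err:.3f}")
```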
![Page 38: Kernel Clustering](https://reader031.fdocuments.us/reader031/viewer/2022020622/61ed1c0c0a26af302b245278/html5/thumbnails/38.jpg)
Out-of-sample clustering
To cluster a new point $x$ using kernel K-means, compute its distance to every center:

$$d^2(x, c_k) = \kappa(x, x) - \frac{2}{n_k}\sum_{j=1}^{n} U_{kj}\,\kappa(x, x_j) + \frac{1}{n_k^2}\sum_{j=1}^{n}\sum_{l=1}^{n} U_{kj}U_{kl}\,\kappa(x_j, x_l)$$

This requires storing the full kernel matrix and computing O(n) new kernel values, and there is no explicit representation for the centers.
![Page 41: Kernel Clustering](https://reader031.fdocuments.us/reader031/viewer/2022020622/61ed1c0c0a26af302b245278/html5/thumbnails/41.jpg)
Out-of-sample clustering
➢ Kernel PCA (a sketch follows this list)
  ➢ Project the data onto the first C eigenvectors (principal components) of the kernel matrix.
  ➢ Perform clustering in the eigenspace.
  ➢ Ref: Multiway Spectral Clustering with Out-of-Sample Extensions through Weighted Kernel PCA, Alzate et al., PAMI 2010
➢ Non-linear random projections
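A sketch of the kernel-PCA route for out-of-sample assignment: embed the training points with the top eigenvectors of the kernel matrix, cluster in the eigenspace, and project a new point from its kernel values. Centering of the kernel matrix is omitted for brevity; the toy Gaussian kernel, its bandwidth, and the use of scikit-learn's KMeans are my own choices, not the cited paper's setup:

```python
import numpy as np
from sklearn.cluster import KMeans

def kpca_embed(K, C):
    """Top-C kernel-PCA embedding of the training points from the kernel matrix K (n x n)."""
    eigvals, eigvecs = np.linalg.eigh(K)
    V, lam = eigvecs[:, -C:], eigvals[-C:]       # leading C eigenpairs
    return V * np.sqrt(lam), V, lam              # (n, C) embedding

def kpca_project(k_new, V, lam):
    """Embed a new point from its kernel values k_new[j] = kappa(x_new, x_j)."""
    return (V.T @ k_new) / np.sqrt(lam)

# Toy usage with a Gaussian kernel (bandwidth chosen arbitrarily).
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 3))
gauss = lambda A, B: np.exp(-0.5 * ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1))
Z, V, lam = kpca_embed(gauss(X, X), C=4)
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(Z)

x_new = rng.standard_normal(3)
z_new = kpca_project(gauss(x_new[None, :], X)[0], V, lam)
label = km.predict(z_new[None, :])[0]            # out-of-sample assignment
```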
![Page 42: Kernel Clustering](https://reader031.fdocuments.us/reader031/viewer/2022020622/61ed1c0c0a26af302b245278/html5/thumbnails/42.jpg)
Non-linear Random Projections
Random projection for dimensionality reduction: generate a random $d \times r$ matrix $R$ and project the data into an $r$-dimensional space,

$$x' = \frac{1}{\sqrt{r}}\, R^T x \qquad\qquad y' = \frac{1}{\sqrt{r}}\, R^T y$$

Johnson-Lindenstrauss Lemma: if $R$ is orthonormal and $r$ is sufficiently large, distances are preserved in the projected space,

$$(1-\epsilon)\,\|x - y\|^2 \;\leq\; \|x' - y'\|^2 \;\leq\; (1+\epsilon)\,\|x - y\|^2$$

Efficient Kernel Clustering Using Random Fourier Features, ICDM 2012
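A quick empirical check of this distance preservation (a sketch; a Gaussian random R is used instead of an exactly orthonormal one, which preserves distances in expectation, and the dimensions are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, n = 1000, 200, 50
X = rng.standard_normal((n, d))
R = rng.standard_normal((d, r))            # random d x r projection matrix
X_proj = X @ R / np.sqrt(r)                # x' = R^T x / sqrt(r), applied to all rows

# Compare pairwise squared distances before and after projection.
orig = ((X[:, None] - X[None, :]) ** 2).sum(-1)
proj = ((X_proj[:, None] - X_proj[None, :]) ** 2).sum(-1)
ratio = proj[np.triu_indices(n, k=1)] / orig[np.triu_indices(n, k=1)]
print(ratio.min(), ratio.max())            # concentrated around 1 for large enough r
```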
![Page 43: Kernel Clustering](https://reader031.fdocuments.us/reader031/viewer/2022020622/61ed1c0c0a26af302b245278/html5/thumbnails/43.jpg)
Non-linear Random Projections
Kernel K-means is K-means in the Hilbert space, where the data is linearly separable.

1. Efficiently project the data in the Hilbert space to a low-dimensional space; linear separability is preserved.
2. Apply K-means in the low-dimensional space.
3. Obtain a low-dimensional representation of the cluster centers.
Efficient Kernel Clustering Using Random Fourier Features, ICDM 2012
![Page 44: Kernel Clustering](https://reader031.fdocuments.us/reader031/viewer/2022020622/61ed1c0c0a26af302b245278/html5/thumbnails/44.jpg)
Non-linear Random Projections

Random Fourier Features: map the data into a low-dimensional space using

$$f(\omega, x) = \left[\cos(\omega^T x)\;\; \sin(\omega^T x)\right]^T, \qquad \kappa(x, y) = E_\omega\!\left[f(\omega, x)^T f(\omega, y)\right],$$

where $\omega$ is drawn from the Fourier transform of the (shift-invariant) kernel.
![Page 45: Kernel Clustering](https://reader031.fdocuments.us/reader031/viewer/2022020622/61ed1c0c0a26af302b245278/html5/thumbnails/45.jpg)
Non-linear Random Projections

Randomly sample $m$ vectors $\{\omega_1, \omega_2, \ldots, \omega_m\},\; m \ll n$, from the Fourier transform of the kernel and project the points using $f(\omega, x)$:

$$z(x) = \frac{1}{\sqrt{m}}\left[\cos(\omega_1^T x)\, \ldots\, \cos(\omega_m^T x)\;\; \sin(\omega_1^T x)\, \ldots\, \sin(\omega_m^T x)\right]$$

$$H = \left[z(x_1)^T\;\; z(x_2)^T\;\; \ldots\;\; z(x_n)^T\right]$$
![Page 46: Kernel Clustering](https://reader031.fdocuments.us/reader031/viewer/2022020622/61ed1c0c0a26af302b245278/html5/thumbnails/46.jpg)
Non-linear Random Projections

Cluster $H$ using K-means and obtain the partition.

Properties
➔ O(nmd) runtime complexity and O(nm) memory complexity.
➔ Bounded $O(1/\sqrt{m})$ approximation error.
➔ Explicit representation for the centers: a new point is added by projecting it and assigning it to the closest center.
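A sketch of the random-Fourier-feature pipeline for the Gaussian kernel $\kappa(x,y) = \exp(-\lambda\|x-y\|^2)$, whose Fourier transform is Gaussian, so the $\omega$'s are sampled from $N(0,\, 2\lambda I)$. The final clustering step uses scikit-learn's KMeans, which is my own implementation choice rather than the paper's exact setup:

```python
import numpy as np
from sklearn.cluster import KMeans

def rff_features(X, m, lam, seed=0):
    """Map X (n x d) to the 2m-dimensional random Fourier feature matrix H."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    # For the Gaussian kernel exp(-lam * ||x - y||^2), omega ~ N(0, 2 * lam * I).
    W = rng.normal(scale=np.sqrt(2.0 * lam), size=(d, m))
    P = X @ W                                            # omega_j^T x for all points
    return np.hstack([np.cos(P), np.sin(P)]) / np.sqrt(m)

def rff_kernel_clustering(X, C, m=100, lam=1.0, seed=0):
    H = rff_features(X, m, lam, seed)                    # z(x_i) stacked row-wise
    km = KMeans(n_clusters=C, n_init=10, random_state=seed).fit(H)
    return km.labels_, km                                # centers live in the low-dim space

# A new point is clustered by projecting it with the same W (same seed) and
# picking the closest center:
#   z_new = rff_features(x_new[None, :], m, lam, seed);  label = km.predict(z_new)
```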
![Page 47: Kernel Clustering](https://reader031.fdocuments.us/reader031/viewer/2022020622/61ed1c0c0a26af302b245278/html5/thumbnails/47.jpg)
Clustering 80M Tiny Images

Average clustering time into 100 clusters (2.4 GHz, 150 GB memory):

| Method | Time |
|---|---|
| Approximate kernel K-means (m = 1,000) | 8.5 hours |
| Clustering using Random Fourier Features (m = 1,000) | 7.8 hours |
| K-means | 6 hours |

[Figure: example clusters.]
![Page 48: Kernel Clustering](https://reader031.fdocuments.us/reader031/viewer/2022020622/61ed1c0c0a26af302b245278/html5/thumbnails/48.jpg)
Kernel Selection
Kernel selection problems can be grouped by the amount of supervision available:

- Supervised
  - Parametric: optimize the parameters of a kernel function, e.g. Multiple Kernel Learning (MKL)
  - Non-parametric: learn a PSD kernel matrix directly from the data, e.g. by maximizing alignment with the class labels
- Semi-supervised
  - Use pairwise constraints between points to build the similarity matrix
- Unsupervised
  - Parametric: unsupervised MKL, maximum margin clustering with MKL
  - Non-parametric: learn the PSD kernel matrix and the cluster labels simultaneously
![Page 49: Kernel Clustering](https://reader031.fdocuments.us/reader031/viewer/2022020622/61ed1c0c0a26af302b245278/html5/thumbnails/49.jpg)
Unsupervised Non-Parametric Kernel Learning
➔ Highly flexible; requires the least amount of prior knowledge
➔ Learns the kernel and the cluster labels simultaneously
➔ Maintains the spectrum of the data after projection into the feature space

Unsupervised non-parametric kernel learning algorithm, Liu et al., 2013

$$\min_{K, U}\; \operatorname{trace}(KL) + \frac{1}{2}\, U\left(K + \frac{\gamma}{C}\, K L + \frac{I}{C}\right)^{-1} U^T \qquad \text{s.t.}\;\; K \succeq 0, \;\; \operatorname{trace}(K^p) \leq B, \;\; -l \leq n_p - n_q \leq l$$

Here $L = I - D^{-1/2} W D^{-1/2}$ is the normalized graph Laplacian: the $\operatorname{trace}(KL)$ term is a spectrum regularizer, the second term is a squared loss, $K \succeq 0$ keeps the learned kernel PSD, and the cluster-size constraint $-l \leq n_p - n_q \leq l$ avoids overfitting and assigning all points to one cluster.
![Page 54: Kernel Clustering](https://reader031.fdocuments.us/reader031/viewer/2022020622/61ed1c0c0a26af302b245278/html5/thumbnails/54.jpg)
Unsupervised Non-Parametric Kernel Learning
The solution involves semidefinite programming (SDP) and eigendecomposition – O(n²) to O(n³) complexity.
![Page 55: Kernel Clustering](https://reader031.fdocuments.us/reader031/viewer/2022020622/61ed1c0c0a26af302b245278/html5/thumbnails/55.jpg)
Summary
● Kernel clustering is more accurate than linear clustering.
● Challenges: scalability, out-of-sample clustering, and kernel selection.
● Approximation and parallelization techniques handle scalability.
● Projections (kernel PCA, random Fourier features) enable out-of-sample clustering.
● Unsupervised kernel learning is flexible but computationally complex.
![Page 56: Kernel Clustering](https://reader031.fdocuments.us/reader031/viewer/2022020622/61ed1c0c0a26af302b245278/html5/thumbnails/56.jpg)
References
Kernel Theory
[1] Statistical Learning Theory, V.N. Vapnik
[2] Learning with Kernels, B. Scholkopf and A. Smola
Matrix Approximation
[3] On the Nystrom Method for Approximating a Gram Matrix for Improved Kernel-Based Learning, Drineas and Mahoney
[4] CUR matrix decompositions for improved data analysis, Drineas and Mahoney
[5] Improving CUR Matrix Decomposition and the Nystrom Approximation via Adaptive Sampling, Wang and Zhang
[6] Fast Computation of Low Rank Matrix Approximations, Achlioptas and McSherry
![Page 57: Kernel Clustering](https://reader031.fdocuments.us/reader031/viewer/2022020622/61ed1c0c0a26af302b245278/html5/thumbnails/57.jpg)
References
Random Projections
[7] Kernels as Features: On Kernels, Margins, and Low-dimensional Mappings, Balcan, Blum and Vempala
[8] Random Features for Large-Scale Kernel Machines, Rahimi and Recht
Kernel Learning
[9] Generalized Maximum Margin Clustering and Unsupervised Kernel Learning, Valizadegan and Jin