
Technical University of Berlin

Gaussian Process Clustering

Project in Artificial Intelligence and Machine Learning, WS 2013/14

Fangzhou Yang (352040)

Jing Cao (352030)

2014-4-17


Content

1. Introduction
2. Background -- Gaussian Process (GP)
   2.1 Gaussian Process
   2.2 Gaussian Process for Regression
   2.3 Support Estimate from Gaussian Process
3. Gaussian Process for Clustering
   3.1 Clustering based on the variance function
   3.2 Dynamical System for cluster characterization
   3.3 GP Clustering Algorithm
4. Implementation of GPC Package
   4.1 Gaussian Process Clustering Algorithm
   4.2 Measures for Clustering Performance
   4.3 Visualization
5. Test & Evaluation
   5.1 Dataset
   5.2 GP Clustering Algorithm Testing
      5.2.1 Test with R15 Dataset
      5.2.2 Test with Spiral Dataset
      5.2.3 Test with Iris Dataset
   5.3 Evaluation
6. A Clustering Application for location-based Data
7. Conclusion
8. Acknowledgements
9. References


1. Introduction

In 1994, as the field of neural networks matured and its models grew increasingly complex, researchers found that certain neural networks converge to a Gaussian process in the limit of infinite size, which made the Gaussian process a good candidate for simplifying practical machine learning problems [1]. Since then the Gaussian process has been widely used and shows great advantages in supervised learning problems such as regression and classification. This raises the question of whether the Gaussian process can also be efficient in solving unsupervised problems such as clustering.

This report summarizes our project on this question. The task of the project was to develop an understanding of the idea of clustering with Gaussian process models, following the work of Hyun-Chul Kim and Jaewook Lee [2]. Based on that, we implemented a Gaussian Process Clustering package in Python and performed clustering tests on different datasets.

This report is organized as follows: Section 2 describes the definition and main properties of the Gaussian process; Section 3 introduces the clustering algorithm based on Gaussian processes proposed by Hyun-Chul Kim and Jaewook Lee [2]; Section 4 focuses on our implementation of the Gaussian Process Clustering (GPC) package based on this algorithm; Section 5 tests and evaluates the clustering performance of the algorithm on different data sets; Section 6 presents a demo application of the algorithm; finally, Section 7 concludes the project, and the acknowledgements and references are given in the last two sections.

2. Background -- Gaussian Process (GP)

In this section, some basic knowledge of Gaussian processes is introduced to ease the discussion of the clustering algorithm based on them.

2.1 Gaussian Process

A Gaussian process is defined as follows:

Definition: A Gaussian Process is a collection of random variables, any finite number of which have (consistent) joint Gaussian distributions.[3]

Some properties of the Gaussian process make it widely applicable and easy to analyze. First of all, a Gaussian process can approximately describe many natural phenomena, which makes it a basic model for abstracting practical problems. Second, it has many convenient algebraic properties: a linear combination of Gaussian variables still follows a Gaussian distribution, and a Gaussian process is fully specified by its mean function and covariance function [3]. These properties simplify calculation and analysis.

2.2 Gaussian Process for Regression

Since a Gaussian process is fully specified by its mean function and covariance function, we first review Gaussian process regression to obtain these functions [4]. Suppose the training data points x_i ∈ ℝ^d with continuous target values t_i form a data set D. The regression problem is then to find the predictive distribution of the target value t̃ for a new data point x̃. The target function f represents the mapping t̃ = f(x̃).

Gaussian process regression assumes that f has a Gaussian process prior, so the density of any collection of target function values is modeled as a multivariate Gaussian density. The covariance matrix C is made up of terms C_ij = C(x_i, x_j; Θ), a parameterized function with hyperparameters Θ.

Given the covariance matrix C_N of the N training points, adding the new point x̃ yields

$$C_{N+1} = \begin{pmatrix} C_N & k \\ k^T & c \end{pmatrix} \qquad (2.2.1)$$

where k = [C(x̃, x_1; Θ), C(x̃, x_2; Θ), …, C(x̃, x_N; Θ)]^T and c = C(x̃, x̃; Θ). The predictive variance is then

$$\sigma^2(\tilde{x}) = c - k^T C_N^{-1} k \qquad (2.2.2)$$

The covariance function

$$C(x_i, x_j; \Theta) = v_0 \exp\Big\{ -\frac{1}{2} \sum_{m=1}^{d} l_m (x_i^m - x_j^m)^2 \Big\} + v_1 + \delta_{ij} v_2 \qquad (2.2.3)$$

is the kernel function. The hyperparameters Θ = {l_m | m = 1, …, d} ∪ {v_0, v_1, v_2} can be learned from the data.
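To make equations (2.2.2) and (2.2.3) concrete, the following NumPy sketch evaluates the predictive variance at a test point. It is a minimal illustration under our own naming and default hyperparameter choices, not the package interface described later.

```python
import numpy as np

def kernel(xi, xj, l, v0=1.0, v1=0.0):
    # Covariance function (2.2.3) without the delta_ij * v2 term, which
    # only contributes on the diagonal of the training covariance matrix.
    return v0 * np.exp(-0.5 * np.sum(l * (xi - xj) ** 2)) + v1

def predictive_variance(x_new, X, l, v0=1.0, v1=0.0, v2=1.0):
    """Equation (2.2.2): sigma^2(x) = c - k^T C^{-1} k."""
    N = X.shape[0]
    C = np.array([[kernel(X[i], X[j], l, v0, v1) for j in range(N)]
                  for i in range(N)]) + v2 * np.eye(N)      # adds delta_ij * v2
    k = np.array([kernel(x_new, X[i], l, v0, v1) for i in range(N)])
    c = v0 + v1 + v2                                        # C(x, x; Theta)
    return c - k @ np.linalg.solve(C, k)

# The variance is small in dense regions and approaches c far from the data.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
l = np.ones(2)
print(predictive_variance(np.zeros(2), X, l))      # small: dense region
print(predictive_variance(np.full(2, 5.0), X, l))  # near c: sparse region
```

Recomputing C for every test point is wasteful; in practice one inverts or factorizes C once and reuses it, which is what the later sketches assume.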

2.3 Support Estimate from Gaussian Process

From Section 2.2 we obtained the variance function of the Gaussian process. Observing a two-dimensional data set such as the one in Figure 1, it can easily be seen that the variances of the predictive values are smaller in dense areas and larger in sparse areas [5]. This means the variance indicates the density of the input data points. Since it is common to apply the support of a probability density function to clustering problems, we can take the variance as a good estimate of the support of such a density.


Figure 1 Examples of Gaussian process regression. The variance of the predictive value is related to the density of the data

3. Gaussian Process for Clustering

In this section we introduce the clustering algorithm based on the observation of the variance function, together with the relevant mathematical analysis and explanations of the key definitions [2].

3.1 Clustering based on the variance function

Let the cutting level r* represent the contours of the clusters, initially set as

$$r^* = \max_k \sigma^2(x_k) \qquad (3.1.1)$$

A cluster is then given by a connected set of the form {x : σ²(x) < r*}, and the data set decomposes into several disjoint connected sets,

$$L(r^*) := \{x : \sigma^2(x) \le r^*\} = C_1 \cup \dots \cup C_P \qquad (3.1.2)$$

Applying Gaussian processes to the clustering problem therefore means developing an algorithm that separates the input data points into these disjoint connected sets.
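In code, the cutting level (3.1.1) and the membership test for the level set (3.1.2) are one-liners on top of the variance sketch from Section 2.2 (predictive_variance, X and l are reused from there):

```python
# Cutting level (3.1.1): the largest variance over the training points.
r_star = max(predictive_variance(x, X, l) for x in X)

# Level-set test for (3.1.2): x lies in L(r*) iff its variance does not
# exceed the cutting level; the clusters are its connected components.
def in_level_set(x):
    return predictive_variance(x, X, l) <= r_star
```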


3.2 Dynamical System for cluster characterization

The algorithm rests on a few supporting lemmas, so in this section we describe how a data point is assigned to its corresponding cluster. A dynamical system is built as follows:

$$\frac{dx}{dt} = F(x) := -\nabla \sigma^2(x) \qquad (3.2.1)$$

For each initial state x(0) = x₀, the system has a unique time-evolution solution (trajectory). A state vector x̄ ∈ ℝ^d satisfying F(x̄) = 0 is called an equilibrium point of the dynamical system; if all eigenvalues of the Jacobian of F at x̄ have negative real parts (equivalently, the Hessian of σ² at x̄ is positive definite), it is called an (asymptotically) stable equilibrium point (SEP).

Figure 2 The partitioning property of the proposed clustering algorithm. (The dashed lines represent the basin boundaries separating the clusters, and the arrows represent the direction of the system trajectories.)

As Figure 2 shows, the vector field F(x) in equation (3.2.1) is orthogonal to the contour {y : σ²(y) = r*} and points inward from the surface, which makes every trajectory remain in one of the clusters. This property leads to two lemmas on which the clustering algorithm relies.

Lemma 1. For any given level value r > 0, each connected component of the level set L(r) = {x : σ²(x) ≤ r} is positively invariant; that is, if a point is on a connected component of L(r), then its entire positive trajectory lies on the same component.


Lemma 2. The trajectory of the system (3.2.1) approaches one of its equilibrium points; in particular, almost every trajectory approaches one of its stable equilibrium points.

Lemma 1 states that if any point on a trajectory is found to lie within one cluster, then all other points on the same trajectory lie within the same cluster. Lemma 2 states that several trajectories will finally approach the same stable equilibrium point, which can be found through equation (3.2.1). Together, these two lemmas give the basic idea of the GP clustering algorithm: the problem of assigning all data points to their clusters simplifies to assigning the corresponding stable equilibrium points to the clusters.

The remaining problem is how to assign the stable equilibrium points to the clusters. If two SEPs are located in different clusters, then the straight line connecting them must contain a point y satisfying σ²(y) > r*; if there is no such point on the line, the two SEPs lie within the same cluster. Based on this idea, an adjacency matrix A is built to support the algorithm:

$$A_{ij} = \begin{cases} 1, & \text{if } \sigma^2(y) \le r^* \text{ for each point } y \text{ on the segment between } s_i \text{ and } s_j \\ 0, & \text{otherwise} \end{cases} \qquad (3.2.2)$$

where s_i and s_j denote the i-th and j-th SEPs.

3.3 GP Clustering Algorithm

In conclusion, the GP clustering algorithm can be stated in the following four steps:

Step 1: For given unlabeled training data X = {x_i | i = 1, 2, …, N}, construct the variance function σ²(·) from equation (2.2.2) and compute the level value r* = max_{x_i ∈ X} σ²(x_i).

Step 2: Using each data point as an initial value, apply the dynamical system (3.2.1) to find its corresponding stable equilibrium point. Denote the distinct SEPs by s_i, i = 1, …, p, and let X_i be the set of training data points that converge to s_i.

Step 3: For each pair of stable equilibrium points s_i, s_j, i, j = 1, …, p, define the adjacency matrix A with elements as in equation (3.2.2), and assign the same cluster index to the stable equilibrium points in the same connected component of the graph induced by A.

Step 4: For each i = 1, …, p, assign that cluster label to all the training data points in X_i.
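For concreteness, the four steps can be wired together as below. This is our own schematic rather than the package interface; it relies on the helper functions (find_sep, reduce_seps, adjacency_matrix, sep_clusters, point_clusters) sketched in Section 4, and assumes the covariance function (2.2.3) with v1 = 0.

```python
import numpy as np

def gp_cluster(X, l, v0=1.0, v2=1.0, min_cov=0.99, max_var=None):
    """Schematic of the four-step GP clustering algorithm; the helpers
    are sketched in Section 4 below."""
    N = X.shape[0]
    kern = lambda a, b: v0 * np.exp(-0.5 * np.sum(l * (a - b) ** 2))
    C = np.array([[kern(a, b) for b in X] for a in X]) + v2 * np.eye(N)
    C_inv = np.linalg.inv(C)

    def sigma2(x):                      # predictive variance (2.2.2)
        k = np.array([kern(x, b) for b in X])
        return (v0 + v2) - k @ C_inv @ k

    # Step 1: cutting level r* (3.1.1), unless supplied explicitly.
    r_star = max_var if max_var is not None else max(sigma2(x) for x in X)
    # Step 2: one SEP per point, then merge numerically identical SEPs.
    seps = [find_sep(x, X, C_inv, l, v0) for x in X]
    reduced, index_map = reduce_seps(seps, kern, min_cov)
    # Step 3: adjacency matrix (3.2.2) and its connected components.
    A = adjacency_matrix(reduced, sigma2, r_star)
    labels = sep_clusters(A)
    # Step 4: every point inherits the cluster label of its SEP.
    return point_clusters(labels, index_map)
```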


4. Implementation of GPC Package

In this section, we describe our implementation of the Gaussian Process Clustering package, which contains the four basic steps of the GP clustering algorithm, three measures of clustering performance, and some visualization functions, including a PCA-based method [6] for visualizing high-dimensional data [7]. The package is implemented in Python and depends on SciPy [8], NumPy [9], and matplotlib [10].

The following three subsections explain the most important interfaces and methods of the package; more specific implementation details and usage can be found in our code.

4.1 Gaussian Process Clustering Algorithm

The Gaussian process clustering algorithm is implemented in the following four basic steps:

a. Construct the covariance function and covariance matrix:

The package uses the commonly used covariance function (2.2.3):

$$C(x_i, x_j) = v_0 \exp\Big\{ -\frac{1}{2} \sum_{m=1}^{d} l_m (x_i^m - x_j^m)^2 \Big\} + v_1 + \delta_{ij} v_2$$

The hyperparameters v_0, v_1, v_2, and l_m can be set in the package to construct a proper variance function. After the covariance function is constructed, we can calculate the covariance matrix with the method C = getCovarianceMatrix(data).

b. Compute stable equilibrium points

The function sep = getEquilibriumPoint(x, data, inv, ita, maxIteration) computes the SEP (stable equilibrium point) for each data point in the dataset. We use gradient descent to generate the trajectory of the dynamical system, so the convergence point of gradient descent is the stable equilibrium point that the generated trajectory approaches. Here, sep is the stable equilibrium point of the data point x, inv is the inverse of the covariance matrix C, and ita is the step size of gradient descent.

Because the computation of each SEP carries small numerical errors, the coordinates of SEPs that are supposed to be a single SEP are not exactly the same. The function reduceSEPs(seps, min_accepted_covariance) is designed to combine such SEPs into one; it returns a reduced SEP list and an index map from the data points to the reduced SEP list.


To check whether two SEPs are the same, we use the constructed covariance function C(x_i, x_j): the closer two SEPs are, the larger their covariance. The parameter min_accepted_covariance serves as the merging criterion: if the covariance of two points is larger than min_accepted_covariance, we treat them as one SEP.
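A condensed sketch of what getEquilibriumPoint and reduceSEPs do is given below. The signatures are simplified relative to the package, and the analytic gradient assumes the covariance function (2.2.3) with v1 = 0, for which ∇σ²(x) = −2 (∂k/∂x)ᵀ C⁻¹ k.

```python
import numpy as np

def find_sep(x, X, C_inv, l, v0=1.0, eta=0.05, max_iter=500, tol=1e-6):
    """Follow the flow dx/dt = -grad sigma^2(x) from (3.2.1) by gradient
    descent until it settles at a stable equilibrium point."""
    x = np.asarray(x, dtype=float).copy()
    for _ in range(max_iter):
        diff = x - X                                  # (N, d)
        k = v0 * np.exp(-0.5 * (diff ** 2 @ l))       # k_i(x), with v1 = 0
        w = C_inv @ k
        grad = 2.0 * l * ((w * k) @ diff)             # d sigma^2 / dx
        x_new = x - eta * grad
        if np.linalg.norm(x_new - x) < tol:
            break
        x = x_new
    return x

def reduce_seps(seps, kernel, min_accepted_covariance):
    """Merge SEPs that are numerically the same point: two SEPs are
    identified when their covariance exceeds the acceptance threshold.
    Returns the reduced SEP list and the point-to-SEP index map."""
    reduced, index_map = [], []
    for s in seps:
        for i, r in enumerate(reduced):
            if kernel(s, r) >= min_accepted_covariance:
                index_map.append(i)
                break
        else:
            index_map.append(len(reduced))
            reduced.append(s)
    return reduced, index_map
```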

c. Construct the adjacency matrix

The function getAdjacencyMatrix(sepList, maxVar, pointsNumPerDistanceUnit, data, invA) computes the adjacency matrix A. We pick some checkpoints between two arbitrary SEPs i and j in the SEP list; if the variances of all the checkpoints are smaller than the cutting level maxVar, we conclude that SEPs i and j are in the same cluster and set A[i][j] and A[j][i] to 1. Because all pairs of SEPs are checked, a large number of checkpoints per pair can become a computational bottleneck. However, if the number of checkpoints is too small, then for pairs of distant SEPs the checkpoints are spread too loosely and might miss points whose variances exceed the cutting level.

Our implementation compromises between these two concerns: the number of checkpoints between two SEPs is no longer fixed but varies with their geometric distance. The shorter the distance, the fewer checkpoints are needed to make the decision, which saves a lot of computation on unnecessary checkpoints, as the sketch below illustrates.
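The following sketch mirrors this scheme (the role of getAdjacencyMatrix in the package); the helper name and the density parameter points_per_unit are our own.

```python
import numpy as np

def adjacency_matrix(seps, sigma2, max_var, points_per_unit=10):
    """Equation (3.2.2): SEPs i and j are adjacent when the variance stays
    below the cutting level along the whole segment between them. The
    number of checkpoints grows with the length of the segment."""
    p = len(seps)
    A = np.eye(p, dtype=int)
    for i in range(p):
        for j in range(i + 1, p):
            dist = np.linalg.norm(seps[i] - seps[j])
            n_checks = max(2, int(dist * points_per_unit))
            segment = (seps[i] + t * (seps[j] - seps[i])
                       for t in np.linspace(0.0, 1.0, n_checks))
            if all(sigma2(y) <= max_var for y in segment):
                A[i, j] = A[j, i] = 1
    return A
```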

d. Assign cluster labels

The function getSEPsClusters(adjacencyMatrix, sepList) computes the clusters of SEPs from the adjacency matrix. Because all data points that approach the same SEP belong to the same cluster (see the lemmas in Section 3.2), we can then easily assign a cluster label to each data point, which is implemented by the function getPointClusters(sepsClusters, sepIndexMap).
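A minimal version of this labeling step (the roles of getSEPsClusters and getPointClusters) is a depth-first search over the graph induced by A; the function names here are our own.

```python
def sep_clusters(A):
    """Label SEPs by the connected components of the graph induced by A
    (step 3), found via a simple depth-first search."""
    p = len(A)
    labels, current = [-1] * p, 0
    for start in range(p):
        if labels[start] != -1:
            continue
        stack = [start]
        while stack:
            i = stack.pop()
            if labels[i] != -1:
                continue
            labels[i] = current
            stack.extend(j for j in range(p) if A[i][j] and labels[j] == -1)
        current += 1
    return labels

def point_clusters(sep_labels, sep_index_map):
    """Step 4: every data point inherits the cluster label of its SEP."""
    return [sep_labels[i] for i in sep_index_map]
```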

4.2 Measures for Clustering Performance

To measure the clustering performance, we implement three measures: the reference error rate RE, the cluster error rate CE, and the F score [2]. Here we define:

n_{r,c} – the number of data points that belong to reference cluster r and are assigned to cluster c by a clustering algorithm (n_{r,c} can be calculated via the function getCountNrc(references, clusters, referencesNum, clustersNum))

n – the total number of data points


n_r – the number of data points in the reference cluster r

n_c – the number of data points in the cluster c obtained by the clustering algorithm

a. Reference error rate:

$$\mathrm{RE} = 1 - \frac{\sum_r \max_c n_{r,c}}{n} \qquad (4.2.1)$$

The RE can be calculated via the function getRE(references, clusters, referencesNum, clustersNum, Nrc).

b. Cluster error rate:

$$\mathrm{CE} = 1 - \frac{\sum_c \max_r n_{r,c}}{n} \qquad (4.2.2)$$

The CE can be calculated via the function getCE(references, clusters, referencesNum, clustersNum, Nrc).

c. F score:

The F score of reference cluster r and cluster c is defined as

$$F_{r,c} = \frac{2\,R_{r,c}\,P_{r,c}}{R_{r,c} + P_{r,c}}, \qquad R_{r,c} = \frac{n_{r,c}}{n_c} \ \text{(recall)}, \qquad P_{r,c} = \frac{n_{r,c}}{n_r} \ \text{(precision)}$$

The F score of reference cluster r is the maximum over all clusters, F_r = max_c F_{r,c}, and the overall F score is

$$F = \sum_r \frac{n_r}{n} F_r \qquad (4.2.3)$$

In general, the higher the F score (at most 1), the better the clustering result. The F score of a clustering result can be obtained with the package function getFScore(references, clusters, referencesNum, clustersNum, Nrc).
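A compact NumPy version of the three measures is sketched below. Unlike the package functions, which take precomputed counts, this sketch builds the contingency table n_{r,c} directly from the two label vectors; it follows the recall/precision convention stated above.

```python
import numpy as np

def contingency(references, clusters):
    """n_{r,c}: points in reference cluster r assigned to cluster c."""
    refs, cls = np.unique(references), np.unique(clusters)
    N = np.zeros((len(refs), len(cls)), dtype=int)
    for r, c in zip(references, clusters):
        N[np.searchsorted(refs, r), np.searchsorted(cls, c)] += 1
    return N

def re_ce_fscore(references, clusters):
    N = contingency(references, clusters)
    n = N.sum()
    RE = 1.0 - N.max(axis=1).sum() / n           # (4.2.1)
    CE = 1.0 - N.max(axis=0).sum() / n           # (4.2.2)
    # F score (4.2.3): R_{r,c} = n_rc / n_c, P_{r,c} = n_rc / n_r,
    # per the convention in the text above.
    R = N / N.sum(axis=0, keepdims=True)
    P = N / N.sum(axis=1, keepdims=True)
    with np.errstate(divide="ignore", invalid="ignore"):
        F = np.where(R + P > 0, 2 * R * P / (R + P), 0.0)
    Fscore = (N.sum(axis=1) / n * F.max(axis=1)).sum()
    return RE, CE, Fscore
```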

4.3 Visualization

To visualize the clustering result, we implement functions that plot the data points and their cluster labels in two-dimensional coordinates, such as plotScatter(X,Y,title,c) and plotCluster(X,Y,clusters,title). For higher-dimensional datasets, we implement a PCA [6] function pca(data,nRedDim), so that we can project high-dimensional data onto the two most important principal components and plot the data points and their cluster labels in the projected coordinates [7].
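The projection step of pca(data, nRedDim) can be sketched in a few lines via the SVD of the centered data matrix (the function name below is our own):

```python
import numpy as np

def pca_project(data, n_red_dim=2):
    """Project data onto its first n_red_dim principal components,
    using the SVD of the centered data matrix."""
    centered = data - data.mean(axis=0)
    U, S, Vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ Vt[:n_red_dim].T
```

The resulting two columns can then be handed to the plotCluster-style scatter functions above.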

5. Test & Evaluation

In this section, we test the Gaussian process clustering algorithm using our implemented package, evaluate its clustering performance, and compare it with a reference clustering algorithm, K-Means [11].


5.1 Dataset

The datasets used for algorithm testing come from the clustering datasets of the University of Eastern Finland, Joensuu [12]: two two-dimensional datasets, R15 and Spiral (see Figure 3), and one four-dimensional dataset, Iris. All these datasets come with reference cluster labels, so we can evaluate our clustering results with the reference error rate, cluster error rate, and F score.

Figure 3 R15 and Spiral Shape Sets

5.2 GP Clustering Algorithm Testing

5.2.1 Test with R15 Dataset

To obtain the clustering result of the Gaussian process clustering algorithm, we once more follow the four basic steps introduced in the previous sections.

a. Construct the variance function and covariance matrix:

To construct the default covariance function in our package, we need to determine the values of the hyperparameters {v_0, v_1, v_2, l_m}. Usually we set v_0 and v_2 to 1 and v_1 to 0, and the parameters l_m, m ∈ {1, …, d}, can be regarded as per-dimension weights in the covariance function. For the R15 dataset, the data points have similar variance in both dimensions, so we set the same value α for both l_1 and l_2. Different values of α, however, affect the variance of each data point differently: Figure 4 shows the heat map of the variance for different values of α.


Figure 4 Heat map of the variance for different values of α

As we can see, the value of α determines the distribution of the Gaussian process variance over the data points. The higher the value of α, the sharper and rougher the variance surface; conversely, the lower the value of α, the smoother and vaguer it becomes. What we want is a distribution that is neither too sharp nor too smooth, so that we can easily determine a cutting level that yields a good clustering result. In this case we therefore set α = 1 and choose a cutting level between 0.5 and 0.6.
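A heat map like Figure 4 can be reproduced by evaluating the variance function on a grid over the data range. The sketch below is our own and reuses predictive_variance from Section 2.2, with X holding the two-dimensional R15 points and l = α · (1, 1):

```python
import numpy as np
import matplotlib.pyplot as plt

alpha = 1.0
l = alpha * np.ones(2)

# Evaluate sigma^2 on a 100 x 100 grid spanning the data range.
xs = np.linspace(X[:, 0].min() - 1, X[:, 0].max() + 1, 100)
ys = np.linspace(X[:, 1].min() - 1, X[:, 1].max() + 1, 100)
Z = np.array([[predictive_variance(np.array([x, y]), X, l) for x in xs]
              for y in ys])

plt.imshow(Z, origin="lower", extent=(xs[0], xs[-1], ys[0], ys[-1]))
plt.colorbar(label="predictive variance")
plt.scatter(X[:, 0], X[:, 1], s=5, c="white")
plt.title(f"GP variance heat map (alpha = {alpha})")
plt.show()
```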

After the covariance function is determined, we can easily compute the covariance matrix using the method getCovarianceMatrix in the package.

b. Compute stable equilibrium points

The stable equilibrium point of each data point can be calculated, and duplicates combined, by the corresponding functions in the package. The following figure shows the scatter plot of the reduced SEPs:

Figure 5 Reduced SEPs Scatter Plot


As we can see, the original 600 data points have been reduced to 98 SEPs. According to the lemmas above, each data point approaches one SEP, and all the data points that belong to one SEP lie in the same cluster.

c. Construct the adjacency matrix

To construct the adjacency matrix, we need to determine the cutting level maxVar; different cutting levels lead to different clustering results, which we present at the end of this subsection. Here we simply set it to 0.52. After the adjacency matrix is constructed, we can obtain the clusters of SEPs; the following figure shows the scatter plot of the clustered SEPs.

Figure 6 Scatter plot of the clustered SEPs

d. Assign cluster labels

Here we simply assign the cluster label of each SEP to its corresponding data points. The clustering result is shown in the following figure, where each cluster is marked by a different color.

Figure 7 Clustering result (scatter plot, GP clustering, maxVar = 0.52)


As mentioned before, different cutting levels lead to different clustering results, and more precisely to different numbers of clusters.

Figure 8 Clustering results with different cutting levels

Figure 9 Relationship between the number of clusters and the cutting level

Figure 8 shows the clustering results for three different cutting levels; the data points in the middle receive a completely different cluster assignment. Figure 9 shows that the higher the cutting level, the more clusters appear in the clustering result.


5.2.2 Test with Spiral Dataset

The Spiral dataset is quite similar to the R15 dataset but has a different shape: the data points of one cluster are linked closely one by one and stretch out as a spiral, instead of staying together in a small group. The testing procedure is likewise similar to that for the R15 dataset, so we do not go into the details again. The following figure shows the clustering result for the Spiral dataset; as we can see, the data is perfectly clustered by Gaussian process clustering.

Figure 10 Scatter plot (GP clustering, maxVar = 0.99)

5.2.3 Test with Iris Dataset

The Iris dataset has four dimensions. In order to visualize its distribution, we use principal component analysis and project all data points onto the two most important principal components.

Figure 11 PCA: first and second principal components


The figure above shows the distribution of the data points projected onto the first and second principal components; the three colors correspond to the three reference cluster labels. To obtain a clustering result with Gaussian process clustering, we follow the same procedure as above. One difference is that we have to set different parameters for the different dimensions in the covariance function. Finding proper hyperparameters also becomes more difficult, because more dimensions mean more parameters to determine.

Figure 12 GP clustering result: first and second principal components

Figure 12 shows the clustering result for the Iris dataset. Compared to the reference clusters shown in Figure 11, the data points of the red reference cluster are clustered well. Although some points of the green and blue reference clusters are clustered incorrectly, the overall clustering result for the Iris dataset is good.

5.3 Evaluation

To evaluate the clustering results, we use the three measures implemented in our package: the reference error rate, the cluster error rate, and the F score. Table 1 shows the evaluation of the clustering result for the Iris dataset.

Table 1 Evaluation result for the Iris dataset

                 CE      RE      F Score
GP Clustering    0.027   0.233   0.853

As we can see, although the reference error rate is above 0.2, the cluster error rate is quite small, and the F score indicates good overall clustering performance. This means the Gaussian process clustering algorithm works well for the Iris dataset.


To put the performance of Gaussian process clustering into perspective, we implement a simple two-dimensional K-Means clustering algorithm as a reference. We run the same clustering tests on the R15 and Spiral datasets with K-Means, and then evaluate the clustering results of both Gaussian process clustering and K-Means with the three measures; the results are presented in the following figures.
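Such a reference run is easy to reproduce. The snippet below uses scikit-learn's K-Means as a stand-in for our simple implementation; reference_labels is assumed to hold the dataset's ground-truth labels, and re_ce_fscore is the measures sketch from Section 4.2.

```python
from sklearn.cluster import KMeans

# K-Means needs the number of clusters up front (K = 15 for R15),
# unlike GP clustering, where the cutting level drives the cluster count.
labels_km = KMeans(n_clusters=15, n_init=10, random_state=0).fit_predict(X)
RE, CE, F = re_ce_fscore(reference_labels, labels_km)
print(f"K-Means on R15: RE={RE:.3f}, CE={CE:.3f}, F={F:.3f}")
```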

Figure 13 R15 Dataset

Figure 14 Spiral Dataset


As we can see in Figure 13, both Gaussian process clustering and K-Means achieve very good clustering performance on the R15 dataset. Comparing the F scores, K-Means even has a slightly better result. This does not mean, however, that K-Means performs much better than Gaussian process clustering: the good result of K-Means rests on precise prior knowledge of the number of clusters (K = 15), whereas in real applications we usually do not know how many clusters there should be. In contrast, Gaussian process clustering offers a much more flexible strategy for changing the number of clusters by modifying the cutting level.

In Figure 14, we see a perfect clustering result for Gaussian process clustering on the Spiral dataset, whereas the result for K-Means is poor, even though the precise prior knowledge of K is given. This demonstrates another advantage of Gaussian process clustering: its ability to detect clusters with arbitrarily complex shapes.

6. A Clustering Application for location-based Data

In this section, we briefly present an application of Gaussian process clustering to location-based data.

The basic idea of this application is to cluster schools based on their locations. Schools, as a kind of basic infrastructure in a city, are distributed according to the distribution of the population, and more specifically according to the distribution of city blocks or regions. These blocks and regions may have complex shapes depending on the terrain. What we want is to discover such blocks or districts from the clusters of schools.

The dataset is the Location of ACT (Australian Capital Territory) Schools [13], which contains 132 schools in Canberra. The distribution of the schools in the city can be seen in Figure 15. In the dataset, each school has a post code in addition to its location. Since post codes are assigned according to administrative blocks or regions, we expect that the schools of one cluster have the same or similar post codes.


Figure 15 Distribution of ACT Schools

Figure 16 shows the clustering results for different cutting levels. Again, different cutting levels lead to different clustering results, which gives us a flexible strategy for determining the number of clusters and the average cluster size. In this case, different cluster sizes have different practical meanings: the clusters in the left subfigure (maxVar = 0.1) are close to city blocks, the clusters in the middle subfigure (maxVar = 0.2) are more like super blocks, and the clusters in the right subfigure (maxVar = 0.3) can be regarded as city regions. Moreover, as these subfigures show, Gaussian process clustering can detect complex cluster shapes: schools that lie close to each other are assigned to one cluster, no matter whether the terrain is narrow and long, wide and short, or even a circle around a lake.


Figure 16 Clustering result for the ACT schools with different cutting levels

7. Conclusion

In this project, we developed an understanding of the idea of clustering with Gaussian process models, following the work of Hyun-Chul Kim and Jaewook Lee [2]. Based on that, we implemented a Gaussian Process Clustering package in Python and performed clustering tests on different datasets.

In Gaussian process clustering, the variance function is used to construct a set of contours that enclose the data points and correspond to cluster boundaries. A dynamical process associated with the variance function is built and applied to label the data points. The results of our clustering tests show that the Gaussian process clustering algorithm is able to detect clusters with arbitrarily complex shapes, as well as high-dimensional clusters. Moreover, Gaussian process clustering provides a flexible strategy for changing the number of clusters and the average cluster size by modifying the cutting level, which enables us to cluster data sets with overlapping clusters and to control the number of clusters.

In our clustering tests, we had to set proper hyperparameters to obtain good clustering results. As the tests showed, the covariance function controls the distribution of the Gaussian process variance, while the cutting level controls the number of clusters and the average cluster size, so estimating proper hyperparameters is important. However, finding the optimal hyperparameters is not easy, and the complexity of determining them increases as the dimension of the dataset grows. Developing a robust strategy for setting proper hyperparameters is therefore one direction of future work on the Gaussian process clustering algorithm. Furthermore, the algorithm is computationally expensive; strategies to speed it up could be developed in the future, so that the algorithm not only delivers good clustering performance but is also fast and efficient.

8. Acknowledgements

This project is one of the projects in Machine Learning and Artificial Intelligence WS 2013/14 supported by the KI Lab [14] and Professor Dr. Manfred Opper. Special thanks to our supervisors Florian Stimberg and Andreas Ruttor for their support and help during this project.


9. References

[1] C. E. Rasmussen and C. K. I. Williams, Gaussian Processes for Machine Learning. MIT Press, 2006.

[2] H.-C. Kim and J. Lee, "Clustering based on Gaussian processes," Neural Comput., vol. 19, no. 11, pp. 3088–3107, Nov. 2007.

[3] C. E. Rasmussen, "Gaussian Processes in Machine Learning," in Advanced Lectures on Machine Learning, Springer, 2004.

[4] C. E. Rasmussen, "Evaluation of Gaussian Processes and Other Methods for Non-Linear Regression," PhD thesis, University of Toronto, 1996.

[5] H.-C. Kim and J. Lee, "Pseudo-density Estimation for Clustering with Gaussian Processes," in Advances in Neural Networks – ISNN 2006, vol. 3971 of Lecture Notes in Computer Science, J. Wang, Z. Yi, J. Zurada, B.-L. Lu, and H. Yin, Eds. Springer Berlin Heidelberg, 2006, pp. 1238–1243.

[6] M. Ringnér, "What is principal component analysis?," Nat. Biotechnol., vol. 26, no. 3, pp. 303–304, Mar. 2008.

[7] G. Grinstein, M. Trutschl, and U. Cvek, "High-Dimensional Visualizations."

[8] "SciPy." [Online]. Available: http://www.scipy.org/. [Accessed: 17-Apr-2014].

[9] "NumPy." [Online]. Available: http://www.numpy.org/. [Accessed: 17-Apr-2014].

[10] "matplotlib: python plotting." [Online]. Available: http://matplotlib.org/. [Accessed: 17-Apr-2014].

[11] T. Kanungo, D. M. Mount, N. S. Netanyahu, C. D. Piatko, R. Silverman, and A. Y. Wu, "An efficient k-means clustering algorithm: analysis and implementation," IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, no. 7, pp. 881–892, Jul. 2002.

[12] "Clustering datasets," University of Eastern Finland. [Online]. Available: http://cs.joensuu.fi/sipu/datasets/. [Accessed: 12-Apr-2014].

[13] "ACT School Locations," ACT Government. [Online]. Available: https://www.data.act.gov.au/Education/ACT-School-Locations/q8rt-q8cy. [Accessed: 13-Apr-2014].

[14] "Methoden der Künstlichen Intelligenz." [Online]. Available: http://www.ki.tu-berlin.de/menue/methoden_der_kuenstlichen_intelligenz/. [Accessed: 13-Apr-2014].