Automated Clustering Project - 12th CONTECSI 34th WCARS


Automated Clustering Project Miklos Vasarhelyi, Paul Byrnes, and Yunsen Wang

Presented by Deniz Appelbaum

Motivation

The motivation is the development of a program that automatically performs clustering and outlier detection for a wide variety of numerically represented data.

Outline of program features

Normalizes all data to be clustered

Creates normalized principal components from the normalized data

Automatically selects the necessary normalized principal components for use in actual clustering and outlier detection

Compares a variety of algorithms based upon the selected set of normalized principal components

Adopts the top-performing model, based upon silhouette coefficient values, to perform the final clustering and outlier detection procedures

Produces relevant information and outputs throughout the process

Data normalization

Converts each numerically represented dimension to be clustered into the range [0,1].

A desirable procedure for preparing numeric attributes for clustering
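A minimal sketch of this min-max scaling, assuming a NumPy implementation (the transcript does not state the program's actual language or tooling):

```python
import numpy as np

def min_max_normalize(X):
    """Rescale each numeric column of X into the range [0, 1]."""
    X = np.asarray(X, dtype=float)
    col_min = X.min(axis=0)
    col_range = X.max(axis=0) - col_min
    col_range[col_range == 0] = 1.0  # constant columns map to 0 instead of dividing by zero
    return (X - col_min) / col_range
```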

Principal component analysis

Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components.

In this way, PCA can both reduce dimensionality and eliminate the inherent problems associated with clustering data whose attributes are correlated.

In the following slides, a random sample of 5,000 credit card customers is used to demonstrate the automated clustering and outlier detection program
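One way the transformation could be performed, sketched with scikit-learn's PCA on stand-in data (this is illustrative, not the authors' actual code):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_norm = rng.random((5000, 4))       # stand-in for the normalized credit card sample

pca = PCA()                          # keep all components at first
scores = pca.fit_transform(X_norm)   # principal component scores per record
print(pca.explained_variance_ratio_) # share of data variability per component
```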

Principal component analysis

PCA initially results in four principal components being generated from the original data

Using a cumulative data variability threshold of 80% (the default specification), three principal components are automatically selected for analysis; together they explain the vast majority of the data's variability.
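Continuing the sketch above, the automatic selection against the 80% cumulative-variability default might look like this:

```python
import numpy as np

threshold = 0.80                                   # default cumulative-variability specification
cumvar = np.cumsum(pca.explained_variance_ratio_)  # cumulative share of variability
n_keep = int(np.searchsorted(cumvar, threshold)) + 1
selected = scores[:, :n_keep]                      # e.g., 3 of 4 components for this data set
```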

Principal component analysis

Scatter plot of PC1 and PC2

In this view, the top two principal components are plotted for each object in two-dimensional space.

As can be seen, a small subset of records appears significantly more distant from the vast majority of objects.

Clustering exploration/simulation process - examples

Ward method: Ward suggested a general agglomerative hierarchical clustering procedure, where the criterion for choosing the pair of clusters to merge at each step is based on the optimal value of an objective function.

Complete link method: This method is also known as farthest-neighbor clustering. The result of the clustering can be visualized as a dendrogram, which shows the sequence of cluster fusions and the distance at which each fusion took place.

PAM (partitioning around medoids): The k-medoids algorithm is a clustering method related to the k-means algorithm and the medoid shift algorithm. It is considered more stable than k-means because it uses medoids (actual data points) rather than means as cluster centers.

K-means: k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, which serves as a prototype of the cluster.

Clustering exploration results

The result shown below is based upon a simulation exercise, whereby all four algorithms are automatically compared on the data set (i.e., a random sample of 5,000 records from the credit card customer data). In this particular case, the best model is found to be a two-cluster solution using the complete link hierarchical method. This is the final model and is used for subsequent clustering and outlier detection.
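A simplified sketch of such a simulation, comparing candidate models by silhouette coefficient (scikit-learn and the `selected` matrix from the earlier sketches are assumptions; PAM is only noted in a comment since it is not part of scikit-learn proper):

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.metrics import silhouette_score

def best_clustering(X, k_range=range(2, 7)):
    """Try several algorithms and cluster counts; keep the highest-silhouette model."""
    candidates = {
        "ward hierarchical": lambda k: AgglomerativeClustering(n_clusters=k, linkage="ward"),
        "complete link hierarchical": lambda k: AgglomerativeClustering(n_clusters=k, linkage="complete"),
        "k-means": lambda k: KMeans(n_clusters=k, n_init=10, random_state=0),
        # PAM could be added analogously, e.g. KMedoids from scikit-learn-extra
    }
    best = ("", 0, -1.0)  # (method, number of clusters, silhouette value)
    for name, make in candidates.items():
        for k in k_range:
            labels = make(k).fit_predict(X)
            score = silhouette_score(X, labels)
            if score > best[2]:
                best = (name, k, score)
    return best

method, k, sil = best_clustering(selected)
print(f"Best Method: {method}  Clusters: {k}  Silhouette: {sil:.4f}")
```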

Best clustering result:

The silhouette value can theoretically range from -1 to +1, with higher values indicative of better cluster quality in terms of both cohesion and separation.

Best Method                  Number of Clusters   Silhouette Value
complete link hierarchical   2                    0.753754205720575

Complete-link hierarchical clustering (1/2)

The dendrogram places the 5,000 instances along the x-axis. Moving vertically up from the x-axis, one can see how the clusters are formed.
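A dendrogram like the one described can be produced with SciPy and matplotlib; this is an illustrative sketch continuing the earlier code, not the program's own plotting routine:

```python
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

Z = linkage(selected, method="complete")  # complete-link merge tree
dendrogram(Z, no_labels=True)             # 5,000 leaves along the x-axis
plt.ylabel("fusion distance")
plt.show()
```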

Plot of PCs with cluster assignment labels (1/3)

In this view, the top two principal components (i.e., PC1 and PC2) are plotted for each object in two-dimensional space.

In the graph, there are two clusters, one dark blue and the other light blue.

A small subset of three records appears substantially different from the majority of objects.

Plot of PCs with cluster assignment labels (2/3)

In this view, PC1 and PC3 are plotted for each object in two-dimensional space.

In the graph, the two clusters are again shown.

It is once again evident that the small subset of three records differs markedly from the other objects.

Plot of PCs with cluster assignment labels (3/3)

In this view, PC2 and PC3 are plotted for each object in two-dimensional space.

Cluster differences appear less prominent from this perspective.

Principal components 3D scatterplot

Cluster one represents the majority class (black) while cluster two represents the rare class (red).

In this view, one can clearly see the subset of three records (in red) appearing more isolated from the other objects.

Cluster 1 outlier plot

In this view, an arbitrary cutoff is inserted at the 99.9th percentile (red horizontal line) to allow efficient identification of very irregular records.

Objects farther from the x-axis are more questionable.

While all objects distant from the x-axis might be worth investigating, points above the cutoff should be viewed as particularly suspicious.
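A sketch of the distance computation behind such a plot, assuming the Mahalanobis distance is measured from each cluster's centroid (consistent with the md field in the output file below) and continuing the earlier code:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

labels = AgglomerativeClustering(n_clusters=2, linkage="complete").fit_predict(selected)

def mahalanobis_distances(X):
    """Mahalanobis distance of each record from the centroid of X."""
    diff = X - X.mean(axis=0)
    inv_cov = np.linalg.pinv(np.cov(X, rowvar=False))  # pinv guards against singular covariance
    return np.sqrt(np.einsum("ij,jk,ik->i", diff, inv_cov, diff))

md = mahalanobis_distances(selected[labels == 0])  # distances within cluster 1
cutoff = np.percentile(md, 99.9)                   # the 99.9th-percentile red line
suspicious = np.flatnonzero(md > cutoff)           # records above the cutoff
```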

Conclusion of Process

At the conclusion of outlier detection, an output file for each cluster, containing the unique record identifier, original variables, normalized variables, principal components, normalized principal components, cluster assignments, and Mahalanobis distance information, can be exported to facilitate further analyses and investigations.
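A hypothetical assembly of such a per-cluster output file with pandas, reusing names from the sketches above (the real schema would carry all the fields listed):

```python
import numpy as np
import pandas as pd

record_ids = np.arange(len(selected))   # stand-in for the unique record identifier
mask = labels == 0                      # cluster 1 records
out = pd.DataFrame({
    "Record": record_ids[mask],
    "model.cluster": 1,
    "md": md,                           # Mahalanobis distances from the sketch above
})
# Original variables, normalized variables, and (normalized) principal
# components would be joined as additional columns before export.
out.to_csv("cluster1_output.csv", index=False)
```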

Cluster 2 – final output file of a subset of fields:

Distinguishing features of cluster 2 records: 1) New accounts (age = 1 month), 2) Very high incidence of late payments, and 3) Relatively high credit limits, particularly given the account age and late payment issues.

Record   AccountAge   CreditLimit   AdditionalAssets   LatePayments   model.cluster   md
32430    1            2500          1                  3              2               5.83E-05
65470    1            8500          1                  4              2               0.002371778
78772    1            2200          0                  3              2               0.000442305