Cluto – Clustering toolkit by G. Karypis, UMN Andrea Tagarelli Univ. of Calabria, Italy.


What is CLUTO?

• CLUstering Toolkit for very large, high-dimensional, and sparse datasets
  http://glaros.dtc.umn.edu/gkhome/cluto/cluto/download

• Main characteristics
  • Seeks to optimize a particular clustering criterion function
  • Identifies the features that best describe and discriminate each cluster
  • Allows visual examination of the relations between clusters, objects, and features
  • Handles sparsity and requires memory roughly linear in the input size

• Analysis goals
  • To understand the relations between the objects assigned to each cluster and the relations between the different clusters
  • To visualize the discovered clustering solution

• Distributions
  • Stand-alone programs (vcluster and scluster)
  • Library through which an application program can access the CLUTO algorithms



Clustering algorithms

• Programs:
  • vcluster: takes as input a multidimensional representation of the objects to be clustered
  • scluster: takes as input the object similarity graph

• Parameter: -clmethod=string
  • Partitional
    • Direct k-way clustering (direct)
    • Bisecting k-way clustering (rb, rbr)
  • Agglomerative hierarchical (agglo)
  • Partitional-based agglomerative hierarchical (bagglo)
  • Graph-partitioning-based (graph)

Usage

• MatrixFile: the file that stores the objects to be clustered
• GraphFile: the file that stores the adjacency matrix of the object similarity graph
• NClusters: the number of clusters
• Optional parameters
  • specified using -paramname or -paramname=value
  • categorized into three groups:
    1. control various aspects of the clustering algorithm
    2. control the type of analysis and reporting that is performed on the computed clusters
    3. control the visualization of the clusters
• The output clustering solution is stored in a file named File.clustering.NClusters, where File is the name of the input file (MatrixFile or GraphFile)

vcluster [optional parameters] MatrixFile NClusters
scluster [optional parameters] GraphFile NClusters
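As an illustration of the command-line shape above, here is a minimal Python sketch that assembles the argument list for a run and derives the name of the solution file. The helper names and the input file docs.mat are hypothetical; the only options used (-clmethod, -showfeatures) are the ones named in these slides. The sketch only builds the command and does not run CLUTO.

```python
def build_cluto_command(program, input_file, n_clusters, **options):
    """Return the argument list for a vcluster or scluster run."""
    if program not in ("vcluster", "scluster"):
        raise ValueError("program must be 'vcluster' or 'scluster'")
    args = [program]
    for name, value in options.items():
        # Options without a value (e.g. -showtree) are passed as True.
        args.append(f"-{name}" if value is True else f"-{name}={value}")
    # The input file and the number of clusters always come last.
    args += [input_file, str(n_clusters)]
    return args


def clustering_file_name(input_file, n_clusters):
    """Name of the file that holds the clustering solution."""
    return f"{input_file}.clustering.{n_clusters}"


# docs.mat is a hypothetical matrix file used only for illustration.
cmd = build_cluto_command("vcluster", "docs.mat", 10,
                          clmethod="rbr", showfeatures=True)
print(" ".join(cmd))
print(clustering_file_name("docs.mat", 10))
```

The resulting list could be handed to subprocess.run on a machine where the CLUTO binaries are installed.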

Input file format: matrix file

• Plain text with n+1 lines storing the data matrix for n m-dimensional objects

• Dense format
  • Metadata (in the first line): #rows, #columns
  • Each remaining line contains space-separated float values

• Sparse format
  • Metadata (in the first line): #rows, #columns, #nonzero entries
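The two layouts can be sketched with a pair of small Python writers. The dense writer follows the description above directly. The slide does not spell out the body of the sparse format, so the sparse writer assumes that each row is stored as "column value" pairs with columns numbered from 1; treat that as an assumption to check against the CLUTO manual. Function names are hypothetical.

```python
def to_dense_matrix(rows):
    """Dense format: '#rows #columns' header, then one line of values per object."""
    n_cols = len(rows[0])
    lines = [f"{len(rows)} {n_cols}"]
    for row in rows:
        lines.append(" ".join(f"{v:g}" for v in row))
    return "\n".join(lines) + "\n"


def to_sparse_matrix(rows):
    """Sparse format: '#rows #columns #nonzeros' header, then each row's nonzeros.

    Assumption: a row is written as 'column value' pairs, columns starting at 1.
    """
    n_cols = len(rows[0])
    nnz = sum(1 for row in rows for v in row if v != 0)
    lines = [f"{len(rows)} {n_cols} {nnz}"]
    for row in rows:
        pairs = [f"{j} {v:g}" for j, v in enumerate(row, start=1) if v != 0]
        lines.append(" ".join(pairs))
    return "\n".join(lines) + "\n"


# Two 3-dimensional objects, used only for illustration.
data = [[1.0, 0.0, 2.5], [0.0, 0.0, 3.0]]
print(to_dense_matrix(data), end="")
print(to_sparse_matrix(data), end="")
```

Both functions produce n+1 lines for n objects, matching the line count stated above.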

Input file format: graph file

• Plain text with n+1 lines storing the adjacency matrix of the graph that specifies the similarity between the n objects

• Dense format:
  • Metadata (in the first line): #vertices (n)
  • Each of the remaining n lines stores n space-separated floating-point values, such that the ith value corresponds to the similarity to the ith vertex of the graph

• Sparse format:
  • Metadata (in the first line): #vertices (n) and #edges
  • Each of the remaining n lines lists the vertices adjacent to one vertex: the index of each adjacent vertex followed by the similarity of the corresponding edge
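A dense graph file can be produced from object vectors once a similarity measure is chosen. The sketch below uses cosine similarity, which is an assumption made for illustration (the slide does not say how the similarities are obtained), and the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def to_dense_graph(objects):
    """Dense graph format: '#vertices' header, then n lines of n similarities."""
    n = len(objects)
    lines = [str(n)]
    for i in range(n):
        sims = [cosine(objects[i], objects[j]) for j in range(n)]
        lines.append(" ".join(f"{s:.4f}" for s in sims))
    return "\n".join(lines) + "\n"


# Three 2-dimensional objects, used only for illustration.
print(to_dense_graph([[1, 0], [0, 1], [1, 1]]), end="")
```

The output has n+1 lines, and the matrix is symmetric because cosine similarity is symmetric.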

Input file format: labels

• Row label file:
  • Stores the label for each of the rows of the matrix (objects)
  • -rlabelfile param

• Column label file:
  • Stores the label for each of the columns of the matrix (attributes)
  • -clabelfile param

• Row class label file:
  • Stores the class label for each of the rows of the matrix (objects)
  • -rclassfile param

Output file format

• Clustering solution file
  • n lines, with a single number per line
  • ith line contains the cluster number that the ith object/row/vertex belongs to
  • Cluster numbers run from zero to the number of clusters minus one
  • If -zscores is specified, each line of this file contains two additional numbers right after the cluster number
    • internal z-score, external z-score

• Tree file
  • produced by performing AHC on top of a k-way clustering solution
  • stored into a file in the form of a parent array:
    • 2k-1 lines such that the ith line contains the parent of the ith node of the tree
    • In the case of the root node, which is stored in the last line of the file, the parent is set to -1.
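Both output files are simple enough to read back with a few lines of Python. The sketch below parses a clustering solution (with or without the z-score columns) and turns a parent array into a root plus a children map. It assumes nodes are numbered from 0 in file order, which the slide does not state explicitly; function names are hypothetical.

```python
from collections import defaultdict


def parse_clustering(text, with_zscores=False):
    """One entry per object: cluster id, or (cluster id, internal z, external z)."""
    result = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        cluster = int(fields[0])
        if with_zscores:
            result.append((cluster, float(fields[1]), float(fields[2])))
        else:
            result.append(cluster)
    return result


def parse_tree(text):
    """Turn a parent array into (root, children), children mapping node -> child list.

    Assumption: the ith line (counting from 0) describes node i.
    """
    parents = [int(token) for token in text.split()]
    children = defaultdict(list)
    root = None
    for node, parent in enumerate(parents):
        if parent == -1:
            root = node  # the root is the only node without a parent
        else:
            children[parent].append(node)
    return root, dict(children)


# A 3-way solution has 2*3-1 = 5 tree nodes; the values are illustrative only.
print(parse_clustering("0\n2\n1\n"))
print(parse_tree("3\n3\n4\n4\n-1\n"))
```

In the illustrative tree, clusters 0 and 1 are merged into node 3, which is then merged with cluster 2 into the root, node 4.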

Output example

• Matrix/Graph information
• Settings
• Clustering/Clusters quality statistics
• Timing information

Internal clustering quality

External clustering quality

• Comparison with reference classification (via -rclassfile)

• Overall entropy and purity
• For each cluster
  • Local entropy and purity
  • Object distribution over the classes
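The slide does not give the formulas, so the following Python sketch uses the common definitions as an assumption: the entropy of a cluster is the entropy of its class distribution divided by the log of the number of classes, its purity is the fraction of objects in its largest class, and the overall values are averages weighted by cluster size. The function name is hypothetical.

```python
import math
from collections import Counter, defaultdict


def external_quality(clusters, classes):
    """Return (overall entropy, overall purity, per-cluster report).

    clusters[i] and classes[i] are the cluster id and class label of object i.
    The report maps cluster id -> (entropy, purity, class distribution).
    """
    n = len(clusters)
    q = len(set(classes))  # number of reference classes
    by_cluster = defaultdict(Counter)
    for cluster, label in zip(clusters, classes):
        by_cluster[cluster][label] += 1

    report = {}
    total_entropy = total_purity = 0.0
    for cluster, counts in sorted(by_cluster.items()):
        size = sum(counts.values())
        # sum of p * log(1/p) over the classes present in the cluster
        entropy = sum((m / size) * math.log(size / m) for m in counts.values())
        if q > 1:
            entropy /= math.log(q)  # normalize to the range [0, 1]
        purity = max(counts.values()) / size
        report[cluster] = (entropy, purity, dict(counts))
        # overall values are weighted by cluster size
        total_entropy += size / n * entropy
        total_purity += size / n * purity
    return total_entropy, total_purity, report


# Six objects, two clusters, two classes; values are illustrative only.
e, p, per_cluster = external_quality([0, 0, 0, 1, 1, 1],
                                     ["a", "a", "b", "b", "b", "b"])
print(round(e, 4), round(p, 4))
```

Lower entropy and higher purity both indicate clusters that agree better with the reference classes; a cluster containing a single class has entropy 0 and purity 1.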

Cluster description

• Determine the best set of descriptive and discriminating features for each cluster (via -showfeatures)

• For each cluster
  • Top-L most descriptive features, with % of the within-cluster similarity
  • Top-L most discriminating features, with % of the dissimilarity between the cluster and the rest of the objects

Cluster tree (1/2)

• via -showtree
• Displayed in a rotated fashion
  • First column as the root, the tree grows from left to right
  • The leaves are the clusters, numbered from 0 to NClusters-1; the internal nodes are numbered from NClusters to 2*NClusters-2
• If -rclassfile is specified:
  • prints information about how the objects of the various classes are distributed in each cluster

Cluster tree (2/2)

• via -showtree and -labeltree
• Further statistics on each of the clusters
  • Size
  • Isim
  • Xsim: average similarity between the objects of each pair of clusters that are children of the same node of the tree
  • Gain: change in the value of a particular clustering criterion function

Cluster visualization
