Cluto – Clustering toolkit by G. Karypis, UMN
Andrea Tagarelli, Univ. of Calabria, Italy
What is CLUTO?
• CLUstering TOolkit for very large, high-dimensional, and sparse datasets
  http://glaros.dtc.umn.edu/gkhome/cluto/cluto/download
• Main characteristics
  • Seeks to optimize a particular clustering criterion function
  • Identifies the features that best describe and discriminate each cluster
  • Allows for visually examining relations between clusters, objects, and features
  • Handles sparsity and requires memory roughly linear in the input size
• Analysis goals
  • To understand relations between the objects assigned to each cluster and relations between the different clusters
  • To visualize the discovered clustering solution
• Distributions
  • Stand-alone programs (vcluster and scluster)
  • Library through which an application program can access the CLUTO algorithms
Clustering algorithms
• Programs:
  • vcluster: takes as input a multidimensional representation of the objects to be clustered
  • scluster: takes as input the object similarity graph
• Parameter: -clmethod=string
  • Partitional
    • Direct k-way clustering (direct)
    • Bisecting k-way clustering (rb, rbr)
  • Agglomerative hierarchical (agglo)
  • Partitional-based agglomerative hierarchical (bagglo)
  • Graph-partitioning-based (graph)
Usage
vcluster [option parameters] MatrixFile NClusters
scluster [option parameters] GraphFile NClusters

• MatrixFile: the file that stores the objects to be clustered
• GraphFile: the file that stores the adjacency matrix of the object similarity graph
• NClusters: the number of clusters
• Optional parameters
  • Specified using -paramname or -paramname=value
  • Categorized into three groups:
    1. control various aspects of the clustering algorithm
    2. control the type of analysis and reporting that is performed on the computed clusters
    3. control the visualization of the clusters
• The output clustering solution is stored in a file named File.clustering.NClusters, where File is the name of the input file
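As a rough illustration of the calling convention above, the following Python sketch assembles a command line and the name of the solution file it should produce. The helper `build_cluto_command` and the file name `docs.mat` are hypothetical and only serve the example; the sketch builds the argument list and does not run CLUTO.

```python
def build_cluto_command(program, input_file, n_clusters, **options):
    """Return (argv, solution_file) for a vcluster or scluster run.

    Hypothetical helper: it only formats arguments and does not run CLUTO.
    """
    if program not in ("vcluster", "scluster"):
        raise ValueError("program must be 'vcluster' or 'scluster'")
    argv = [program]
    for name, value in options.items():
        # Options are written as -paramname or -paramname=value.
        argv.append(f"-{name}" if value is True else f"-{name}={value}")
    argv += [input_file, str(n_clusters)]
    # The solution is stored in <input file>.clustering.<NClusters>.
    return argv, f"{input_file}.clustering.{n_clusters}"


argv, solution_file = build_cluto_command(
    "vcluster", "docs.mat", 10, clmethod="rbr", showfeatures=True
)
print(" ".join(argv))
print(solution_file)
```

The returned list can be handed to `subprocess.run(argv)` on a machine where the CLUTO binaries are on the path.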
Input file format: matrix file
• Plain text with n+1 lines storing the data matrix for n m-dimensional objects
• Dense format
  • Metadata (in the first line): #rows, #columns
  • Each remaining line contains space-separated float values
• Sparse format
  • Metadata (in the first line): #rows, #columns, #nonzero entries
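The two layouts can be sketched with a small Python writer. The dense writer follows the description above directly. For the sparse writer, the slide only gives the header line, so the row layout used here (pairs of a 1-based column index and a value for each nonzero entry) is an assumption that should be checked against the CLUTO manual. Function names are hypothetical.

```python
def dense_matrix_text(rows):
    """Serialize equal-length rows of floats in the dense layout."""
    n_rows, n_cols = len(rows), len(rows[0])
    lines = [f"{n_rows} {n_cols}"]  # header: #rows #columns
    for row in rows:
        if len(row) != n_cols:
            raise ValueError("all rows must have the same length")
        lines.append(" ".join(f"{v:g}" for v in row))
    return "\n".join(lines) + "\n"


def sparse_matrix_text(rows):
    """Serialize rows in the sparse layout (row layout is an assumption)."""
    n_rows, n_cols = len(rows), len(rows[0])
    nnz = sum(1 for row in rows for v in row if v != 0)
    lines = [f"{n_rows} {n_cols} {nnz}"]  # header: #rows #columns #nonzeros
    for row in rows:
        # ASSUMPTION: "column value" pairs with 1-based column indices.
        lines.append(" ".join(f"{j + 1} {v:g}"
                              for j, v in enumerate(row) if v != 0))
    return "\n".join(lines) + "\n"


rows = [[1.0, 0.0, 2.5],
        [0.0, 0.0, 3.0]]
print(dense_matrix_text(rows), end="")
print(sparse_matrix_text(rows), end="")
```

Both functions produce n+1 lines for n objects, matching the description of the matrix file.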
Input file format: graph file
• Plain text with n+1 lines storing the adjacency matrix of the graph that specifies the similarity between the n objects
• Dense format:
  • Metadata (in the first line): #vertices (n)
  • Each of the remaining n lines stores n space-separated floating-point values, such that the ith value is the similarity to the ith vertex of the graph
• Sparse format:
  • Metadata (in the first line): #vertices (n) and #edges
  • Each of the remaining n lines contains the index of each adjacent vertex followed by the similarity of the corresponding edge
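A dense graph file can be produced from object vectors once a similarity measure is chosen. The sketch below assumes cosine similarity and four decimal places, neither of which is prescribed by the slides, and it keeps the self-similarity on the diagonal. The function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def dense_graph_text(vectors):
    """Serialize the full n x n similarity matrix in the dense graph layout."""
    lines = [str(len(vectors))]  # header: #vertices
    for u in vectors:
        # The ith value on a line is the similarity to the ith vertex.
        lines.append(" ".join(f"{cosine(u, v):.4f}" for v in vectors))
    return "\n".join(lines) + "\n"


vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
graph_text = dense_graph_text(vectors)
print(graph_text, end="")
```

A file written this way would be passed to scluster as GraphFile.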
Input file format: labels
• Row label file:
  • Stores the label for each of the rows of the matrix (objects)
  • -rlabelfile param
• Column label file:
  • Stores the label for each of the columns of the matrix (attributes)
  • -clabelfile param
• Row class label file:
  • Stores the class label for each of the rows of the matrix (objects)
  • -rclassfile param
Output file format
• Clustering solution file
  • n lines, with a single number per line
  • The ith line contains the number of the cluster that the ith object/row/vertex belongs to
  • Cluster numbers run from zero to the number of clusters minus one
  • If -zscores is specified, each line of this file contains two additional numbers right after the cluster number: the internal z-score and the external z-score
• Tree file
  • Produced by performing agglomerative hierarchical clustering (AHC) on top of a k-way clustering solution
  • Stored in a file in the form of a parent array: 2k-1 lines such that the ith line contains the parent of the ith node of the tree
  • For the root node, which is stored in the last line of the file, the parent is set to -1
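Both output files are simple enough to read with a few lines of Python. The sketch below is a minimal reader under two assumptions: nodes are numbered from 0 in line order, and only the first field of each tree-file line (the parent) is needed. The function names and the sample data are made up for illustration.

```python
def parse_clustering(text):
    """Return one (cluster, internal_z, external_z) tuple per object.

    The z-scores are None when the file was written without -zscores.
    """
    result = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        cluster = int(fields[0])
        if len(fields) >= 3:
            result.append((cluster, float(fields[1]), float(fields[2])))
        else:
            result.append((cluster, None, None))
    return result


def parse_tree(text):
    """Turn a parent array into (root, {parent: [children]})."""
    # ASSUMPTION: node ids start at 0 and follow the line order.
    parents = [int(line.split()[0])
               for line in text.splitlines() if line.strip()]
    root, children = None, {}
    for node, parent in enumerate(parents):
        if parent == -1:
            root = node  # the root is marked with parent -1
        else:
            children.setdefault(parent, []).append(node)
    return root, children


# Made-up tree for k = 3 clusters: 2k-1 = 5 lines, root on the last line.
root, children = parse_tree("3\n3\n4\n4\n-1\n")
assignments = parse_clustering("0\n2\n1\n0\n")
print(root, children)
print([cluster for cluster, _, _ in assignments])
```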
Output example
• Matrix/Graph information
• Settings
• Clustering/Clusters quality statistics
• Timing information
Internal clustering quality
External clustering quality
• Comparison with a reference classification (via -rclassfile)
• Overall entropy and purity
• For each cluster
  • Local entropy and purity
  • Object distribution over the classes
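Entropy and purity can be recomputed from a clustering solution and a class-label list. The sketch below uses the common definitions: the entropy of a cluster is the entropy of its class distribution, normalized here by the logarithm of the number of classes (an assumption about the normalization), purity is the fraction of the cluster taken by its largest class, and the overall values are averages weighted by cluster size. The function name and the sample data are hypothetical.

```python
import math
from collections import Counter


def external_quality(clusters, classes):
    """Return (overall_entropy, overall_purity, per_cluster_report)."""
    n = len(clusters)
    n_classes = len(set(classes))
    by_cluster = {}
    for cluster, label in zip(clusters, classes):
        by_cluster.setdefault(cluster, Counter())[label] += 1

    report, overall_entropy, overall_purity = {}, 0.0, 0.0
    for cluster, counts in sorted(by_cluster.items()):
        size = sum(counts.values())
        entropy = -sum((m / size) * math.log(m / size)
                       for m in counts.values())
        if n_classes > 1:
            entropy /= math.log(n_classes)  # ASSUMPTION: scale to [0, 1]
        purity = max(counts.values()) / size
        report[cluster] = {"size": size, "entropy": entropy,
                           "purity": purity, "classes": dict(counts)}
        # Overall values are size-weighted averages over the clusters.
        overall_entropy += size / n * entropy
        overall_purity += size / n * purity
    return overall_entropy, overall_purity, report


clusters = [0, 0, 0, 1, 1, 1]
classes = ["a", "a", "b", "b", "b", "b"]
entropy, purity, report = external_quality(clusters, classes)
print(f"entropy={entropy:.3f} purity={purity:.3f}")
```

Lower entropy and higher purity both indicate clusters that agree more closely with the reference classes.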
Cluster description
• Determine the best set of descriptive and discriminating features for each cluster (via -showfeatures)
• For each cluster
  • Top-L most descriptive features, with % of the within-cluster similarity
  • Top-L most discriminating features, with % of the dissimilarity between the cluster and the rest of the objects
Cluster tree (1/2)
• via -showtree
• Displayed in a rotated fashion
  • The first column is the root; the tree grows from left to right
  • The leaves (the clusters) are numbered from 0 to NClusters-1, and the internal nodes from NClusters to 2*NClusters-2
• If -rclassfile is specified:
  • Prints information about how the objects of the various classes are distributed in each cluster
Cluster tree (2/2)
• via -showtree and -labeltree
• Further statistics on each of the clusters
  • Size
  • Isim
  • Xsim: average similarity between the objects of each pair of clusters that are children of the same node of the tree
  • Gain: change in the value of a particular clustering criterion function
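The two similarity statistics can be illustrated on a small similarity matrix. This sketch takes Isim to be the average pairwise similarity inside one cluster (self-pairs included, which is an assumption about the exact convention) and Xsim to be the average similarity between the objects of two sibling clusters. The function names and the matrix values are made up.

```python
def average_similarity(sim, group_a, group_b):
    """Mean of sim[i][j] over all i in group_a and j in group_b."""
    total = sum(sim[i][j] for i in group_a for j in group_b)
    return total / (len(group_a) * len(group_b))


def isim(sim, members):
    # ASSUMPTION: self-pairs (i, i) are counted in the average.
    return average_similarity(sim, members, members)


def xsim(sim, left, right):
    """Average similarity between two clusters that share a parent node."""
    return average_similarity(sim, left, right)


sim = [
    [1.0, 0.8, 0.1, 0.2],
    [0.8, 1.0, 0.3, 0.0],
    [0.1, 0.3, 1.0, 0.6],
    [0.2, 0.0, 0.6, 1.0],
]
left, right = [0, 1], [2, 3]
print(round(isim(sim, left), 3), round(xsim(sim, left, right), 3))
```

A merge of two tight, well-separated clusters shows up as high Isim in the children and low Xsim at their parent.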
Cluster visualization