BIRCH: An Efficient Data Clustering Method for Very Large Databases
![Page 1: BIRCH: An Efficient Data Clustering Method for Very Large Databases](https://reader036.fdocuments.us/reader036/viewer/2022082711/56813654550346895d9dda0a/html5/thumbnails/1.jpg)
BIRCH: An Efficient Data Clustering Method for Very Large Databases
Tian Zhang, Raghu Ramakrishnan, Miron Livny
University of Wisconsin-Madison
Presented by Zhirong Tao
![Page 2: BIRCH: An Efficient Data Clustering Method for Very Large Databases](https://reader036.fdocuments.us/reader036/viewer/2022082711/56813654550346895d9dda0a/html5/thumbnails/2.jpg)
Outline of the Paper
Background
Clustering Feature and CF Tree
The BIRCH Clustering Algorithm
Performance Studies
![Page 3: BIRCH: An Efficient Data Clustering Method for Very Large Databases](https://reader036.fdocuments.us/reader036/viewer/2022082711/56813654550346895d9dda0a/html5/thumbnails/3.jpg)
Background
A cluster is a collection of data objects that are similar to one another within the same cluster and are dissimilar to the objects in other clusters.
The process of grouping a set of physical or abstract objects into classes of similar objects is called clustering.
![Page 4: BIRCH: An Efficient Data Clustering Method for Very Large Databases](https://reader036.fdocuments.us/reader036/viewer/2022082711/56813654550346895d9dda0a/html5/thumbnails/4.jpg)
Background (Contd)
Given N d-dimensional data points in a cluster: {Xi} where i = 1, 2, …, N, the centroid X0, radius R and diameter D of the cluster are defined as:
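The formulas on this slide did not survive in the transcript. For reference, the definitions given in the BIRCH paper can be written as below; the vector-arrow notation is an assumption of this sketch, and squaring a vector denotes its squared Euclidean norm.

```latex
\vec{X_0} = \frac{\sum_{i=1}^{N} \vec{X_i}}{N}
\qquad
R = \left( \frac{\sum_{i=1}^{N} (\vec{X_i} - \vec{X_0})^2}{N} \right)^{1/2}
\qquad
D = \left( \frac{\sum_{i=1}^{N} \sum_{j=1}^{N} (\vec{X_i} - \vec{X_j})^2}{N(N-1)} \right)^{1/2}
```

R measures the spread of the member points around the centroid, and D measures the spread over all pairs of points in the cluster.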
![Page 5: BIRCH: An Efficient Data Clustering Method for Very Large Databases](https://reader036.fdocuments.us/reader036/viewer/2022082711/56813654550346895d9dda0a/html5/thumbnails/5.jpg)
Background (Contd)
Given the centroids of two clusters, X01 and X02:
The centroid Euclidean distance D0:
The centroid Manhattan distance D1:
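The two distance formulas are missing from the transcript. In the BIRCH paper, D0 is the Euclidean distance between the two centroids, D0 = ((X01 - X02)^2)^(1/2), and D1 is the sum of the absolute coordinate differences, D1 = Σ |X01(i) - X02(i)| over the d dimensions. The following is a minimal Python sketch of these measures together with the centroid, radius and diameter; the function names and the sample points are illustrative assumptions, not part of the slides.

```python
import math

def centroid(points):
    """Coordinate-wise mean of the points."""
    n = len(points)
    return [sum(p[k] for p in points) / n for k in range(len(points[0]))]

def radius(points):
    """R: root-mean-square distance from the points to the centroid."""
    c = centroid(points)
    return math.sqrt(sum(math.dist(p, c) ** 2 for p in points) / len(points))

def diameter(points):
    """D: root-mean-square pairwise distance (requires N >= 2)."""
    n = len(points)
    total = sum(math.dist(p, q) ** 2 for p in points for q in points)
    return math.sqrt(total / (n * (n - 1)))

def d0(c1, c2):
    """Centroid Euclidean distance."""
    return math.dist(c1, c2)

def d1(c1, c2):
    """Centroid Manhattan distance."""
    return sum(abs(a - b) for a, b in zip(c1, c2))

# Hypothetical sample data: two small 2-d clusters.
cluster_a = [(0.0, 0.0), (2.0, 0.0)]
cluster_b = [(3.0, 4.0), (5.0, 4.0)]
ca, cb = centroid(cluster_a), centroid(cluster_b)

print(ca, cb)                                  # [1.0, 0.0] [4.0, 4.0]
print(radius(cluster_a), diameter(cluster_a))  # 1.0 2.0
print(d0(ca, cb), d1(ca, cb))                  # 5.0 7.0
```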
![Page 6: BIRCH: An Efficient Data Clustering Method for Very Large Databases](https://reader036.fdocuments.us/reader036/viewer/2022082711/56813654550346895d9dda0a/html5/thumbnails/6.jpg)
BIRCH: Hierarchical Method
A distance-based approach:
Assume there is a distance measurement between any two instances.
Represent clusters by some kind of ‘center’ measure.
A hierarchical clustering is a sequence of partitions in which each partition is nested into the next partition in the sequence.
![Page 7: BIRCH: An Efficient Data Clustering Method for Very Large Databases](https://reader036.fdocuments.us/reader036/viewer/2022082711/56813654550346895d9dda0a/html5/thumbnails/7.jpg)
Clustering Feature Definition
Given N d-dimensional data points in a cluster: {Xi} where i = 1, 2, …, N,
CF = (N, LS, SS)
N is the number of data points in the cluster, LS is the linear sum of the N data points, and SS is the square sum of the N data points.
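As a small illustration of the definition, the sketch below builds a CF triple from raw points. It assumes that LS is kept as a d-dimensional vector and SS as a single scalar (the sum of squared norms); the function name and the sample points are hypothetical.

```python
def clustering_feature(points):
    """Return the triple (N, LS, SS) summarizing a list of d-dimensional points."""
    n = len(points)
    d = len(points[0])
    ls = [sum(p[k] for p in points) for k in range(d)]   # linear sum, one value per dimension
    ss = sum(x * x for p in points for x in p)           # square sum, a scalar
    return n, ls, ss

points = [(1.0, 2.0), (3.0, 4.0), (5.0, 6.0)]
print(clustering_feature(points))  # (3, [9.0, 12.0], 91.0)
```

The triple has a fixed size regardless of how many points the cluster contains, which is what makes it usable as a compact summary.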
![Page 8: BIRCH: An Efficient Data Clustering Method for Very Large Databases](https://reader036.fdocuments.us/reader036/viewer/2022082711/56813654550346895d9dda0a/html5/thumbnails/8.jpg)
CF Additive Theorem
Assume that CF1 = (N1, LS1, SS1) and CF2 = (N2, LS2, SS2) are the CF entries of two disjoint subclusters.
The CF entry of the subcluster formed by merging the two disjoint subclusters is:
CF1 + CF2 = (N1 + N2, LS1 + LS2, SS1 + SS2)
The CF entries can be stored and calculated incrementally and consistently as subclusters are merged or new data points are inserted.
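The additivity can be checked numerically. The sketch below merges the CFs of two disjoint subclusters and confirms that the result equals the CF computed from the combined points; it also derives the centroid and radius from the CF alone, using the identity R^2 = SS/N - |LS/N|^2 that follows from expanding the radius definition. All function names and sample values are assumptions made for illustration.

```python
import math

def cf_of(points):
    """CF triple (N, LS, SS) of a list of points; SS is kept as a scalar."""
    d = len(points[0])
    ls = tuple(sum(p[k] for p in points) for k in range(d))
    ss = sum(x * x for p in points for x in p)
    return (len(points), ls, ss)

def cf_merge(cf1, cf2):
    """CF Additive Theorem: component-wise sum of two CF triples."""
    n1, ls1, ss1 = cf1
    n2, ls2, ss2 = cf2
    return (n1 + n2, tuple(a + b for a, b in zip(ls1, ls2)), ss1 + ss2)

def cf_centroid(cf):
    n, ls, _ = cf
    return tuple(x / n for x in ls)

def cf_radius(cf):
    # R^2 = SS/N - |LS/N|^2; clamp at 0 to guard against rounding error.
    n, ls, ss = cf
    return math.sqrt(max(ss / n - sum((x / n) ** 2 for x in ls), 0.0))

a = [(0.0, 0.0), (2.0, 0.0)]
b = [(4.0, 0.0), (6.0, 0.0)]
merged = cf_merge(cf_of(a), cf_of(b))

assert merged == cf_of(a + b)   # same CF as recomputing from all four points
print(merged)                   # (4, (12.0, 0.0), 56.0)
print(cf_centroid(merged))      # (3.0, 0.0)
print(cf_radius(merged))        # sqrt(5), about 2.236
```

No raw points are needed after the CFs have been formed, so merging costs O(d) regardless of cluster size.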
![Page 9: BIRCH: An Efficient Data Clustering Method for Very Large Databases](https://reader036.fdocuments.us/reader036/viewer/2022082711/56813654550346895d9dda0a/html5/thumbnails/9.jpg)
CF-Tree
A CF-tree is a height-balanced tree with two parameters: branching factor (B for nonleaf nodes and L for leaf nodes) and threshold T.
The entry in each nonleaf node has the form [CFi, childi].
The entry in each leaf node is a CF; each leaf node has two pointers: `prev' and `next'.
Threshold value T: the diameter (alternatively, the radius) of each leaf entry has to be less than T.
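One possible in-memory layout for these nodes is sketched below with Python dataclasses. The class names, the `scan_leaves` helper, and the values chosen for B, L and T are hypothetical; in the paper B and L follow from the page size rather than being picked by hand.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional, Tuple, Union

CF = Tuple[int, Tuple[float, ...], float]  # (N, LS, SS)

@dataclass
class LeafNode:
    entries: List[CF] = field(default_factory=list)  # at most L entries
    # Leaves are chained so that all leaf entries can be scanned in order.
    prev: Optional["LeafNode"] = field(default=None, repr=False, compare=False)
    next: Optional["LeafNode"] = field(default=None, repr=False, compare=False)

@dataclass
class NonLeafNode:
    # Each entry is [CF_i, child_i]; CF_i summarizes the subtree under child_i.
    entries: List[Tuple[CF, Union["NonLeafNode", LeafNode]]] = field(default_factory=list)

@dataclass
class CFTree:
    B: int      # max entries per nonleaf node
    L: int      # max entries per leaf node
    T: float    # threshold on leaf-entry diameter (or radius)
    root: Union[NonLeafNode, LeafNode] = field(default_factory=LeafNode)

def scan_leaves(first_leaf: Optional[LeafNode]) -> Iterator[CF]:
    """Follow the `next` pointers and yield every leaf entry."""
    node = first_leaf
    while node is not None:
        yield from node.entries
        node = node.next

left = LeafNode(entries=[(2, (2.0, 0.0), 4.0)])
right = LeafNode(entries=[(2, (10.0, 0.0), 52.0)])
left.next, right.prev = right, left

root = NonLeafNode(entries=[((2, (2.0, 0.0), 4.0), left),
                            ((2, (10.0, 0.0), 52.0), right)])
tree = CFTree(B=4, L=4, T=1.5, root=root)
print(list(scan_leaves(left)))
```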
![Page 10: BIRCH: An Efficient Data Clustering Method for Very Large Databases](https://reader036.fdocuments.us/reader036/viewer/2022082711/56813654550346895d9dda0a/html5/thumbnails/10.jpg)
BIRCH Algorithm Overview
![Page 11: BIRCH: An Efficient Data Clustering Method for Very Large Databases](https://reader036.fdocuments.us/reader036/viewer/2022082711/56813654550346895d9dda0a/html5/thumbnails/11.jpg)
Phase 1
![Page 12: BIRCH: An Efficient Data Clustering Method for Very Large Databases](https://reader036.fdocuments.us/reader036/viewer/2022082711/56813654550346895d9dda0a/html5/thumbnails/12.jpg)
Insertion Algorithm
Identifying the appropriate leaf.
Modifying the leaf: assume the closest leaf entry is Li. If Li can `absorb' `Ent' without violating the threshold T, its CF is updated; otherwise a new entry for `Ent' is added to the leaf if it has space; otherwise the leaf node is split.
Modifying the path to the leaf: if the parent has space for the new entry, it is added there; otherwise the parent is split, and so on up to the root.
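The leaf-level step can be sketched as follows. This is a simplified illustration, not the full algorithm: it handles a single leaf only, applies the threshold to the radius (the slides allow either diameter or radius), picks the closest entry by centroid Euclidean distance, and merely reports when a split would be required instead of performing it. All names and sample values are assumptions.

```python
import math

def cf_from_point(p):
    return (1, tuple(p), sum(x * x for x in p))

def cf_merge(a, b):
    return (a[0] + b[0], tuple(x + y for x, y in zip(a[1], b[1])), a[2] + b[2])

def cf_centroid(cf):
    return tuple(x / cf[0] for x in cf[1])

def cf_radius(cf):
    n, ls, ss = cf
    return math.sqrt(max(ss / n - sum((x / n) ** 2 for x in ls), 0.0))

def insert_into_leaf(entries, point, T, L):
    """Apply the 'modifying the leaf' step and report which case occurred."""
    ent = cf_from_point(point)
    if entries:
        # Closest leaf entry L_i, measured by centroid Euclidean distance.
        i = min(range(len(entries)),
                key=lambda k: math.dist(cf_centroid(entries[k]), point))
        candidate = cf_merge(entries[i], ent)
        if cf_radius(candidate) < T:      # L_i can absorb Ent
            entries[i] = candidate
            return "absorbed"
    if len(entries) < L:                  # room for a new entry
        entries.append(ent)
        return "added"
    return "split"                        # caller would split the leaf here

leaf = []
stream = [(0.0, 0.0), (0.5, 0.0), (10.0, 0.0), (20.0, 0.0)]
results = [insert_into_leaf(leaf, p, T=1.0, L=2) for p in stream]
print(results)  # ['added', 'absorbed', 'added', 'split']
```

In the "split" case the sketch leaves the leaf unchanged; a full implementation would split the node, insert the new entry, and then update or split the ancestors along the path to the root.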
![Page 13: BIRCH: An Efficient Data Clustering Method for Very Large Databases](https://reader036.fdocuments.us/reader036/viewer/2022082711/56813654550346895d9dda0a/html5/thumbnails/13.jpg)
Phase 3: Global Clustering
Use an existing global or semi-global algorithm to cluster all the leaf entries across the boundaries of different nodes.
This way we can overcome Anomaly 1:
Anomaly 1: Depending upon the order of data input and the degree of skew, it is also possible that two subclusters that should not be in one cluster are kept in the same node.
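As one hedged illustration of this phase, the sketch below runs a naive agglomerative merge directly on leaf-entry CFs: it repeatedly merges the two entries whose centroids are closest until k clusters remain. The choice of centroid Euclidean distance, the stopping rule, the function names, and the sample entries are all assumptions; the slides only require that some global or semi-global algorithm be applied to the leaf entries.

```python
import math
from itertools import combinations

def cf_merge(a, b):
    return (a[0] + b[0], tuple(x + y for x, y in zip(a[1], b[1])), a[2] + b[2])

def cf_centroid(cf):
    return tuple(x / cf[0] for x in cf[1])

def global_cluster(leaf_entries, k):
    """Merge the closest pair of CFs (by centroid distance) until k remain."""
    clusters = list(leaf_entries)
    while len(clusters) > k:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: math.dist(cf_centroid(clusters[ij[0]]),
                                            cf_centroid(clusters[ij[1]])))
        merged = cf_merge(clusters[i], clusters[j])
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return clusters

# Hypothetical leaf entries gathered from different leaf nodes: (N, LS, SS).
entries = [(2, (0.5, 0.0), 0.25), (1, (1.0, 0.0), 1.0),
           (1, (10.0, 0.0), 100.0), (2, (21.0, 0.0), 220.5)]
result = global_cluster(entries, k=2)
print(result)  # [(3, (31.0, 0.0), 320.5), (3, (1.5, 0.0), 1.25)]
```

Because the input is the set of leaf entries rather than the raw data, the quadratic pair search runs over a much smaller number of items than the original dataset.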
![Page 14: BIRCH: An Efficient Data Clustering Method for Very Large Databases](https://reader036.fdocuments.us/reader036/viewer/2022082711/56813654550346895d9dda0a/html5/thumbnails/14.jpg)
Comparison of BIRCH and CLARANS
With a synthetically generated dataset:
![Page 15: BIRCH: An Efficient Data Clustering Method for Very Large Databases](https://reader036.fdocuments.us/reader036/viewer/2022082711/56813654550346895d9dda0a/html5/thumbnails/15.jpg)
Summary
Compared with previous distance-based approaches (e.g., K-Means and CLARANS), BIRCH is appropriate for very large datasets.
BIRCH can work with any given amount of memory, and the I/O complexity is a little more than one scan of data.