• Tian Zhang
• Raghu Ramakrishnan
• Miron Livny
• Presented by: Peter Vile
BIRCH: A New data clustering Algorithm and Its Applications
![Page 2: Tian Zhang Raghu Ramakrishnan Miron Livny Presented by: Peter Vile BIRCH: A New data clustering Algorithm and Its Applications.](https://reader035.fdocuments.us/reader035/viewer/2022062320/56649d0b5503460f949df230/html5/thumbnails/2.jpg)
• Data Clustering:
Problem of dividing N data points into K groups so as to minimize an intra-group difference metric.
Many algorithms already exist.
Problem:
Due to abundance of local minima, there is no way to find a globally minimal solution without trying all possible partitions.
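To make the objective concrete, here is a minimal sketch (not from the slides) using the within-cluster sum of squares as the intra-group difference metric; the point values and partitions are invented for illustration:

```python
# Hypothetical sketch: within-cluster sum of squares (WCSS) as the
# intra-group difference metric, for a fixed partition of 1-D points.

def wcss(clusters):
    """Sum over clusters of squared distances to each cluster's centroid."""
    total = 0.0
    for points in clusters:
        centroid = sum(points) / len(points)
        total += sum((x - centroid) ** 2 for x in points)
    return total

# Two candidate partitions of the same six points:
good = [[1.0, 1.2, 0.8], [9.0, 9.1, 8.9]]   # natural grouping
bad  = [[1.0, 1.2, 9.0], [0.8, 9.1, 8.9]]   # mixed grouping

print(wcss(good) < wcss(bad))  # True: the natural grouping scores lower
```

With K groups and N points there are exponentially many such partitions, which is why exhaustively finding the global minimum is infeasible.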
• Probability-Based Clustering
• COBWEB – uses probabilistic measurements (Category Utility) for making decisions – hierarchical, iterative
• Disadvantage – Category Utility takes time and memory – tends to overfit
Other Methods
Cobweb’s Limits
- assumes the probability distributions of attributes are independent
- attributes can only take discrete values; continuous data is approximated by discretization
- storing and manipulating large sets of data becomes infeasible
The Competition
• Distance-Based Clustering
• KMEANS, CLARANS
• CLARANS – like KMEANS – a node is a set of medoids – starts at a random node and moves to the closest neighbor
BIRCH
• Good
  – doesn’t assume attributes are independent
  – minimizes memory usage
  – scans the data once from disk
  – can handle very large data sets (uses the concept of summarization)
  – exploits the observation that not every data point is important for clustering purposes
Limitations of BIRCH
• Handles only metric data
Definitions
• Centroid – avg value
• Radius – std dev
• Diameter – avg. pairwise distance within a cluster

$X_0 = \frac{\sum_{i=1}^{N} X_i}{N}$

$R = \left( \frac{\sum_{i=1}^{N} (X_i - X_0)^2}{N} \right)^{1/2}$

$D = \left( \frac{\sum_{i=1}^{N} \sum_{j=1}^{N} (X_i - X_j)^2}{N(N-1)} \right)^{1/2}$
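The three measures can be sketched directly in Python; this is a simplified illustration for 1-D points, not the paper's code:

```python
import math

def centroid(xs):
    # X0 = sum(X_i) / N
    return sum(xs) / len(xs)

def radius(xs):
    # R = sqrt( sum((X_i - X0)^2) / N )
    x0 = centroid(xs)
    return math.sqrt(sum((x - x0) ** 2 for x in xs) / len(xs))

def diameter(xs):
    # D = sqrt( sum_i sum_j (X_i - X_j)^2 / (N * (N - 1)) )
    n = len(xs)
    total = sum((xi - xj) ** 2 for xi in xs for xj in xs)
    return math.sqrt(total / (n * (n - 1)))

points = [1.0, 2.0, 3.0]
print(centroid(points))  # 2.0
print(radius(points))    # ~0.816
print(diameter(points))  # ~1.414
```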
How Far Away

• Given the centroids $X0_1$ and $X0_2$ of two clusters:
• centroid Euclidean distance D0
• centroid Manhattan distance D1
  – d is the dimension

$D0 = \left( (X0_1 - X0_2)^2 \right)^{1/2}$

$D1 = |X0_1 - X0_2| = \sum_{i=1}^{d} \left| X0_1^{(i)} - X0_2^{(i)} \right|$
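As a quick sketch (not from the slides), both centroid distances for d-dimensional centroids:

```python
def d0(c1, c2):
    # centroid Euclidean distance
    return sum((a - b) ** 2 for a, b in zip(c1, c2)) ** 0.5

def d1(c1, c2):
    # centroid Manhattan distance
    return sum(abs(a - b) for a, b in zip(c1, c2))

print(d0([0.0, 0.0], [3.0, 4.0]))  # 5.0
print(d1([0.0, 0.0], [3.0, 4.0]))  # 7.0
```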
More Distances
• Average Inter-cluster Distance (D2)
• Average Intra-cluster Distance (D3)
• Variance Increase Distance (D4)
$D2 = \left( \frac{\sum_{i=1}^{N_1} \sum_{j=N_1+1}^{N_1+N_2} (X_i - X_j)^2}{N_1 N_2} \right)^{1/2}$

$D3 = \left( \frac{\sum_{i=1}^{N_1+N_2} \sum_{j=1}^{N_1+N_2} (X_i - X_j)^2}{(N_1+N_2)(N_1+N_2-1)} \right)^{1/2}$

$D4 = \sum_{k=1}^{N_1+N_2} \left( X_k - \frac{\sum_{l=1}^{N_1+N_2} X_l}{N_1+N_2} \right)^2 - \sum_{i=1}^{N_1} \left( X_i - \frac{\sum_{l=1}^{N_1} X_l}{N_1} \right)^2 - \sum_{j=N_1+1}^{N_1+N_2} \left( X_j - \frac{\sum_{l=N_1+1}^{N_1+N_2} X_l}{N_2} \right)^2$
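D2 and D3 can be sketched for two 1-D clusters as follows (a simplified illustration, not the paper's code):

```python
def d2(c1, c2):
    # average inter-cluster distance between clusters c1 and c2
    total = sum((x - y) ** 2 for x in c1 for y in c2)
    return (total / (len(c1) * len(c2))) ** 0.5

def d3(c1, c2):
    # average intra-cluster distance of the merged cluster c1 + c2
    merged = c1 + c2
    n = len(merged)
    total = sum((x - y) ** 2 for x in merged for y in merged)
    return (total / (n * (n - 1))) ** 0.5

print(d2([0.0, 0.0], [2.0, 2.0]))  # 2.0
print(d3([0.0, 0.0], [2.0, 2.0]))  # ~1.633
```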
Clustering Feature (CF)

• “A Clustering Feature (CF) entry is a triple summarizing the information that we maintain about a sub-cluster of data points.”
• CF Definition: CF = (N, LS, SS)
  – N : number of data points in the cluster
  – LS : linear sum of the N data points, $LS = \sum_{i=1}^{N} X_i$
  – SS : square sum of the N data points, $SS = \sum_{i=1}^{N} X_i^2$
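A minimal sketch of the CF triple for 1-D points, including how the centroid and radius fall out of (N, LS, SS) alone (helper names are my own, not the paper's):

```python
def make_cf(points):
    # CF = (N, LS, SS)
    n = len(points)
    ls = sum(points)
    ss = sum(x * x for x in points)
    return (n, ls, ss)

def cf_centroid(cf):
    n, ls, _ = cf
    return ls / n

def cf_radius(cf):
    # R^2 = SS/N - (LS/N)^2, derived by expanding the radius definition
    n, ls, ss = cf
    return (ss / n - (ls / n) ** 2) ** 0.5

cf = make_cf([1.0, 2.0, 3.0])
print(cf)               # (3, 6.0, 14.0)
print(cf_centroid(cf))  # 2.0
print(cf_radius(cf))    # ~0.816, same as computing R from the raw points
```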
• CF Representativity Theorem: Given the CF entries for all sub-clusters, all of the measurements defined above ($X_0$, R, D, and D0–D4) can be computed accurately.
• CF Additivity Theorem: Assume that CF1 and CF2 are the CF entries of two disjoint sub-clusters. Then the CF entry of the sub-cluster that is formed by merging the two disjoint sub-clusters is :
$CF_1 = (N_1, LS_1, SS_1)$

$CF_2 = (N_2, LS_2, SS_2)$

$CF_1 + CF_2 = (N_1 + N_2, LS_1 + LS_2, SS_1 + SS_2)$
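The additivity theorem is just componentwise addition of the triples; a sketch with CF tuples of the form (N, LS, SS):

```python
def merge_cf(cf1, cf2):
    # CF1 + CF2 = (N1 + N2, LS1 + LS2, SS1 + SS2)
    n1, ls1, ss1 = cf1
    n2, ls2, ss2 = cf2
    return (n1 + n2, ls1 + ls2, ss1 + ss2)

cf1 = (2, 3.0, 5.0)   # CF of the points [1.0, 2.0]
cf2 = (1, 3.0, 9.0)   # CF of the point  [3.0]
print(merge_cf(cf1, cf2))  # (3, 6.0, 14.0) == CF of [1.0, 2.0, 3.0]
```

This is what makes the CF-tree cheap to maintain: merging two sub-clusters never requires revisiting their raw points.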
CF-tree Features
• Has two parameters
  – 1. Branching Factor
    • B – non-leaf node ([CF i, child i], i = 1..B)
      – child i – pointer to its i-th child node
    • L – leaf node ([CF j, prev, next], j = 1..L)
  – 2. Threshold T
    • specifies the size of each leaf entry
      – diameter (D) of each leaf entry < T
      – or radius (R) of each leaf entry < T
CF-tree Features (continued)

• Tree size is a function of T
  – tree size = f(T)
  – as T increases, tree size decreases
• Page Size (P)
  – a node is required to fit in a page of size P
  – P can be varied for performance tuning
• The CF-tree is built dynamically as new data objects are inserted
CF-tree
• Two algorithms are used to build a CF-tree
  – 1. Insertion Algorithm
    • Purpose: building the CF-tree
  – 2. Rebuilding Algorithm
    • Purpose: rebuild the whole tree with a larger T (smaller size)
    • this happens when the CF-tree size limit is exceeded
Insertion Algorithm
• 1. Identify the appropriate leaf
  – at each non-leaf node, use the distance metric to choose the closest branch
• 2. Insert the entry into the leaf node
  – merge with the closest leaf entry [CF i, prev, next]
  – if T is violated, make a new leaf entry [CF i+1, prev, next]
  – if L is violated, split into two leaf nodes
    • choose the two entries that are farthest apart as seeds
    • put each remaining entry in the leaf node with the closer seed
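Step 1 above can be sketched as a single helper; the node layout and names here are hypothetical simplifications (1-D points, CF = (N, LS, SS)), not the paper's data structures:

```python
def closest_child(children, x):
    """Pick the child entry whose CF centroid (LS/N) is closest to point x.

    `children` is a hypothetical list of (CF, subtree) pairs.
    """
    def dist(entry):
        (n, ls, _), _ = entry
        return abs(ls / n - x)
    return min(children, key=dist)

children = [((2, 2.0, 2.2), "left subtree"),     # centroid 1.0
            ((3, 27.0, 243.5), "right subtree")]  # centroid 9.0
print(closest_child(children, 8.5)[1])  # "right subtree"
```

Descending from the root, this choice is repeated at each non-leaf level until a leaf is reached.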
Insertion Algorithm
• 3. Update the path back to the root
  – if B is violated, split the node
  – each CF should be the sum of its child CFs
• 4. A Merging Refinement
  – try to combine the two closest children of the node that did not split
  – might free space
Rebuilding Algorithm
• When to do it?– If the CF-tree size limit is exceeded
• What does it do?
  – creates a new tree with a larger T (diameter/radius threshold)
    • larger T -> smaller tree size
• How?
  – deletes each path from the old tree and adds the corresponding path to the new tree
Rebuilding Algorithm
• Procedure
  – ‘OldCurrentPath’ starts at the leftmost branch
  – 1. Create the ‘NewCurrentPath’ in the new tree
    • copy nodes from OldCurrentPath into NewCurrentPath
  – 2. Insert the leaf entries in ‘OldCurrentPath’ into the new tree
    • if a leaf entry does not go into NewCurrentPath, remove it from NewCurrentPath
Rebuilding Algorithm
– 3. Free space in ‘OldCurrentPath’ and ‘NewCurrentPath’
  • delete the nodes in ‘OldCurrentPath’
  • if ‘NewCurrentPath’ is empty, delete its nodes
– 4. Process the next path in the old tree
• Only needs enough extra pages to cover ‘OldCurrentPath’
  – usually the height of the tree
Potential Problems
• Anomaly 1: Natural cluster is split across two leaf nodes, or two distant clusters are placed in the same node
• Anomaly 2: A sub-cluster ends up in the wrong leaf node
Reducibility Theorem: Assume we rebuild CF-tree $t_{i+1}$ of threshold $T_{i+1}$ from CF-tree $t_i$ of threshold $T_i$ by the above algorithm, and let $S_i$ and $S_{i+1}$ be the sizes of $t_i$ and $t_{i+1}$ respectively. If $T_{i+1} \geq T_i$, then $S_{i+1} \leq S_i$, and the transformation from $t_i$ to $t_{i+1}$ needs at most $h$ extra pages of memory, where $h$ is the height of $t_i$.
BIRCH Clustering Algorithm
• Four Phases
  – 1. Loading
    • scan all data and build an initial in-memory CF-tree
  – 2. Optional Condensing
    • rebuild the CF-tree to make it smaller
    • faster analysis, but reduced accuracy
  – 3. Global Clustering
    • run a clustering algorithm (e.g. KMEANS) on the CFs
    • handles anomaly 1
BIRCH Clustering Algorithm
– 4. Optional Refining
  • use the centroids of the clusters as seeds
  • scan the data again and assign each point to the closest seed
  • handles anomaly 2
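The refinement pass above can be sketched in a few lines; this is a simplified 1-D illustration with invented data, not the paper's implementation:

```python
def refine(points, seeds):
    """Phase 4 sketch: assign each point to the closest seed centroid."""
    clusters = {s: [] for s in seeds}
    for x in points:
        best = min(seeds, key=lambda s: abs(s - x))
        clusters[best].append(x)
    return clusters

points = [0.9, 1.1, 8.8, 9.2, 1.0]
seeds = [1.0, 9.0]  # centroids from the global clustering phase
print(refine(points, seeds))
# {1.0: [0.9, 1.1, 1.0], 9.0: [8.8, 9.2]}
```

Because every point is compared against the final seeds, a sub-cluster that landed in the wrong leaf (anomaly 2) is pulled back to its proper cluster.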
BIRCH Phase 1

• Building the CF-tree
  – Heuristic threshold (T defaults to 0)
    • when rebuilding, a new T (diameter/radius) is needed
    • use the average distance of the closest pairs of leaf entries in the same node
    • should reduce the size of the tree by about half
BIRCH Phase 1
– Outlier Handling Option
  • a CF with a small N (# of data points) is saved to disk
  • try to reinsert it when
    – we run out of memory
    – we finish reading the data
  • if the data is noisy, this improves runtime and accuracy
– Delay Split Option
  • when about to run out of memory
  • CFs that would cause the tree to split are saved to disk
How does this compare to other clustering methods?
BIRCH was run against KMEANS and CLARANS.
Results

(the result figures are not reproduced in this transcript)

Runtime

(the runtime figures are not reproduced in this transcript)
Conclusions

• BIRCH vs CLARANS and KMEANS
  – runs faster (fewer scans)
  – less order sensitive
  – uses less memory
Where can I use this?
• Interactive and Iterative Pixel Classification
  – MVI (Multi-band Vegetation Imager)
  – BIRCH helps classify pixels through clustering
• Codebook Generation in Image Compression
  – compressing visual data to save space
  – code book: vector code words for image blocks
  – BIRCH assigns the nearest code word to each vector
Main limitations of BIRCH?
• Ability to only handle metric data.
Name the two algorithms used to build the CF-tree in BIRCH.

1. Insertion
2. Rebuilding
What is the purpose of phase 4 in the BIRCH clustering algorithm?

- all copies of a given data point go to the same cluster
- option to discard outliers
- can converge to a minimum