Clustering Sequential Data: Research Paper Review

Presented by Glynis Hawley

April 28, 2003

On the Optimal Clustering of Sequential Data, by Cheng-Ru Lin and Ming-Syan Chen, Electrical Engineering Department, National Taiwan University, Taipei, Taiwan

Second SIAM International Conference on Data Mining, April 11-13, 2002

http://www.siam.org/meetings/sdm02/proceedings/sdm02-09.pdf

Agenda

Introduction: What is sequential clustering?

Problem definition for algorithm design

Optimal Algorithm: SCOPT

Greedy Algorithm: SCGD

Conclusion

Sequential Clustering Problem

Attributes and sequence of objects are both important.

Objects within a cluster form a continuous region.

An object within one cluster may be closer to the centroid of a different cluster than it is to its own centroid.
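The last point can be checked with a small numeric sketch. The one-dimensional values and the cluster boundary below are invented purely for illustration; they do not come from the paper.

```python
# Hypothetical 1-D sequence; its order is fixed and clusters must be contiguous.
sequence = [0.0, 0.0, 10.0, 0.0, 10.0, 10.0, 10.0]

# One contiguous 2-partition: objects 1-4 and objects 5-7.
clusters = [sequence[:4], sequence[4:]]


def centroid(values):
    """Mean of a list of scalar attribute values."""
    return sum(values) / len(values)


centroids = [centroid(c) for c in clusters]   # [2.5, 10.0]

# The third object (value 10.0) belongs to the first cluster because of its
# position in the sequence, not because of its attribute value.
obj = sequence[2]
dist_own = abs(obj - centroids[0])     # 7.5
dist_other = abs(obj - centroids[1])   # 0.0

print(dist_own, dist_other)
```

A conventional method would assign that object to the second cluster; a sequential method keeps it where its position in the sequence puts it.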

Conventional Clustering vs. Sequential Clustering

[Figure: two scatter plots of 15 numbered objects on X and Y axes, one labeled "Conventional Clustering" and the other "Sequential Clustering"; the plotted clusters are not recoverable from the transcript.]

Application Areas

Analysis of motion patterns of objects.

– Cellular phones.

Analysis of status logs of running machines.

Problem Definition

Partitioning problem

– n sequential objects into k clusters

Dissimilarity measurement

– Squared Euclidean distance

Cluster quality

– Cost measurement: penalizes clusters for amount of dissimilarity of objects

Best solution minimizes the sum of the costs of all clusters

Cost Definition

Cost of a cluster: summation over all m objects of the squared Euclidean distance of the object from the cluster centroid.

$$\mathrm{Cost}(Cl) = \sum_{i=1}^{m} D_E^2(o_i, c)$$
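As a minimal sketch of this cost definition, the functions below compute the cost of one cluster and the total cost of a partition given by separator positions. The function names and the representation of objects as tuples of floats are assumptions made for illustration.

```python
def centroid(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    m = len(points)
    dim = len(points[0])
    return tuple(sum(p[d] for p in points) / m for d in range(dim))


def cluster_cost(points):
    """Sum of squared Euclidean distances from each object to the centroid."""
    c = centroid(points)
    return sum(sum((x - cx) ** 2 for x, cx in zip(p, c)) for p in points)


def partition_cost(objects, separators):
    """Total cost of the contiguous clusters cut at the given indices.

    A separator s means a cut between objects[s - 1] and objects[s].
    """
    bounds = [0, *separators, len(objects)]
    return sum(cluster_cost(objects[a:b]) for a, b in zip(bounds, bounds[1:]))


objects = [(0.0, 0.0), (2.0, 0.0), (4.0, 0.0), (10.0, 0.0), (12.0, 0.0)]
print(cluster_cost(objects[:3]))       # 8.0
print(partition_cost(objects, [3]))    # 10.0
```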

Sequential Clustering Algorithms

Optimal Sequential Clustering Algorithm

– SCOPT

Greedy Sequential Clustering Algorithm

– SCGD

Algorithm SCOPT

Determines optimal k-partition of a set of sequential objects.

Uses the property of optimal substructure.

– Systematically solves all possible sub-problems.

– Stores results to be used in later steps.
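The optimal-substructure idea can be sketched as a dynamic program over prefixes of the sequence. This is an illustrative reconstruction, not the paper's pseudocode: the table layout, the function name, and the prefix-sum trick for segment costs are assumptions.

```python
def optimal_sequential_clustering(objects, k):
    """Return (minimum total cost, separators) for a contiguous k-partition."""
    n = len(objects)
    if not 1 <= k <= n:
        raise ValueError("need 1 <= k <= n")
    dim = len(objects[0])

    # Prefix sums of coordinates and of squared norms, so that the cost of
    # any segment can be computed without rescanning its objects.
    pref = [[0.0] * dim]
    pref_sq = [0.0]
    for p in objects:
        pref.append([s + x for s, x in zip(pref[-1], p)])
        pref_sq.append(pref_sq[-1] + sum(x * x for x in p))

    def seg_cost(a, b):
        """Cost of objects[a:b]: sum of |x|^2 minus |sum of x|^2 / m."""
        m = b - a
        sq_of_sum = sum((pref[b][d] - pref[a][d]) ** 2 for d in range(dim))
        return (pref_sq[b] - pref_sq[a]) - sq_of_sum / m

    inf = float("inf")
    # best[j][i]: minimum cost of splitting the first i objects into j clusters.
    best = [[inf] * (n + 1) for _ in range(k + 1)]
    cut = [[0] * (n + 1) for _ in range(k + 1)]
    best[0][0] = 0.0
    for j in range(1, k + 1):
        for i in range(j, n + 1):
            for s in range(j - 1, i):          # last cluster is objects[s:i]
                cand = best[j - 1][s] + seg_cost(s, i)
                if cand < best[j][i]:
                    best[j][i] = cand
                    cut[j][i] = s

    # Walk the stored cut points backwards to recover the separators.
    separators = []
    i = n
    for j in range(k, 1, -1):
        i = cut[j][i]
        separators.append(i)
    return best[k][n], separators[::-1]


data = [(1.0,), (2.0,), (3.0,), (10.0,), (11.0,), (12.0,),
        (20.0,), (21.0,), (22.0,)]
cost, seps = optimal_sequential_clustering(data, 3)
print(cost, seps)
```

The three nested loops do on the order of k·n² segment-cost evaluations, and the two tables hold on the order of k·n entries.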

Complexity of Algorithm SCOPT

Time: O(kn^2)

Space: O(kn)

Algorithm SCGD

Initially, arbitrarily insert separators to divide the n objects into k clusters.

1 2 3 | 4 5 6 | 7 8 9

Reposition the separators by “moves” and “jumps” to reduce the cost of the clusters.

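[Figure: the sequence 1 2 3 4 5 6 7 8 9 shown twice to illustrate repositioned separators; the separator positions are not recoverable from the transcript.]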

The best possible move or jump is determined by calculating the cost reductions of all possible moves and jumps.
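The greedy repositioning loop can be sketched as follows. The slides do not define the two operations precisely, so this sketch assumes that a "move" shifts one separator by a single position and a "jump" relocates one separator to any other free position. It also recomputes cluster costs from scratch for every candidate, so it illustrates the search only and does not reproduce the running time of the published algorithm. All names and the scalar test data are hypothetical.

```python
def cluster_cost(values):
    """Sum of squared distances of scalar values from their mean."""
    c = sum(values) / len(values)
    return sum((v - c) ** 2 for v in values)


def total_cost(values, seps):
    """Total cost of the contiguous clusters cut at the sorted indices seps."""
    bounds = [0, *seps, len(values)]
    return sum(cluster_cost(values[a:b]) for a, b in zip(bounds, bounds[1:]))


def greedy_sequential_clustering(values, k):
    """Greedy separator repositioning; returns (cost, separators)."""
    n = len(values)
    if not 1 <= k <= n:
        raise ValueError("need 1 <= k <= n")
    # Arbitrary starting point: roughly equal-sized clusters.
    seps = [i * n // k for i in range(1, k)]
    cost = total_cost(values, seps)
    while True:
        best_cost, best_seps = cost, None
        for idx, old in enumerate(seps):
            others = seps[:idx] + seps[idx + 1:]
            # Assumption: a shift to an adjacent slot is a "move"; any other
            # free slot is a "jump". Both are scored the same way here.
            for new in range(1, n):
                if new == old or new in others:
                    continue
                cand = sorted(others + [new])
                c = total_cost(values, cand)
                if c < best_cost - 1e-12:
                    best_cost, best_seps = c, cand
        if best_seps is None:          # no repositioning reduces the cost
            return cost, seps
        cost, seps = best_cost, best_seps


values = [1.0, 2.0, 10.0, 11.0, 12.0, 13.0, 20.0, 21.0]
final_cost, final_seps = greedy_sequential_clustering(values, 3)
print(final_cost, final_seps)
```

On this hypothetical input the initial separators are at positions 2 and 5, and a single repositioning of the second separator is enough to reach a partition that no further repositioning improves.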

Algorithm SCGD (Cont.)

Continue repositioning separators until no further cost reductions are possible.

Complexity

– Time: O(nl/k + n), linear

– Space: O(k)

Quality of clusters increases with n and with average cluster size.

Conclusion

Sequential clustering requires that the sequence of data points be considered as well as the similarity of attributes.

Algorithms:

– SCOPT and SCGD

– SCGD approaches SCOPT in terms of quality of clusters when average cluster sizes are large.