Cluster Analysis - Texas Tech University (redwood.cs.ttu.edu/~rahewett/teach/ai16/11-Cluster.pdf)


CS 5331 by Rattikorn Hewett, Texas Tech University

Cluster Analysis

Han and Kamber, ch 8

Outline

- Overview
- Input data types
- Clustering methods
  - Partitioning Methods
  - Hierarchical Methods
  - Density-Based Methods
  - Grid-Based Methods
  - Model-Based Methods
- Outlier Analysis
- Summary


Motivation

- Given a data set in which we do not yet know what to look for:
  - The first step in identifying useful patterns is to group the data by similarity.
  - Once the data are grouped (clustered/categorized), the properties of each group can be analyzed.
- Cluster analysis
  - is a process of grouping data into meaningful classes/clusters (e.g., groups with similar measures);
  - is unsupervised learning: learning by observation rather than by labeled examples;
  - can be used
    - as a stand-alone tool to gain insight into the data distribution, or
    - as a preprocessing step for other algorithms.

Example Applications

- Bioinformatics: categorize genes with similar functions.
- Marketing: characterize customers based on purchasing patterns for targeted marketing programs.
- Land use: identify areas of similar land use from earth-observation data.
- Insurance: identify groups of motor-insurance policy holders with a high average claim cost.
- City planning: identify groups of houses according to house type, value, and geographical location.
- Earthquake studies: characterize earthquake epicenters along continental faults.
- WWW: document classification; cluster weblog data to discover groups of similar access patterns.


Evaluation and Issues

- Basic criteria for evaluating a clustering method:
  - Quality of results (depends on the similarity measure applied)
    - High-quality clusters have high intra-class similarity and low inter-class similarity.
  - Ability to discover some or all of the hidden patterns.
- The main issue in data mining is scalability:
  - Size of databases: we need efficient clustering techniques for very large, high-dimensional databases.
  - Complexity of data types and shapes: we need clustering techniques for mixed numerical and categorical data.

Requirements

- Scalability
  - Large databases
  - High dimensionality
  - Ability to deal with different types of attributes
- Performance
  - Robustness: able to deal with noise and outliers
  - Insensitivity to the order of input records
  - Discovery of clusters of arbitrary shape
- Practical aspects
  - Minimal requirements for domain knowledge to determine input parameters
  - Incorporation of user-specified constraints
  - Interpretability and usability



Typical data structures

- Memory-based clustering algorithms typically operate on two data structures:
  - Data matrix: n data instances with p attributes (a two-mode matrix; rows and columns represent different entities):

    \begin{bmatrix} x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\ \vdots & & \vdots & & \vdots \\ x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\ \vdots & & \vdots & & \vdots \\ x_{n1} & \cdots & x_{nf} & \cdots & x_{np} \end{bmatrix}

  - Dissimilarity matrix: d(i, j) = difference/dissimilarity between data instances i and j (a one-mode matrix; rows and columns represent the same entities):

    \begin{bmatrix} 0 & & & \\ d(2,1) & 0 & & \\ d(3,1) & d(3,2) & 0 & \\ \vdots & \vdots & & \\ d(n,1) & d(n,2) & \cdots & 0 \end{bmatrix}


Input data types

- Interval-scaled variables: (roughly) measurements on a continuous "linear" scale, e.g., weights, coordinates.
- Binary variables.
- Nominal, ordinal, and ratio-scaled variables:
  - Nominal: extends binary variables to more than two values, e.g., map-color = {blue, yellow, red, green}.
  - Ordinal: extends nominal variables with a meaningful order, e.g., medal = {gold, silver, bronze}.
  - Ratio-scaled: positive measurements on a nonlinear scale, e.g., growth and decay on exponential scales.
- Mixed: attributes have different data types.

Prepare input data

From a given data set, we need to:
1) standardize/normalize the data (if needed), and
2) compute the dissimilarity matrix.


Interval-scaled variables (1)

- Standardize the data: for each column (attribute) f,
  - calculate the mean absolute deviation

    s_f = \frac{1}{n}\,(|x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f|)

    where m_f = \frac{1}{n}\,(x_{1f} + x_{2f} + \cdots + x_{nf});

  - calculate the standardized measurement (z-score)

    z_{if} = \frac{x_{if} - m_f}{s_f}

- Using the mean absolute deviation is more robust than using the standard deviation: the deviations are not squared, which reduces the effect of outliers.
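This standardization can be sketched in a few lines of Python (an illustration, not part of the slides; the column values are made-up):

```python
def standardize(column):
    """Standardize one attribute column using the mean absolute deviation."""
    n = len(column)
    m = sum(column) / n                          # mean m_f
    s = sum(abs(x - m) for x in column) / n      # mean absolute deviation s_f
    return [(x - m) / s for x in column]         # z-scores z_if

# The outlier 10 inflates s_f less than it would a (squared) standard deviation.
z = standardize([1, 2, 3, 4, 10])
```

Here m_f = 4 and s_f = 2.4, so the outlier's z-score is (10 - 4)/2.4 = 2.5.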

Interval-scaled variables (2)

- Distances are normally used to measure the similarity or dissimilarity between two data objects.
- Minkowski distance:

    d(i, j) = (|x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \cdots + |x_{ip} - x_{jp}|^q)^{1/q}

  where i = (x_{i1}, x_{i2}, ..., x_{ip}) and j = (x_{j1}, x_{j2}, ..., x_{jp}) are two p-dimensional data objects and q is a positive integer.
- If q = 1, d is the Manhattan distance; if q = 2, d is the Euclidean distance.
- Weighted Euclidean distance:

    d(i, j) = \sqrt{w_1 |x_{i1} - x_{j1}|^2 + w_2 |x_{i2} - x_{j2}|^2 + \cdots + w_p |x_{ip} - x_{jp}|^2}
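A minimal Python sketch of the Minkowski distance (illustrative; the example points are made-up):

```python
def minkowski(x, y, q):
    """Minkowski distance between two p-dimensional points x and y."""
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)

manhattan = minkowski((0, 0), (3, 4), q=1)   # q = 1: Manhattan distance
euclidean = minkowski((0, 0), (3, 4), q=2)   # q = 2: Euclidean distance
```

For the points (0, 0) and (3, 4) this gives 7 for q = 1 and 5 for q = 2.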


Binary Variables

Computing the dissimilarity matrix:

- A contingency table for binary data, counting how the attribute values of instances i and j pair up:

                       instance j
                      1     0    sum
      instance i  1   a     b    a+b
                  0   c     d    c+d
                sum  a+c   b+d    p

  Note: b and c count the 1-0 and 0-1 mismatches; d counts the 0-0 matches.

- Simple matching coefficient, for symmetric binary variables (values 0 and 1 are equally informative, e.g., gender with 0 = male, 1 = female):

    d(i, j) = \frac{b + c}{a + b + c + d}

- Jaccard coefficient, for asymmetric binary variables (outcome 1 is more important than 0, e.g., 1 = HIV positive, 0 = HIV negative):

    d(i, j) = \frac{b + c}{a + b + c}

Example

- Gender is a symmetric attribute; the remaining attributes are asymmetric binary.
- Let the values Y and P be set to 1, and the value N be set to 0.

    Name  Gender  Fever  Cough  Test-1  Test-2  Test-3  Test-4
    Jack  M       Y      N      P       N       N       N
    Mary  F       Y      N      P       N       P       N
    Jim   M       Y      P      N       N       N       N

- Using the Jaccard coefficient on the asymmetric attributes:

    d(jack, mary) = (0 + 1)/(2 + 0 + 1) = 0.33
    d(jack, jim)  = (1 + 1)/(1 + 1 + 1) = 0.67
    d(jim, mary)  = (1 + 2)/(1 + 1 + 2) = 0.75
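The example's Jaccard dissimilarities can be reproduced in Python (an illustrative sketch; the 0/1 encodings follow the slide's Y/P = 1, N = 0 convention):

```python
def jaccard_dissim(i, j):
    """Jaccard dissimilarity d = (b + c) / (a + b + c) for asymmetric binary vectors."""
    a = sum(1 for x, y in zip(i, j) if x == 1 and y == 1)  # 1-1 matches
    b = sum(1 for x, y in zip(i, j) if x == 1 and y == 0)  # 1-0 mismatches
    c = sum(1 for x, y in zip(i, j) if x == 0 and y == 1)  # 0-1 mismatches
    return (b + c) / (a + b + c)

# Asymmetric attributes (Fever, Cough, Test-1 .. Test-4):
jack = [1, 0, 1, 0, 0, 0]
mary = [1, 0, 1, 0, 1, 0]
jim  = [1, 1, 0, 0, 0, 0]
```

Evaluating the three pairs reproduces the slide's 0.33, 0.67, and 0.75.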


Nominal Variables

- A generalization of the binary variable in that it can take more than two states, e.g., red, yellow, blue, green.

Computing the dissimilarity matrix:

- Method 1: simple matching

    d(i, j) = \frac{p - m}{p}

  where m = the number of matches (variables on which instances i and j are in the same state) and p = the total number of variables.
- Method 2: use a large number of binary variables; create a new binary variable for each of the nominal states.

Ordinal Variables

- An ordinal variable can be discrete or continuous; order is important, e.g., rank.
- It can be treated like an interval-scaled variable:
  - Let x_{if} be the value of variable f for the i-th instance.
  - Replace x_{if} by its rank r_{if} ∈ {1, ..., M_f}.
  - Normalize each variable so that its range falls in [0, 1] by replacing r_{if} with

      z_{if} = \frac{r_{if} - 1}{M_f - 1}

  - Compute the dissimilarity using the methods for interval-scaled variables.
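The rank-and-normalize step can be sketched in Python (illustrative; the medal example follows the earlier input-data-types slide):

```python
def normalize_ordinal(values, order):
    """Map ordinal values to ranks 1..M_f, then normalize onto [0, 1]."""
    rank = {state: r + 1 for r, state in enumerate(order)}  # r_if in {1, ..., M_f}
    M = len(order)
    return [(rank[v] - 1) / (M - 1) for v in values]

# medal = {gold, silver, bronze}, ordered best to worst:
z = normalize_ordinal(["gold", "bronze", "silver"], order=["gold", "silver", "bronze"])
```

With M_f = 3, the three states map to 0, 1, and 0.5 respectively.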


Ratio-Scaled Variables

- Ratio-scaled variable: a positive measurement on a nonlinear, approximately exponential scale, e.g., Ae^{Bt} or Ae^{-Bt}.
- Methods:
  - Treat them like interval-scaled variables: not a good choice, because the scale is distorted.
  - Apply a logarithmic transformation, y_{if} = log(x_{if}), then use the techniques for interval-scaled variables.
  - Treat x_{if} as continuous ordinal data and treat the ranks as interval-scaled values (recommended).

Mixed Variable Types

- A database may contain all six variable types: symmetric binary, asymmetric binary, nominal, ordinal, interval, and ratio.
- One may use a weighted formula to combine their effects:

    d(i, j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}

  - δ_{ij}^{(f)} = 0 if either (1) x_{if} or x_{jf} is missing, or (2) f is asymmetric binary and x_{if} = x_{jf} = 0; otherwise δ_{ij}^{(f)} = 1.
  - d_{ij}^{(f)} is computed according to the data type of f.
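A Python sketch of the weighted combination (illustrative only; the attribute kinds, the range-based normalization for interval attributes, and the example values are assumptions, not from the slides):

```python
def mixed_dissim(i, j, types, ranges):
    """Combined dissimilarity over mixed attribute types (a sketch).

    types:  per-attribute kind: 'interval', 'nominal', or 'asym_binary'
    ranges: per-attribute (max - min), used to normalize interval attributes
    """
    num, den = 0.0, 0.0
    for f, kind in enumerate(types):
        xif, xjf = i[f], j[f]
        if xif is None or xjf is None:                  # missing value: delta = 0
            continue
        if kind == "asym_binary" and xif == xjf == 0:   # 0-0 match carries no info
            continue
        if kind == "interval":
            d_f = abs(xif - xjf) / ranges[f]            # normalized to [0, 1]
        else:                                           # nominal or asym_binary
            d_f = 0.0 if xif == xjf else 1.0
        num += d_f                                      # delta = 1 here
        den += 1.0
    return num / den

d = mixed_dissim((2.0, "red", 0), (6.0, "red", 0),
                 types=("interval", "nominal", "asym_binary"),
                 ranges=(10.0, None, None))
```

In this example the asymmetric 0-0 attribute is skipped, so d = (0.4 + 0.0) / 2 = 0.2.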


Categories of Clustering Methods

- Partitioning algorithms: construct various partitions and then evaluate them by some criterion.
- Hierarchical algorithms: create a hierarchical decomposition of the set of data (or objects) using some criterion.
- Density-based methods: based on connectivity and density functions.
- Grid-based methods: based on a multiple-level granularity structure.
- Model-based methods: a model is hypothesized for each of the clusters, and the idea is to find the best fit of the data to that model.


Partitioning Methods

- Given:
  - a database of n objects, and
  - k = the number of clusters to form, k ≤ n.
- Goal: find k partitions of the database that optimize a "similarity function", in the sense that
  - objects in different clusters/partitions are "dissimilar", and
  - objects in the same partition are "similar".
- Basic idea:
  - Create initial partitions.
  - Iterative relocation: move objects from one group to another to improve the objective function (e.g., minimize error).

Partitioning Methods (cont)

- Approaches:
  - Global optimum: exhaustively enumerate all partitions.
  - Heuristic methods:
    - Basic idea: pick a reference point for each cluster, then assign data objects so as to minimize the sum of dissimilarities between each object and the reference point of its own cluster.
    - k-means: each cluster is represented by the center of the cluster.
    - k-medoids: each cluster is represented by one of the objects in the cluster.


The k-Means Method

- Input: k = the number of clusters, and a database of n objects.
- Output: a set of k clusters that minimizes the squared error.
- Algorithm (sketch):
  - Randomly pick k objects as cluster centers.
  - Repeat:
    - Assign each object to the cluster with the nearest center.
    - Update the center (mean value of the objects in the cluster) of each cluster.
  - Until no change (i.e., the squared error converges).
- Squared error:

    E = \sum_{i=1}^{k} \sum_{p \in C_i} |p - m_i|^2

  where m_i is the mean of cluster C_i and p is an object point.
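The sketched algorithm can be written in a few lines of Python (illustrative; the initial centers are passed in explicitly, rather than picked at random, to keep the example deterministic):

```python
def kmeans(points, centers, iters=100):
    """k-means sketch: alternate assignment and mean-update until convergence."""
    clusters = [[] for _ in centers]
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            best = min(range(len(centers)),
                       key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[best].append(p)
        # Update step: each center becomes the mean of its cluster.
        new_centers = [tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else centers[c]
                       for c, cl in enumerate(clusters)]
        if new_centers == centers:   # no change: the squared error has converged
            break
        centers = new_centers
    return centers, clusters

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, clusters = kmeans(pts, centers=[(0, 0), (10, 10)])
```

On this toy data the two tight groups are separated, with means (1/3, 1/3) and (31/3, 31/3).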

The k-Means Method (illustration)

[Figure: k-means on 2-D points with K = 2. Arbitrarily choose K objects as the initial cluster centers; assign each object to the most similar center; update the cluster means; reassign objects; update the cluster means again, repeating until assignments no longer change.]


Advantages and Disadvantages

- Strengths:
  - Relatively efficient: O(tkn), where n is the number of objects, k the number of clusters, and t the number of iterations; normally k, t << n.
  - Compare other clustering methods, e.g., PAM: O(k(n-k)^2) per iteration.
- Weaknesses:
  - Applicable only when the mean is defined; what about categorical data?
  - The number of clusters, k, must be specified in advance.
  - The mean is sensitive to noisy data and outliers.
  - Not suitable for discovering clusters with non-convex shapes.
  - Often terminates at a local optimum; the global optimum may be found using techniques such as deterministic annealing and genetic algorithms.

Variations of the k-Means Method

- Variants of k-means differ in:
  - selection of the initial k means,
  - dissimilarity calculations, and
  - strategies for calculating cluster means.
- k-modes method: handles categorical data [Huang '98]
  - Replaces the means of clusters with modes.
  - Uses a frequency-based method to update the modes of clusters.
- k-prototypes method: integrates k-means and k-modes to handle mixed data types.
- EM (Expectation-Maximization): extends k-means by assigning each object to a cluster according to a weight representing its probability of membership.
- Scalable k-means: discards an object once its cluster membership is ascertained.


Problems with the k-Means Method

- The k-means algorithm is sensitive to outliers, since an object with an extremely large value can substantially distort the distribution of the data.
- k-medoids: instead of taking the mean value of the objects in a cluster as a reference point, use a medoid, the most centrally located object in the cluster.
- In short, k-means is centroid-based, while k-medoids is based on the most central object.

The k-Medoids Method

- Input: k = the number of clusters, and a database of n objects.
- Output: a set of k clusters that minimizes the sum of the dissimilarities of all objects to their nearest medoid (most centrally located object).
- Algorithm (sketch):
  - Randomly pick k objects as the medoids of k clusters.
  - Repeat:
    - Assign each remaining object to the cluster with the nearest medoid.
    - Randomly select a non-medoid object, O_random.
    - Compute the cost of swapping a medoid object O_j with O_random.
    - If the cost is < 0, swap (i.e., O_random becomes a medoid and O_j does not).
  - Until no change.
- Cost = (squared error with the swap) - (squared error without the swap); cost < 0 means the swap reduces the squared error.
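The swap-cost test can be sketched in Python (illustrative; 1-D points with absolute-difference dissimilarity are an assumption made for brevity):

```python
def total_cost(points, medoids):
    """Sum of dissimilarities of every object to its nearest medoid (1-D example)."""
    return sum(min(abs(p - m) for m in medoids) for p in points)

points = [0, 1, 2, 3, 10]
current = {1, 10}
cost_now = total_cost(points, current)               # 1 + 0 + 1 + 2 + 0 = 4

# Cost of swapping medoid 10 with the non-medoid 2:
swapped = {1, 2}
swap_cost = total_cost(points, swapped) - cost_now   # positive, so keep the medoids
```

Because the swap cost is positive (the swap would raise the total dissimilarity from 4 to 10), the current medoids are kept.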


Variations of the k-Medoids Methods

- PAM (Partitioning Around Medoids) [Kaufman and Rousseeuw, '87]
  - Randomly select an initial set of medoids.
  - Iteratively replace a medoid with the non-medoid that gives the greatest error reduction.
- PAM is more robust than k-means in the presence of noise and outliers, because a medoid is less influenced by outliers or other extreme values than a mean.
- PAM works effectively for small data sets but does not scale well to large data sets: O(k(n-k)^2) per iteration, where n is the number of data objects and k the number of clusters.

Variations of the k-Medoids Method (cont)

- CLARA (Clustering LARge Applications) [Kaufmann & Rousseeuw, '90]
  - Draws multiple samples of the data set, applies PAM to each sample, and returns the best clustering as the output.
  - Strength: deals with larger data sets than PAM.
  - Weaknesses:
    - Efficiency depends on the sample size.
    - A good clustering based on samples will not necessarily represent a good clustering of the whole data set if the samples are biased.
- CLARANS (Clustering Large Applications based on RANdomized search) [Ng & Han, '94]
  - Combines CLARA's idea with randomized search to avoid sampling bias.
  - CLARANS does not confine the medoid search to a fixed sample; it randomly selects nodes to search after a local optimum is found.
  - More efficient than PAM and CLARA.


Hierarchical Clustering Method

- Agglomerative methods: bottom-up (merge iteratively)
  - Place each object in its own cluster, then merge clusters into larger clusters until all objects are in a single cluster or a termination condition holds.
  - Most hierarchical methods are of this kind; they differ in their definition of "inter-cluster" similarity.
  - Example: AGNES (AGglomerative NESting)
- Divisive methods: top-down (divide iteratively)
  - Start with all objects in one cluster, then subdivide the cluster into smaller pieces until termination (e.g., a specified number of clusters is reached, or the two closest clusters are close enough).
  - Example: DIANA (DIvisive ANAlysis)


Hierarchical Methods

- Needs a termination condition, which need not be the number of clusters given as input.
- Uses the distance matrix as the clustering criterion.

[Figure: on objects a, b, c, d, e, agglomerative clustering (steps 0-4) merges a with b, then d with e, then c with {d, e}, and finally {a, b} with {c, d, e}; divisive clustering runs the same steps in reverse, from {a, b, c, d, e} down to single objects.]

Measures of distance between clusters

Let d(p, p') be the distance between points p and p'; let p_i and p_j be points in clusters C_i and C_j, respectively; let n_i be the number of points in cluster C_i; and let m_i be the mean of the points in cluster C_i.

- Minimum distance: d_min(C_i, C_j) = \min_{p_i, p_j} d(p_i, p_j)
- Maximum distance: d_max(C_i, C_j) = \max_{p_i, p_j} d(p_i, p_j)
- Mean distance: d_mean(C_i, C_j) = d(m_i, m_j)
- Average distance: d_avg(C_i, C_j) = \frac{1}{n_i n_j} \sum_{p_i} \sum_{p_j} d(p_i, p_j)
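These four measures can be sketched in Python using Euclidean point distance (illustrative; the example clusters are made-up):

```python
from math import dist  # Euclidean distance between two points (Python 3.8+)

def d_min(Ci, Cj):
    return min(dist(p, q) for p in Ci for q in Cj)

def d_max(Ci, Cj):
    return max(dist(p, q) for p in Ci for q in Cj)

def d_avg(Ci, Cj):
    return sum(dist(p, q) for p in Ci for q in Cj) / (len(Ci) * len(Cj))

def d_mean(Ci, Cj):
    mi = tuple(sum(col) / len(Ci) for col in zip(*Ci))  # mean of cluster Ci
    mj = tuple(sum(col) / len(Cj) for col in zip(*Cj))  # mean of cluster Cj
    return dist(mi, mj)

Ci, Cj = [(0, 0)], [(3, 4), (6, 8)]
```

For these clusters, d_min = 5, d_max = 10, and both d_avg and d_mean happen to equal 7.5.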


AGNES & DIANA

- Introduced in [Kaufmann and Rousseeuw, '90]; implemented in statistical analysis packages, e.g., S+.
- Use the single-link method (the similarity between two clusters is the similarity between the closest pair of points, one from each cluster) and the dissimilarity matrix.
- AGNES:
  - Merge the nodes that have the least dissimilarity.
  - Continue in a non-descending fashion.
  - Eventually all nodes belong to the same cluster.
- DIANA performs in the inverse order of AGNES.

[Figure: three scatter plots of the same 2-D data set, showing AGNES merging the points into progressively larger clusters.]

Issues

- Major weaknesses:
  - Hierarchical methods do not scale well: agglomerative clustering has time complexity of at least O(n^2), where n is the total number of objects.
  - The process is irreversible: a merge or split, once done, can never be undone.
- To improve on these drawbacks:
  - Integrate hierarchical clustering with iterative relocation:
    - BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) [Zhang, Ramakrishnan & Livny, '96]
  - Careful analysis of object "linkages" at each level of the hierarchy:
    - CURE (Clustering Using REpresentatives) [Guha, Rastogi & Shim, '98]
    - ROCK (RObust Clustering using linKs) [Guha, Rastogi & Shim, '99]
    - CHAMELEON [Karypis, Han & Kumar, '99]


BIRCH

- Basic idea: incrementally construct a CF (Clustering Feature) tree, a hierarchical data structure for multiphase clustering.
  - Phase 1: scan the database to build an initial in-memory CF tree that preserves the inherent hierarchical structure of the data (similar to building a B+-tree).
  - Phase 2: apply a (selected) clustering algorithm to cluster the leaf nodes of the CF tree.

Clustering Feature: CF = (N, LS, SS)

- N: the number of data points.
- LS: the linear sum \sum_{i=1}^{N} X_i.
- SS: the square sum \sum_{i=1}^{N} X_i^2.

A clustering feature summarizes the statistics (the 0th, 1st, and 2nd moments) of a sub-cluster.

Example: for the sub-cluster of points (3,4), (2,6), (4,5), (4,7), (3,8),

    CF = (5, (16, 30), (54, 190))
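The CF triple for the example sub-cluster can be checked in Python (an illustrative sketch; here SS is kept per dimension, matching the slide's (54, 190)):

```python
def clustering_feature(points):
    """CF = (N, LS, SS) for a sub-cluster of d-dimensional points."""
    N = len(points)
    LS = tuple(sum(col) for col in zip(*points))                  # linear sum per dimension
    SS = tuple(sum(x * x for x in col) for col in zip(*points))   # square sum per dimension
    return N, LS, SS

cf = clustering_feature([(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)])
```

This reproduces the slide's CF = (5, (16, 30), (54, 190)). CFs are additive, which is what lets BIRCH merge sub-clusters incrementally without rescanning their points.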


BIRCH (cont)

- Scales linearly: finds a good clustering with a single scan and improves the quality with a few additional scans.
- Weaknesses:
  - Handles only numeric data.
  - Sensitive to the order of the data records.
  - Uses notions of diameter for clustering, so it does not perform well on clusters with non-spherical shapes.

CURE

- Basic idea:
  - Use multiple representative points for each cluster, so the method adjusts well to non-spherical shapes.
  - Shrink the representative points toward the center of the cluster, to dampen the effect of outliers.
- Scales well without sacrificing clustering quality.
- Does not handle categorical data, and the results are significantly affected by the parameter settings.


ROCK

- Basic idea:
  - Construct a sparse graph from a similarity matrix using the concept of "interconnectivity".
  - Perform hierarchical clustering on the sparse graph.
- Similarity measures are not distance-based.
- Cluster "similarity" is based on the number of points from different clusters that have neighbors in common.
- Suitable for clustering categorical data.

CHAMELEON

- Hierarchical clustering based on:
  - k-nearest neighbors, and
  - dynamic modeling.
- Measures similarity based on interconnectivity and closeness (proximity):
  - If these measures between two clusters are high relative to the internal measures within the clusters, merge the clusters.
  - The merge is based on a dynamic model that adapts to the internal structures of the clusters being merged.
- By contrast, CURE ignores the interconnectivity of objects, and ROCK ignores the closeness of clusters.


Density-Based Clustering Methods

- Clustering based on density (a local cluster criterion), such as density-connected points.
- Major features:
  - Discovers clusters of arbitrary shape.
  - Handles noise.
  - Requires one scan.
  - Needs density parameters as a termination condition.
- Several interesting studies:
  - DBSCAN [Ester et al., '96]
  - OPTICS [Ankerst et al., '99]
  - DENCLUE [Hinneburg & Keim, '98]
  - CLIQUE [Agrawal et al., '98]


Terminology

- ε-neighborhood of an object p: N_ε(p) = {q | dist(p, q) ≤ ε}.
- q is a core object if N_ε(q) contains at least MinPts objects, a specified minimum number.
- p is directly density-reachable from q if
  - p is in N_ε(q), and
  - q is a core object.
- Example: with MinPts = 5 and ε = 1 cm, p is directly density-reachable from q when q's ε-neighborhood contains at least 5 points, one of which is p.
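The ε-neighborhood and core-object test can be sketched in Python (illustrative; the 1-D points and the parameter values are assumptions, not from the slides):

```python
def eps_neighborhood(points, p, eps):
    """N_eps(p): all points within distance eps of p (1-D for brevity; p itself included)."""
    return [q for q in points if abs(p - q) <= eps]

def is_core(points, q, eps, min_pts):
    """q is a core object if its eps-neighborhood holds at least MinPts objects."""
    return len(eps_neighborhood(points, q, eps)) >= min_pts

pts = [0.0, 0.5, 1.0, 1.5, 2.0, 10.0]
```

Here, with eps = 1 and MinPts = 3, the point 1.0 is a core object (five points lie within distance 1), while the isolated point 10.0 is not.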

Additional terms

- Density-reachable: p is density-reachable from q if there is a chain of points p_1, ..., p_n with p_1 = q and p_n = p such that each p_{i+1} is directly density-reachable from p_i with respect to ε and MinPts.
- Density-connected: a point p is density-connected to a point q if there is a point o such that both p and q are density-reachable from o with respect to ε and MinPts.
- Density-reachability is transitive; density-connectedness is symmetric.


Density-based Clustering

- DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
  - Grows regions of sufficiently high density into clusters.
  - A cluster is a maximal set of density-connected points with respect to density-reachability.
  - Objects not in any cluster are considered "noise".
  - Requires input parameters ε and MinPts.
- OPTICS (Ordering Points To Identify the Clustering Structure)
  - Computes a cluster ordering for automatic and interactive analysis.
  - The ordering represents the density-based clustering structure: objects are selected so that clusters with a small radius ε (high density) are finished first.

Density-based Clustering (cont)

- DENCLUE (DENsity-based CLUstEring)
  - Solid mathematical foundation: uses an influence function.
  - Good for data sets with large amounts of noise.
  - Allows a compact mathematical description of arbitrarily shaped clusters in high-dimensional data sets.
  - Significantly faster than existing algorithms (faster than DBSCAN by a factor of up to 45).
  - But needs a large number of parameters.


Grid-Based Clustering Method

- Basic steps:
  - Quantize the object space into a finite number of grid cells.
  - Perform clustering on the grid structure.
- Examples:
  - STING [Wang et al., '97]: a grid-based method using statistical information stored in grid cells to answer queries.
  - WaveCluster [Sheikholeslami et al., '98] and CLIQUE [Agrawal et al., '98]: both grid-based and density-based.


STING

- Each cell
  - at a high level is partitioned into smaller cells at the next lower level, and
  - has precomputed statistical information, e.g., mean, max, min, distribution type.
- Query answering is a top-down process:
  - Determine the highest layer at which to start.
  - For each cell in the layer:
    - Compute a confidence interval reflecting the cell's relevance to the query.
    - Remove irrelevant cells.
    - Move to the next lower level.
  - Repeat until the query specification is met, then return the region of relevant cells.
- Pros: fast and easy to parallelize. Cons: cluster boundaries are only horizontal or vertical.

WaveCluster

- A multi-resolution clustering approach:
  - Impose a grid structure on the data space.
  - Apply a wavelet transform to the feature space (from the grid-cell information):
    - decomposes the signal into different frequency sub-bands;
    - preserves the relative distances between objects at different resolutions.
  - Cluster by finding dense regions in the transformed space.
- Both grid-based and density-based, with input parameters:
  - the number of grid cells for each dimension, and
  - the wavelet and the number of applications of the wavelet transform.
- Advantages:
  - Scales to large databases: O(n).
  - Does not require the number of clusters or a neighborhood radius as input.
  - Effective removal of outliers.
  - Detects arbitrarily shaped clusters at different scales.
  - Not sensitive to noise or to the input order.


Transformation

The wavelet transform decomposes the data into four frequency sub-bands, at high, medium, and low resolution, focusing on:
- the mean neighborhood of each point,
- horizontal edges,
- vertical edges, and
- corners.

CLIQUE

- Grid-based and density-based, using the Apriori principle:
  - Partition the space into a grid structure.
  - A cluster is a maximal set of connected dense units.
  - Narrowing the search space: a k-dimensional candidate dense unit cannot be dense if its (k-1)-dimensional projection unit is not dense.
  - Determine a minimal cover of each cluster.
- Pros:
  - Scales well; good for clustering high-dimensional data in large databases.
  - Insensitive to the order of input.
- Cons:
  - The accuracy of the clustering may be low, a price paid for the simplicity of the method.


Model-Based Clustering Methods

- Goal: fit the data to some mathematical model.
- Statistical approach (conceptual learning):
  - COBWEB
  - CLASSIT
  - AutoClass
- Neural network approach:
  - SOM (Self-Organizing Map)


Statistical Approach

- Conceptual learning: cluster (group similar objects) and characterize (generalize and simplify) concepts.
- COBWEB [Fisher, '87]
  - Builds a classification tree in which each node (not branch) has a label.
  - Uses a heuristic evaluation function called category utility, which
    - represents the difference in the expected number of attribute values when a hypothesized category is used versus not used,
    - rewards intra-class similarity and inter-class dissimilarity, and
    - is based on the assumption that attributes are independent.
  - Cons:
    - the independence assumption;
    - expensive to compute and update the probability distributions;
    - the tree is not height-balanced, which is bad for skewed data.
- CLASSIT: an extension of COBWEB for continuous data, with similar problems.
- AutoClass: uses Bayesian statistical analysis; popular in industry.

58

Neural Net Approach

n  Competitive learning ¨  Hierarchical architecture of neuron units where

n  Each unit in a given layer takes inputs from ALL units from previous layer n  Units within a cluster in a layer compete – pick one that best corresponds

to output from previous layer n  The winning unit adjust weights on its connections

¨  Cluster ~ mapping of low-level features to high-level features

n  SOM (Self-Organizing Feature Maps) ¨  The unit whose weight vector is closest to the current object becomes the winner ¨  More details in the reference papers
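The winner-take-all step can be sketched in a few lines. This is a bare-bones illustration of competitive learning, assuming a fixed learning rate and no neighborhood function; a real SOM also updates the winner's map neighbors and decays the learning rate over time:

```python
# Minimal competitive-learning sketch: the closest unit wins and pulls
# its weight vector toward the input (illustrative, not a full SOM).
import random

def train_som(data, n_units, dim, epochs=50, lr=0.5, seed=0):
    rng = random.Random(seed)
    weights = [[rng.random() for _ in range(dim)] for _ in range(n_units)]
    for _ in range(epochs):
        for x in data:
            # competition: the unit whose weight vector is closest wins
            winner = min(range(n_units),
                         key=lambda u: sum((weights[u][i] - x[i]) ** 2
                                           for i in range(dim)))
            # only the winning unit adjusts the weights on its connections
            for i in range(dim):
                weights[winner][i] += lr * (x[i] - weights[winner][i])
    return weights

points = [(0.1, 0.1), (0.2, 0.1), (0.9, 0.8), (0.8, 0.9)]
units = train_som(points, n_units=2, dim=2)
```

After training, each unit's weight vector sits near one of the two groups of points, so the mapping from objects to winning units acts as the clustering.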


Outlier Discovery

n  Outliers: data objects that ¨  do not comply with the general behavior of the data ¨  are considerably dissimilar to (inconsistent with) the remaining data

n  Generally, data mining algorithms try to minimize the effect of outliers. But … “one person’s noise could be another person’s signal”

n  Example Applications: ¨  Credit card fraud detection ¨  Telecom fraud detection ¨  Marketing customized to purchasers with extreme incomes ¨  Finding unusual responses to medical treatments

n  Outlier Discovery: Find top n outlier points


Approaches

n  Statistical-based ¨  Assume a data distribution model (e.g., normal) ¨  Apply hypothesis testing to see if objects belong to the distribution ¨  Drawbacks:

n  Most tests are for single attributes n  Data distribution may not be known

n  Distance-based ¨  Outliers – objects that do not have enough neighbors within a given distance ¨  Various algorithms: index-based, nested-loop, cell-based ¨  Drawbacks:

n  Does not scale well for high dimensions n  Requires experimentation for input parameters
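Both approaches can be caricatured in a few lines. The sketch below is illustrative only: the z-score threshold and the neighbor fraction are made-up parameters, and real distance-based methods use the index-based, nested-loop, or cell-based algorithms mentioned above rather than this brute-force double loop:

```python
# Illustrative sketches of statistical-based and distance-based
# outlier detection (toy parameters, brute-force search).
import statistics

def zscore_outliers(values, threshold=3.0):
    """Statistical-based: flag values far from the mean, assuming the
    data roughly follow a normal distribution (single attribute only)."""
    mu = statistics.mean(values)
    sigma = statistics.stdev(values)
    return [v for v in values if abs(v - mu) / sigma > threshold]

def distance_outliers(points, d, frac):
    """Distance-based: an object is an outlier if fewer than frac of
    the other objects lie within distance d of it."""
    out = []
    for p in points:
        neighbors = sum(1 for q in points if q is not p and
                        sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5 <= d)
        if neighbors < frac * (len(points) - 1):
            out.append(p)
    return out

cloud = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
```

The toy code also makes the drawbacks concrete: the z-score test handles one attribute at a time and presumes a distribution, while the distance test needs `d` and `frac` to be tuned by experimentation and its pairwise distances degrade in high dimensions.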


Approaches (contd)

n  Deviation-based ¨  Outliers – objects that deviate from the main characteristics of the data ¨  Example techniques:

n  Sequential exception technique n  OLAP data cube technique (explore regions of anomalies – ch 2)

¨  Sequential exception technique n  Given set S of n objects, build a sequence of subsets:

S1 ⊂ S2 ⊂ … ⊂ Sm, where m ≤ n n  For each subset, test its similarity with the preceding subset in the sequence n  Find the smallest subset whose removal results in the greatest reduction of dissimilarity in the residual set – called the “exception set” – its objects are the outliers
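The sequential-exception idea can be caricatured with variance as the dissimilarity function. This is a drastic simplification for illustration only: the actual technique builds the sequence of subsets and uses a smoothing factor, whereas the sketch below just asks which single value's removal most reduces the dissimilarity of the residual set:

```python
# Toy version of the deviation-based idea: the value whose removal most
# reduces the variance (dissimilarity) of what remains is the exception.
import statistics

def exception_set(values):
    base = statistics.pvariance(values)
    best_gain, best = 0.0, []
    for v in set(values):
        rest = [x for x in values if x != v]
        if len(rest) >= 2:
            gain = base - statistics.pvariance(rest)
            if gain > best_gain:
                best_gain, best = gain, [v]
    return best  # empty when no removal reduces dissimilarity
```

Removing 100 from [2, 3, 2, 3, 100] collapses the variance of the residual set, so 100 is flagged; a uniform set yields no exception.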


Problems and Challenges

n  Considerable progress has been made in scalable clustering methods ¨  Partitioning: k-means, k-medoids, CLARANS ¨  Hierarchical: BIRCH, CURE ¨  Density-based: DBSCAN, CLIQUE, OPTICS ¨  Grid-based: STING, WaveCluster ¨  Model-based: Autoclass, Denclue, Cobweb

n  Current clustering techniques do not address all the requirements adequately

n  Constraint-based clustering analysis: Constraints exist in the data space (e.g., bridges and highways) or in user queries


Constraint-Based Clustering Analysis

n  Fewer parameters but more user-desired constraints, e.g., an ATM allocation problem


Clustering With Obstacle Objects

[Figure: clustering results taking obstacles into account vs. not taking obstacles into account]


Summary

n  Cluster analysis groups objects based on their similarity and has wide applications

n  Measure of similarity can be computed for various types of data

n  Clustering algorithms can be categorized into partitioning methods, hierarchical methods, density-based methods, grid-based methods, and model-based methods

n  Outlier detection and analysis are very useful for fraud detection, etc., and can be performed by statistical, distance-based, or deviation-based approaches

n  There are still lots of research issues on cluster analysis, such as constraint-based clustering


References n  R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic subspace clustering of high dimensional data for data mining applications. SIGMOD'98.

n  M. R. Anderberg. Cluster Analysis for Applications. Academic Press, 1973.

n  M. Ankerst, M. Breunig, H.-P. Kriegel, and J. Sander. OPTICS: Ordering points to identify the clustering structure. SIGMOD'99.

n  P. Arabie, L. J. Hubert, and G. De Soete. Clustering and Classification. World Scientific, 1996.

n  M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases. KDD'96.

n  M. Ester, H.-P. Kriegel, and X. Xu. Knowledge discovery in large spatial databases: Focusing techniques for efficient class identification. SSD'95.

n  D. Fisher. Knowledge acquisition via incremental conceptual clustering. Machine Learning, 2:139-172, 1987.

n  D. Gibson, J. Kleinberg, and P. Raghavan. Clustering categorical data: An approach based on dynamical systems. VLDB'98.

n  S. Guha, R. Rastogi, and K. Shim. CURE: An efficient clustering algorithm for large databases. SIGMOD'98.

n  A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice Hall, 1988.


References (cont) n  L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley & Sons, 1990.

n  E. Knorr and R. Ng. Algorithms for mining distance-based outliers in large datasets. VLDB'98.

n  G. J. McLachlan and K. E. Basford. Mixture Models: Inference and Applications to Clustering. John Wiley & Sons, 1988.

n  P. Michaud. Clustering techniques. Future Generation Computer Systems, 13, 1997.

n  R. Ng and J. Han. Efficient and effective clustering method for spatial data mining. VLDB'94.

n  E. Schikuta. Grid clustering: An efficient hierarchical clustering method for very large data sets. Proc. 1996 Int. Conf. on Pattern Recognition, 101-105.

n  G. Sheikholeslami, S. Chatterjee, and A. Zhang. WaveCluster: A multi-resolution clustering approach for very large spatial databases. VLDB'98.

n  W. Wang, J. Yang, and R. Muntz. STING: A statistical information grid approach to spatial data mining. VLDB'97.

n  T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: An efficient data clustering method for very large databases. SIGMOD'96.