Clustering: Hierarchical Methods


Transcript of "Clustering: Hierarchical Methods"

  • 8/10/2019 Clustering Hierarchical

    1/31

    ClusteringLecture 3: Hierarchical Methods

    Jing GaoSUNY Buffalo

    1

  • 8/10/2019 Clustering Hierarchical

    2/31

    Outline

    Basics

    Motivation, definition, evaluation

    Methods

    Partitional

    Hierarchical

    Density-based

    Mixture model

    Spectral methods

    Advanced topics

    Clustering ensemble

    Clustering in MapReduce

    Semi-supervised clustering, subspace clustering, co-clustering, etc.

  • Hierarchical Clustering

    Agglomerative approach

    [Figure: bottom-up merging of objects a, b, c, d, e; Step 0: singletons; then a b; d e; c d e; finally a b c d e at Step 4]

    Initialization: each object is a cluster

    Iteration: merge the two clusters that are most similar to each other

    Until all objects are merged into a single cluster
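    As a concrete illustration, the following is a minimal sketch of this bottom-up process using SciPy's agglomerative routines; the five 2-D points standing in for a, b, c, d, e are made up for the example.

    import numpy as np
    from scipy.cluster.hierarchy import linkage

    # Five hypothetical 2-D objects standing in for a, b, c, d, e
    points = np.array([[0.0, 0.0],   # a
                       [0.0, 1.0],   # b
                       [4.0, 4.0],   # c
                       [6.0, 0.0],   # d
                       [6.0, 1.0]])  # e

    # Bottom-up clustering: each row of Z records one merge
    # (ids of the two merged clusters, merge distance, new cluster size)
    Z = linkage(points, method='single')
    print(Z)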

  • 8/10/2019 Clustering Hierarchical

    4/31

    4

    Hierarchical Clustering

    Divisive Approaches

    [Figure: top-down splitting of a b c d e; Step 0: one cluster a b c d e; split into a b and c d e; then d e; singletons at Step 4]

    Initialization: all objects stay in one cluster

    Iteration: select a cluster and split it into two sub-clusters

    Until each leaf cluster contains only one object
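    The slides do not fix a particular splitting rule; one common way to realize the split step is bisecting K-means. A minimal sketch under that assumption, using scikit-learn:

    import numpy as np
    from sklearn.cluster import KMeans

    def divisive(points):
        """Recursively split clusters top-down until each leaf is a singleton."""
        if len(points) <= 1:
            return [points]
        # Split the selected cluster into two sub-clusters (bisecting K-means)
        labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(points)
        left, right = points[labels == 0], points[labels == 1]
        return divisive(left) + divisive(right)

    data = np.array([[0.0, 0.0], [0.0, 1.0], [4.0, 4.0], [6.0, 0.0], [6.0, 1.0]])
    for leaf in divisive(data):
        print(leaf)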

  • 8/10/2019 Clustering Hierarchical

    5/31

    5

    Dendrogram

    A tree that shows how clusters are merged/split hierarchically

    Each node of the tree is a cluster; each leaf node is a singleton cluster

  • 8/10/2019 Clustering Hierarchical

    6/31

    6

    Dendrogram

    A clustering of the data objects is obtained by cutting the dendrogram at the desired level; each connected component then forms a cluster
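    A minimal sketch of this cut with SciPy; the points and the cut height 2.5 are arbitrary choices for illustration.

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    data = np.array([[0.0, 0.0], [0.0, 1.0], [4.0, 4.0], [6.0, 0.0], [6.0, 1.0]])
    Z = linkage(data, method='single')

    # Cut the dendrogram at height 2.5: every merge above this distance is
    # undone, and each remaining connected component becomes one cluster
    labels = fcluster(Z, t=2.5, criterion='distance')
    print(labels)  # one cluster id per object, e.g. [1 1 2 3 3]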

  • 8/10/2019 Clustering Hierarchical

    7/31

    Agglomerative Clustering Algorithm

    The more popular hierarchical clustering technique

    The basic algorithm is straightforward:

    1. Compute the distance matrix
    2. Let each data point be a cluster
    3. Repeat
    4.   Merge the two closest clusters
    5.   Update the distance matrix
    6. Until only a single cluster remains

    The key operation is the computation of the distance between two clusters

    Different approaches to defining the distance between clusters distinguish the different algorithms
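    To make steps 1-6 concrete, here is a minimal from-scratch sketch of the algorithm with the single-link cluster distance; the function name and the brute-force search are illustrative, not code from the lecture.

    import math

    def agglomerative_single_link(points):
        """Naive agglomerative clustering; returns the sequence of merges."""
        clusters = [[p] for p in points]       # step 2: each point is a cluster
        merges = []
        while len(clusters) > 1:               # steps 3/6: repeat until one cluster
            # step 4: find the two closest clusters; with single link, the
            # cluster distance is the closest pair across the two clusters
            # (distances are recomputed here instead of caching a matrix, steps 1/5)
            i, j, best = 0, 1, float('inf')
            for a in range(len(clusters)):
                for b in range(a + 1, len(clusters)):
                    d = min(math.dist(p, q)
                            for p in clusters[a] for q in clusters[b])
                    if d < best:
                        i, j, best = a, b, d
            merges.append((clusters[i], clusters[j], best))
            clusters[i] = clusters[i] + clusters[j]   # merge the two clusters
            del clusters[j]
        return merges

    for a, b, d in agglomerative_single_link([(0, 0), (0, 1), (4, 4), (6, 0), (6, 1)]):
        print(f'merge {a} + {b} at distance {d:.2f}')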

  • Starting Situation

    Start with clusters of individual points and a distance matrix

    [Figure: points p1-p5 and the initial pairwise distance matrix indexed by p1, p2, p3, p4, p5, ...]

  • Intermediate Situation

    After some merging steps, we have some clusters

    Choose the two clusters that have the smallest distance (largest similarity) to merge

    [Figure: five intermediate clusters C1-C5 and the distance matrix indexed by C1, C2, C3, C4, C5]

  • Intermediate Situation

    We want to merge the two closest clusters (C2 and C5) and update the distance matrix

    [Figure: clusters C1-C5 with C2 and C5 about to be merged, and the distance matrix indexed by C1, C2, C3, C4, C5]

  • After Merging

    The question is: how do we update the distance matrix? (one standard update rule is sketched below)

    [Figure: after the merge, the clusters are C1, C2 U C5, C3, C4; the row and column for C2 U C5 in the distance matrix are unknown, marked "?"]
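    A sketch of one standard answer: recompute the merged cluster's row from the two old rows using the chosen linkage rule. With single link, d(C2 U C5, Ck) = min(d(C2, Ck), d(C5, Ck)); complete link uses max, and group average a size-weighted mean. The dictionary layout and distance values below are made up for illustration.

    # Current inter-cluster distances, keyed by unordered cluster pairs
    d = {frozenset({'C1', 'C2'}): 4.0, frozenset({'C1', 'C3'}): 6.0,
         frozenset({'C1', 'C4'}): 7.0, frozenset({'C1', 'C5'}): 5.0,
         frozenset({'C2', 'C3'}): 3.0, frozenset({'C2', 'C4'}): 8.0,
         frozenset({'C2', 'C5'}): 1.0, frozenset({'C3', 'C4'}): 2.5,
         frozenset({'C3', 'C5'}): 3.5, frozenset({'C4', 'C5'}): 9.0}

    def merge(d, a, b, others):
        """Replace clusters a and b by their union; single-link update:
        d(a+b, k) = min(d(a, k), d(b, k)) for every other cluster k."""
        merged = a + '+' + b
        for k in others:
            d[frozenset({merged, k})] = min(d.pop(frozenset({a, k})),
                                            d.pop(frozenset({b, k})))
        del d[frozenset({a, b})]
        return merged

    merge(d, 'C2', 'C5', ['C1', 'C3', 'C4'])
    print(d)  # rows/columns for C2 and C5 replaced by one row for C2+C5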

  • How to Define Inter-Cluster Distance

    [Figure: two groups among points p1-p5 with the question "Distance?" between them, next to the distance matrix]

    MIN

    MAX

    Group Average

    Distance Between Centroids
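    All four definitions can be computed directly from the pairwise distances between the two clusters; a minimal NumPy/SciPy sketch with two made-up point sets:

    import numpy as np
    from scipy.spatial.distance import cdist

    Ci = np.array([[0.0, 0.0], [0.0, 1.0]])              # cluster Ci
    Cj = np.array([[4.0, 4.0], [6.0, 0.0], [6.0, 1.0]])  # cluster Cj

    D = cdist(Ci, Cj)            # all pairwise distances across the clusters
    print('MIN (single link):   ', D.min())
    print('MAX (complete link): ', D.max())
    print('Group average:       ', D.mean())
    print('Between centroids:   ',
          np.linalg.norm(Ci.mean(axis=0) - Cj.mean(axis=0)))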

  • MIN or Single Link

    Inter-cluster distance

    The distance between two clusters is represented by the distance of the closest pair of data objects belonging to different clusters

    Determined by one pair of points, i.e., by one link in the proximity graph

    $$d_{\min}(C_i, C_j) = \min_{p \in C_i,\, q \in C_j} d(p, q)$$

  • MIN

    [Figure: nested single-link clusters over points 1-6 (left) and the corresponding dendrogram, with merge heights between 0.05 and 0.2 (right)]
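    A sketch of how such a plot can be produced, assuming matplotlib is available; the six 2-D points are invented stand-ins for points 1-6.

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.cluster.hierarchy import linkage, dendrogram

    pts = np.array([[0.40, 0.53], [0.22, 0.38], [0.35, 0.32],
                    [0.26, 0.19], [0.08, 0.41], [0.45, 0.30]])
    Z = linkage(pts, method='single')       # MIN / single link
    dendrogram(Z, labels=[1, 2, 3, 4, 5, 6])
    plt.ylabel('merge distance')
    plt.show()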

    14

  • 8/10/2019 Clustering Hierarchical

    15/31

    Strength of MIN

    [Figure: original points (left) and the two clusters found by MIN (right)]

    Can handle non-elliptical shapes

  • Limitations of MIN

    [Figure: original points (left) and the two clusters found by MIN (right)]

    Sensitive to noise and outliers

  • MAX or Complete Link

    Inter-cluster distance

    The distance between two clusters is represented by the distance of the farthest pair of data objects belonging to different clusters

    $$d_{\max}(C_i, C_j) = \max_{p \in C_i,\, q \in C_j} d(p, q)$$

  • MAX

    [Figure: nested complete-link clusters over points 1-6 (left) and the corresponding dendrogram, with merge heights up to about 0.4 (right)]

  • Strength of MAX

    [Figure: original points (left) and the two clusters found by MAX (right)]

    Less susceptible to noise and outliers

  • Limitations of MAX

    [Figure: original points and the resulting MAX clustering]

    Tends to break large clusters

  • Limitations of MAX

    [Figure: the same data clustered by MIN (2 clusters) and by MAX (2 clusters)]

    Biased towards globular clusters

  • 8/10/2019 Clustering Hierarchical

    22/31

    Group Average or Average Link

    Inter-cluster distance

    The distance between two clusters is represented by the average distance over all pairs of data objects belonging to different clusters

    Determined by all pairs of points in the two clusters

    $$d_{\mathrm{avg}}(C_i, C_j) = \frac{1}{|C_i|\,|C_j|} \sum_{p \in C_i} \sum_{q \in C_j} d(p, q)$$

  • Group Average

    [Figure: nested average-link clusters over points 1-6 (left) and the corresponding dendrogram, with merge heights up to about 0.25 (right)]

  • Group Average

    Compromise between single link and complete link

    Strengths

    Less susceptible to noise and outliers

    Limitations

    Biased towards globular clusters

  • Ward's Method

    Similarity of two clusters is based on the increase in squared error when the two clusters are merged

    Similar to group average if the distance between points is the distance squared

    Less susceptible to noise and outliers

    Biased towards globular clusters

    Hierarchical analogue of K-means; can be used to initialize K-means (see the sketch below)
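    A minimal sketch of that last point, assuming scikit-learn: cut a Ward hierarchy into k clusters, take their centroids, and hand those to K-means as initial centers.

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    X = rng.normal(size=(30, 2))      # made-up data
    k = 3

    # Ward hierarchy, cut so that exactly k clusters remain
    labels = fcluster(linkage(X, method='ward'), t=k, criterion='maxclust')
    centroids = np.array([X[labels == c].mean(axis=0) for c in range(1, k + 1)])

    # Use the Ward centroids to initialize K-means
    km = KMeans(n_clusters=k, init=centroids, n_init=1).fit(X)
    print(km.labels_)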

  • Comparison

    [Figure: the same six points clustered by MIN, MAX, Group Average, and Ward's Method, showing the different groupings each inter-cluster distance produces]

  • Time and Space Requirements

    O(N^2) space, since it uses the distance matrix

    N is the number of points

    O(N^3) time in many cases

    There are N steps, and at each step the distance matrix, of size N^2, must be updated and searched

    Complexity can be reduced to O(N^2 log N) time for some approaches

  • Strengths

    Do not have to assume any particular number of clusters

    Any desired number of clusters can be obtained by cutting the dendrogram at the proper level

    The clusters may correspond to meaningful taxonomies

    e.g., shopping websites: electronics (computer, camera, ...), furniture, groceries

  • Problems and Limitations

    Once a decision is made to combine two clusters, it cannot be undone

    No objective function is directly minimized

    Different schemes have problems with one or more of the following:

    Sensitivity to noise and outliers

    Difficulty handling different-sized clusters and irregular shapes

    Breaking large clusters

  • Take-away Message

    Agglomerative and divisive hierarchical clustering

    Several ways of defining inter-cluster distance

    The properties of the clusters produced by approaches based on different inter-cluster distance definitions

    Pros and cons of hierarchical clustering
