CS583 Clustering


Transcript of CS583 Clustering


    Chapter 5: Clustering


Searching for groups

Clustering is unsupervised or undirected. Unlike classification, there is no pre-classified data. We search for groups or clusters of data points (records) that are similar to one another.

Similar points may mean similar customers or products that will behave in similar ways.


Group similar points together

Group points into classes using some distance measure: within-cluster distance and between-cluster distance.

Applications:
As a stand-alone tool to get insight into the data distribution
As a preprocessing step for other algorithms


    An Illustration


Examples of Clustering Applications

Marketing: help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs.

Insurance: identify groups of motor insurance policy holders with some interesting characteristics.

City-planning: identify groups of houses according to their house type, value, and geographical location.


Concepts of Clustering

Clusters: different ways of representing clusters
Division with boundaries
Spheres
Probabilistic
Dendrograms

(Probabilistic representation: a table of membership probabilities, e.g., item I1 belongs to clusters 1, 2, and 3 with probabilities 0.5, 0.2, and 0.3.)


Clustering

Clustering quality:
Inter-cluster distance maximized
Intra-cluster distance minimized

The quality of a clustering result depends on both the similarity measure used by the method and its application.

The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns.

Clustering vs. classification: which one is more difficult? Why?

There are a huge number of clustering techniques.


Dissimilarity/Distance Measure

Dissimilarity/Similarity metric: similarity is expressed in terms of a distance function, which is typically a metric: d(i, j).

The definitions of distance functions are usually very different for interval-scaled, boolean, categorical, ordinal, and ratio variables.

Weights should be associated with different variables based on applications and data semantics.

It is hard to define "similar enough" or "good enough". The answer is typically highly subjective.


Types of data in clustering analysis

Interval-scaled variables
Binary variables
Nominal, ordinal, and ratio variables
Variables of mixed types


Interval-valued variables

Continuous measurements on a roughly linear scale, e.g., weight, height, temperature, etc.

Standardize the data (depending on the application):

Calculate the mean absolute deviation:
$s_f = \frac{1}{n}(|x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f|)$
where
$m_f = \frac{1}{n}(x_{1f} + x_{2f} + \cdots + x_{nf})$

Calculate the standardized measurement (z-score):
$z_{if} = \frac{x_{if} - m_f}{s_f}$
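A minimal sketch of this standardization in Python (the function name standardize and the example data are illustrative, not from the slides):

```python
def standardize(values):
    """Standardize one interval-scaled variable using the mean absolute deviation."""
    n = len(values)
    m = sum(values) / n                      # mean m_f
    s = sum(abs(x - m) for x in values) / n  # mean absolute deviation s_f
    return [(x - m) / s for x in values]     # z-scores z_if

# Example: heights in cm
print(standardize([150, 160, 170, 180, 190]))
```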


Similarity Between Objects

Distance: measures the similarity or dissimilarity between two data objects.

Some popular ones include the Minkowski distance:
$d(i,j) = \left(|x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \cdots + |x_{ip} - x_{jp}|^q\right)^{1/q}$
where $(x_{i1}, x_{i2}, \ldots, x_{ip})$ and $(x_{j1}, x_{j2}, \ldots, x_{jp})$ are two p-dimensional data objects, and q is a positive integer.

If q = 1, d is the Manhattan distance:
$d(i,j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}|$


Similarity Between Objects (Cont.)

If q = 2, d is the Euclidean distance:
$d(i,j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2}$

Properties:
$d(i,j) \ge 0$
$d(i,i) = 0$
$d(i,j) = d(j,i)$
$d(i,j) \le d(i,k) + d(k,j)$

Also, one can use weighted distance and many other similarity/distance measures.
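A small sketch of these distance functions in Python (the function name minkowski and the example vectors are illustrative):

```python
def minkowski(x, y, q):
    """Minkowski distance between two p-dimensional points x and y."""
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)

x, y = (1.0, 2.0, 3.0), (4.0, 6.0, 3.0)
print(minkowski(x, y, 1))  # q = 1: Manhattan distance, 7.0
print(minkowski(x, y, 2))  # q = 2: Euclidean distance, 5.0
```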


Dissimilarity of Binary Variables: Example

Gender is a symmetric attribute (not used below); the remaining attributes are asymmetric attributes.
Let the values Y and P be set to 1, and the value N be set to 0.

    Name Gender Fever Cough Test-1 Test-2 Test-3 Test-4

    Jack M Y N P N N N

    Mary F Y N P N P N

    Jim M Y P N N N N

Using the asymmetric binary (Jaccard) dissimilarity d(i,j) = (r + s) / (q + r + s), where q is the number of attributes equal to 1 for both objects, r the number equal to 1 for i only, and s the number equal to 1 for j only:

d(jack, mary) = (0 + 1) / (2 + 0 + 1) = 0.33
d(jack, jim) = (1 + 1) / (1 + 1 + 1) = 0.67
d(jim, mary) = (1 + 2) / (1 + 1 + 2) = 0.75
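A short sketch that reproduces these numbers (the function name asym_binary_dissim and the 0/1 encoding of the records are illustrative):

```python
def asym_binary_dissim(i, j):
    """Asymmetric binary (Jaccard) dissimilarity between two 0/1 vectors;
    attributes where both objects are 0 are ignored."""
    q = sum(1 for a, b in zip(i, j) if a == 1 and b == 1)
    r = sum(1 for a, b in zip(i, j) if a == 1 and b == 0)
    s = sum(1 for a, b in zip(i, j) if a == 0 and b == 1)
    return (r + s) / (q + r + s)

# Fever, Cough, Test-1..Test-4, with Y/P -> 1 and N -> 0
jack = (1, 0, 1, 0, 0, 0)
mary = (1, 0, 1, 0, 1, 0)
jim  = (1, 1, 0, 0, 0, 0)
print(asym_binary_dissim(jack, mary))  # 0.33...
print(asym_binary_dissim(jack, jim))   # 0.67 (rounded)
print(asym_binary_dissim(jim, mary))   # 0.75
```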


Nominal Variables

A generalization of the binary variable in that it can take more than 2 states, e.g., red, yellow, blue, green, etc.

Method 1: simple matching
$d(i,j) = \frac{p - m}{p}$
where m is the number of matches and p is the total number of variables.

Method 2: use a large number of binary variables, creating a new binary variable for each of the M nominal states.


Ordinal Variables

An ordinal variable can be discrete or continuous. Order is important, e.g., rank.

Can be treated like interval-scaled (f is a variable):
replace $x_{if}$ by its rank $r_{if} \in \{1, \ldots, M_f\}$
map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by
$z_{if} = \frac{r_{if} - 1}{M_f - 1}$
compute the dissimilarity using methods for interval-scaled variables.
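A brief sketch of the simple-matching measure from the previous slide and the ordinal rank mapping above (the function names are illustrative):

```python
def simple_matching(i, j):
    """Nominal dissimilarity (p - m) / p, where m is the number of matches."""
    p = len(i)
    m = sum(1 for a, b in zip(i, j) if a == b)
    return (p - m) / p

def ordinal_to_interval(rank, M):
    """Map a rank in {1, ..., M} onto [0, 1]."""
    return (rank - 1) / (M - 1)

print(simple_matching(("red", "small", "round"), ("red", "large", "round")))  # 0.33...
print(ordinal_to_interval(3, 5))  # 0.5
```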


Ratio-Scaled Variables

Ratio-scaled variable: a measurement on a nonlinear scale, approximately at exponential scale, such as $Ae^{Bt}$ or $Ae^{-Bt}$, e.g., the growth of a bacteria population.

Methods:
Treat them like interval-scaled variables: not a good idea! (Why? The scale can be distorted.)
Apply a logarithmic transformation: $y_{if} = \log(x_{if})$
Treat them as continuous ordinal data and then treat their ranks as interval-scaled.


Variables of Mixed Types

A database may contain all six types of variables: symmetric binary, asymmetric binary, nominal, ordinal, interval, and ratio.

One may use a weighted formula to combine their effects:
$d(i,j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}$
(the indicator $\delta_{ij}^{(f)}$ is 1 when variable f contributes to the comparison, and 0 when a value is missing or, for asymmetric binary variables, when both values are 0)

If f is binary or nominal: $d_{ij}^{(f)} = 0$ if $x_{if} = x_{jf}$, and $d_{ij}^{(f)} = 1$ otherwise.
If f is interval-based: use the normalized distance.
If f is ordinal or ratio-scaled: compute ranks $r_{if}$ and $z_{if} = \frac{r_{if} - 1}{M_f - 1}$, and treat $z_{if}$ as interval-scaled.
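A compact sketch of this weighted combination, assuming each variable is labeled with its type (the mixed_dissim function, the type labels, and the example records are illustrative, not from the slides; ordinal and ratio variables are assumed to have been mapped to [0, 1] already):

```python
def mixed_dissim(x, y, types, ranges):
    """Weighted mixed-type dissimilarity d(i,j) = sum(delta * d_f) / sum(delta).
    types[f]  : 'nominal', 'binary', 'asym_binary', or 'interval'
    ranges[f] : (min, max) of variable f, used to normalize interval variables."""
    num, den = 0.0, 0.0
    for f, (a, b) in enumerate(zip(x, y)):
        if a is None or b is None:
            continue                                  # delta = 0: missing value
        if types[f] == 'asym_binary' and a == 0 and b == 0:
            continue                                  # delta = 0: both absent
        if types[f] in ('nominal', 'binary', 'asym_binary'):
            d = 0.0 if a == b else 1.0
        else:                                         # interval (or pre-scaled ordinal/ratio)
            lo, hi = ranges[f]
            d = abs(a - b) / (hi - lo)
        num += d                                      # delta = 1 for this variable
        den += 1.0
    return num / den

types  = ['nominal', 'asym_binary', 'interval']
ranges = [None, None, (0.0, 100.0)]
print(mixed_dissim(['red', 1, 30.0], ['blue', 0, 70.0], types, ranges))  # 0.8
```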


Major Clustering Techniques

Partitioning algorithms: construct various partitions and then evaluate them by some criterion.

Hierarchy algorithms: create a hierarchical decomposition of the set of data (or objects) using some criterion.

Density-based: based on connectivity and density functions.

Model-based: a model is hypothesized for each of the clusters, and the idea is to find the best fit of the data to the models.


Partitioning Algorithms: Basic Concept

Partitioning method: construct a partition of a database D of n objects into a set of k clusters.

Given k, find a partition of k clusters that optimizes the chosen partitioning criterion.
Global optimal: exhaustively enumerate all partitions.
Heuristic methods: k-means and k-medoids algorithms.
k-means: each cluster is represented by the center of the cluster.
k-medoids or PAM (Partition Around Medoids): each cluster is represented by one of the objects in the cluster.


The K-Means Clustering

Given k, the k-means algorithm is as follows:
1) Choose k cluster centers to coincide with k randomly chosen points.
2) Assign each data point to the closest cluster center.
3) Recompute the cluster centers using the current cluster memberships.
4) If a convergence criterion is not met, go to 2).

Typical convergence criteria are: no (or minimal) reassignment of data points to new cluster centers, or minimal decrease in the squared error
$E = \sum_{i=1}^{k} \sum_{p \in C_i} |p - m_i|^2$
where p is a point and $m_i$ is the mean of cluster $C_i$.


Example

For simplicity, 1-dimensional data and k = 2.
Data: 1, 2, 5, 6, 7

K-means: randomly select 5 and 6 as initial centroids;
=> two clusters {1,2,5} and {6,7}; meanC1 = 8/3, meanC2 = 6.5
=> {1,2}, {5,6,7}; meanC1 = 1.5, meanC2 = 6
=> no change.

Aggregate dissimilarity (squared error) = 0.5^2 + 0.5^2 + 1^2 + 0^2 + 1^2 = 2.5
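A minimal sketch of the k-means procedure above, run on this 1-dimensional example (the function name kmeans and the fixed initial centers are illustrative; empty clusters are not handled):

```python
def kmeans(points, centers, max_iter=100):
    """Plain k-means: assign each point to the nearest center, recompute
    centers as cluster means, repeat until the centers stop changing."""
    clusters = []
    for _ in range(max_iter):
        clusters = [[] for _ in centers]
        for p in points:
            i = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[i].append(p)
        new_centers = [sum(c) / len(c) for c in clusters]
        if new_centers == centers:   # convergence: no change in the centers
            break
        centers = new_centers
    error = sum((p - centers[i]) ** 2 for i, c in enumerate(clusters) for p in c)
    return clusters, centers, error

print(kmeans([1, 2, 5, 6, 7], centers=[5, 6]))
# ([[1, 2], [5, 6, 7]], [1.5, 6.0], 2.5)
```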


Comments on K-Means

Strength: efficient: O(tkn), where n is the number of data points, k is the number of clusters, and t is the number of iterations. Normally, k, t << n.


Variations of the K-Means Method

A few variants of k-means differ in:
Selection of the initial k seeds
Dissimilarity measures
Strategies to calculate cluster means

Handling categorical data: k-modes
Replacing means of clusters with modes
Using new dissimilarity measures to deal with categorical objects
Using a frequency-based method to update modes of clusters


k-Medoids clustering method

The k-means algorithm is sensitive to outliers, since an object with an extremely large value may substantially distort the distribution of the data.

Medoid: the most centrally located point in a cluster, used as a representative point of the cluster. In contrast, a centroid is not necessarily inside a cluster.

(Example figure: initial medoids.)


Partition Around Medoids

PAM:
1. Given k.
2. Randomly pick k instances as initial medoids.
3. Assign each data point to the nearest medoid x.
4. Calculate the objective function: the sum of dissimilarities of all points to their nearest medoids (squared-error criterion).
5. Randomly select a point y.
6. Swap x with y if the swap reduces the objective function.
7. Repeat (3-6) until no change.
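A simplified sketch of PAM for one-dimensional data, using absolute difference as the dissimilarity (the function names and the randomized swap loop are illustrative; the real algorithm examines swaps more systematically):

```python
import random

def cost(points, medoids):
    """Sum of dissimilarities of all points to their nearest medoid."""
    return sum(min(abs(p - m) for m in medoids) for p in points)

def pam(points, k, max_iter=1000, seed=0):
    """Simplified PAM: start from k random medoids and keep any
    medoid/non-medoid swap that reduces the objective function."""
    rng = random.Random(seed)
    medoids = rng.sample(points, k)
    best = cost(points, medoids)
    for _ in range(max_iter):
        x = rng.choice(medoids)                                  # a current medoid
        y = rng.choice([p for p in points if p not in medoids])  # a non-medoid
        candidate = [y if m == x else m for m in medoids]
        c = cost(points, candidate)
        if c < best:                                             # accept improving swap
            medoids, best = candidate, c
    return medoids, best

print(pam([1, 2, 5, 6, 7, 100], k=2))
```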


Comments on PAM

PAM is more robust than k-means in the presence of noise and outliers, because a medoid is less influenced by outliers or other extreme values than a mean. (Why?)

PAM works well for small data sets but does not scale well for large data sets: O(k(n - k)^2) for each change, where n is the number of data points and k is the number of clusters.

(Example figure: an outlier 100 units away.)


CLARA: Clustering Large Applications

CLARA: built into statistical analysis packages, such as S+.

It draws multiple samples of the data set, applies PAM on each sample, and gives the best clustering as the output.

Strength: deals with larger data sets than PAM.

Weaknesses:
Efficiency depends on the sample size.
A good clustering based on samples will not necessarily represent a good clustering of the whole data set if the sample is biased.

There are other scale-up methods, e.g., CLARANS.
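A rough sketch of the CLARA idea, reusing the pam and cost sketches above (the number of samples and the sample size are illustrative parameters):

```python
def clara(points, k, num_samples=5, sample_size=40, seed=0):
    """CLARA sketch: run PAM on several random samples and keep the
    medoids that give the lowest cost on the whole data set."""
    rng = random.Random(seed)
    best_medoids, best_cost = None, float('inf')
    for _ in range(num_samples):
        sample = rng.sample(points, min(sample_size, len(points)))
        medoids, _ = pam(sample, k)
        c = cost(points, medoids)        # evaluate on the full data set
        if c < best_cost:
            best_medoids, best_cost = medoids, c
    return best_medoids, best_cost

print(clara(list(range(100)) + list(range(500, 600)), k=2))
```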


Hierarchical Clustering

Use a distance matrix for clustering. This method does not require the number of clusters k as an input, but needs a termination condition.

(Figure: objects a, b, c, d, e are merged step by step from step 0 to step 4 in agglomerative clustering; divisive clustering proceeds in the reverse direction, from step 4 back to step 0.)


Agglomerative Clustering

At the beginning, each data point forms a cluster (also called a node). Merge the nodes/clusters that have the least dissimilarity. Go on merging; eventually all nodes belong to the same cluster.

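A short sketch of agglomerative clustering using SciPy, assuming it is installed (the example 2-D points and the single-link merge criterion are illustrative choices):

```python
from scipy.cluster.hierarchy import linkage, fcluster

# Five 2-D points: two tight groups
points = [[0, 0], [0, 1], [1, 0], [8, 8], [8, 9]]

# Agglomerative clustering: repeatedly merge the least-dissimilar clusters
Z = linkage(points, method='single')          # single-link (minimum) dissimilarity

# Cut the hierarchy into 2 flat clusters
labels = fcluster(Z, t=2, criterion='maxclust')
print(labels)                                  # e.g., [1 1 1 2 2]
```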


A Dendrogram Shows How the Clusters are Merged Hierarchically

Decompose data objects into several levels of nested partitioning (a tree of clusters), called a dendrogram.

A clustering of the data objects is obtained by cutting the dendrogram at the desired level; then each connected component forms a cluster.


    Divisive Clustering

    Inverse order of agglomerative clustering

    Eventually each node forms a cluster on its own



More on Hierarchical Methods

Major weaknesses of agglomerative clustering methods:
They do not scale well: time complexity is at least O(n^2), where n is the total number of objects.
They can never undo what was done previously.

Integration of hierarchical with distance-based clustering to scale up these clustering methods:
BIRCH (1996): uses a CF-tree and incrementally adjusts the quality of sub-clusters.
CURE (1998): selects well-scattered points from the cluster and then shrinks them towards the center of the cluster by a specified fraction.


Summary

Cluster analysis groups objects based on their similarity and has wide applications.
Measures of similarity can be computed for various types of data.
Clustering algorithms can be categorized into partitioning methods, hierarchical methods, density-based methods, etc.
Clustering can also be used for outlier detection, which is useful for fraud detection.
What is the best clustering algorithm?


    Other Data Mining Methods


Sequence analysis

Market basket analysis analyzes things that happen at the same time. What about things that happen over time? E.g., if a customer buys a bed, he/she is likely to come back to buy a mattress later.

Sequential analysis needs:
a time stamp for each data record
customer identification


Sequence analysis (cont.)

The analysis shows which items come before, after, or at the same time as other items.
Sequential patterns can be used for analyzing cause and effect.

Other applications:
Finding cycles in association rules: some association rules hold strongly in certain periods of time, e.g., every Monday people buy items X and Y together.
Stock market prediction
Predicting possible failures in networks, etc.


Discovering holes in data

Holes are empty (sparse) regions in the data space that contain few or no data points. Holes may represent impossible value combinations in the application domain.

E.g., in a disease database, we may find that certain test values and/or symptoms do not go together, or that when a certain medicine is used, some test value never goes beyond a certain range.

Such information could lead to a significant discovery: a cure for a disease or some biological law.


Data and pattern visualization

Data visualization: use computer graphics effects to reveal the patterns in data: 2-D and 3-D scatter plots, bar charts, pie charts, line plots, animation, etc.

Pattern visualization: use good interfaces and graphics to present the results of data mining: rule visualizers, cluster visualizers, etc.


Scaling up data mining algorithms

Adapt data mining algorithms to work on very large databases, where the data reside on hard disk (too large to fit in main memory):
Make fewer passes over the data.
Quadratic algorithms are too expensive; many data mining algorithms are quadratic, especially clustering algorithms.