CS583 Clustering


Transcript of CS583 Clustering


    Chapter 5: Clustering


Searching for groups

Clustering is unsupervised or undirected. Unlike classification, there is no pre-classified data. We search for groups or clusters of data points (records) that are similar to one another.

Similar points may mean similar customers or products that will behave in similar ways.


Group similar points together

Group points into classes using some distance measure: within-cluster distance and between-cluster distance.

Applications:
As a stand-alone tool to get insight into the data distribution
As a preprocessing step for other algorithms


    An Illustration


Examples of Clustering Applications

Marketing: help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs.

Insurance: identify groups of motor insurance policy holders with some interesting characteristics.

City-planning: identify groups of houses according to their house type, value, and geographical location.


Concepts of Clustering

Clusters: different ways of representing clusters
Division with boundaries
Spheres
Probabilistic
Dendrograms

(Probabilistic representation: a table of membership probabilities, e.g., item I1 belongs to clusters 1, 2, and 3 with probabilities 0.5, 0.2, and 0.3.)


Clustering

Clustering quality:
Inter-cluster distance maximized
Intra-cluster distance minimized

The quality of a clustering result depends on both the similarity measure used by the method and its application.

The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns.

Clustering vs. classification: which one is more difficult? Why?

There are a huge number of clustering techniques.


Dissimilarity/Distance Measure

Dissimilarity/Similarity metric: similarity is expressed in terms of a distance function, which is typically a metric: d(i, j).

The definitions of distance functions are usually very different for interval-scaled, boolean, categorical, ordinal, and ratio variables.

Weights should be associated with different variables based on applications and data semantics.

It is hard to define "similar enough" or "good enough". The answer is typically highly subjective.


Types of data in clustering analysis

Interval-scaled variables
Binary variables
Nominal, ordinal, and ratio variables
Variables of mixed types


Interval-valued variables

Continuous measurements on a roughly linear scale, e.g., weight, height, temperature, etc.

Standardize the data (depending on the application):

Calculate the mean absolute deviation:
$s_f = \frac{1}{n}(|x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f|)$
where
$m_f = \frac{1}{n}(x_{1f} + x_{2f} + \cdots + x_{nf})$

Calculate the standardized measurement (z-score):
$z_{if} = \frac{x_{if} - m_f}{s_f}$
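A minimal sketch of this standardization in Python (the function name standardize and the example data are illustrative, not from the slides):

```python
def standardize(values):
    """Standardize one interval-scaled variable using the mean absolute deviation."""
    n = len(values)
    m = sum(values) / n                      # mean m_f
    s = sum(abs(x - m) for x in values) / n  # mean absolute deviation s_f
    return [(x - m) / s for x in values]     # z-scores z_if

# Example: heights in cm
print(standardize([150, 160, 170, 180, 190]))
```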


Similarity Between Objects

Distance: measures the similarity or dissimilarity between two data objects.

Some popular ones include the Minkowski distance:
$d(i,j) = \left(|x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \cdots + |x_{ip} - x_{jp}|^q\right)^{1/q}$
where $(x_{i1}, x_{i2}, \ldots, x_{ip})$ and $(x_{j1}, x_{j2}, \ldots, x_{jp})$ are two p-dimensional data objects, and q is a positive integer.

If q = 1, d is the Manhattan distance:
$d(i,j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}|$


Similarity Between Objects (Cont.)

If q = 2, d is the Euclidean distance:
$d(i,j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2}$

Properties:
$d(i,j) \ge 0$
$d(i,i) = 0$
$d(i,j) = d(j,i)$
$d(i,j) \le d(i,k) + d(k,j)$

Also, one can use weighted distance and many other similarity/distance measures.
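A small sketch of these distance functions in Python (the function name minkowski and the example vectors are illustrative):

```python
def minkowski(x, y, q):
    """Minkowski distance between two p-dimensional points x and y."""
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)

x, y = (1.0, 2.0, 3.0), (4.0, 6.0, 3.0)
print(minkowski(x, y, 1))  # q = 1: Manhattan distance, 7.0
print(minkowski(x, y, 2))  # q = 2: Euclidean distance, 5.0
```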


Dissimilarity of Binary Variables: Example

Gender is a symmetric attribute (not used below); the remaining attributes are asymmetric attributes.
Let the values Y and P be set to 1, and the value N be set to 0.

    Name Gender Fever Cough Test-1 Test-2 Test-3 Test-4

    Jack M Y N P N N N

    Mary F Y N P N P N

    Jim M Y P N N N N

Using the asymmetric binary (Jaccard) dissimilarity d(i,j) = (r + s) / (q + r + s), where q is the number of attributes equal to 1 for both objects, r the number equal to 1 for i only, and s the number equal to 1 for j only:

d(jack, mary) = (0 + 1) / (2 + 0 + 1) = 0.33
d(jack, jim) = (1 + 1) / (1 + 1 + 1) = 0.67
d(jim, mary) = (1 + 2) / (1 + 1 + 2) = 0.75
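A short sketch that reproduces these numbers (the function name asym_binary_dissim and the 0/1 encoding of the records are illustrative):

```python
def asym_binary_dissim(i, j):
    """Asymmetric binary (Jaccard) dissimilarity between two 0/1 vectors;
    attributes where both objects are 0 are ignored."""
    q = sum(1 for a, b in zip(i, j) if a == 1 and b == 1)
    r = sum(1 for a, b in zip(i, j) if a == 1 and b == 0)
    s = sum(1 for a, b in zip(i, j) if a == 0 and b == 1)
    return (r + s) / (q + r + s)

# Fever, Cough, Test-1..Test-4, with Y/P -> 1 and N -> 0
jack = (1, 0, 1, 0, 0, 0)
mary = (1, 0, 1, 0, 1, 0)
jim  = (1, 1, 0, 0, 0, 0)
print(asym_binary_dissim(jack, mary))  # 0.33...
print(asym_binary_dissim(jack, jim))   # 0.67 (rounded)
print(asym_binary_dissim(jim, mary))   # 0.75
```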


Nominal Variables

A generalization of the binary variable in that it can take more than 2 states, e.g., red, yellow, blue, green, etc.

Method 1: simple matching
$d(i,j) = \frac{p - m}{p}$
where m is the number of matches and p is the total number of variables.

Method 2: use a large number of binary variables, creating a new binary variable for each of the M nominal states.


Ordinal Variables

An ordinal variable can be discrete or continuous. Order is important, e.g., rank.

Can be treated like interval-scaled (f is a variable):
replace $x_{if}$ by its rank $r_{if} \in \{1, \ldots, M_f\}$
map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by
$z_{if} = \frac{r_{if} - 1}{M_f - 1}$
compute the dissimilarity using methods for interval-scaled variables.
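A brief sketch of the simple-matching measure from the previous slide and the ordinal rank mapping above (the function names are illustrative):

```python
def simple_matching(i, j):
    """Nominal dissimilarity (p - m) / p, where m is the number of matches."""
    p = len(i)
    m = sum(1 for a, b in zip(i, j) if a == b)
    return (p - m) / p

def ordinal_to_interval(rank, M):
    """Map a rank in {1, ..., M} onto [0, 1]."""
    return (rank - 1) / (M - 1)

print(simple_matching(("red", "small", "round"), ("red", "large", "round")))  # 0.33...
print(ordinal_to_interval(3, 5))  # 0.5
```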


Ratio-Scaled Variables

Ratio-scaled variable: a measurement on a nonlinear scale, approximately at exponential scale, such as $Ae^{Bt}$ or $Ae^{-Bt}$, e.g., the growth of a bacteria population.

Methods:
Treat them like interval-scaled variables: not a good idea! (Why? The scale can be distorted.)
Apply a logarithmic transformation: $y_{if} = \log(x_{if})$
Treat them as continuous ordinal data and then treat their ranks as interval-scaled.


Variables of Mixed Types

A database may contain all six types of variables: symmetric binary, asymmetric binary, nominal, ordinal, interval, and ratio.

One may use a weighted formula to combine their effects:
$d(i,j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}$
(the indicator $\delta_{ij}^{(f)}$ is 1 when variable f contributes to the comparison, and 0 when a value is missing or, for asymmetric binary variables, when both values are 0)

If f is binary or nominal: $d_{ij}^{(f)} = 0$ if $x_{if} = x_{jf}$, and $d_{ij}^{(f)} = 1$ otherwise.
If f is interval-based: use the normalized distance.
If f is ordinal or ratio-scaled: compute ranks $r_{if}$ and $z_{if} = \frac{r_{if} - 1}{M_f - 1}$, and treat $z_{if}$ as interval-scaled.
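A compact sketch of this weighted combination, assuming each variable is labeled with its type (the mixed_dissim function, the type labels, and the example records are illustrative, not from the slides; ordinal and ratio variables are assumed to have been mapped to [0, 1] already):

```python
def mixed_dissim(x, y, types, ranges):
    """Weighted mixed-type dissimilarity d(i,j) = sum(delta * d_f) / sum(delta).
    types[f]  : 'nominal', 'binary', 'asym_binary', or 'interval'
    ranges[f] : (min, max) of variable f, used to normalize interval variables."""
    num, den = 0.0, 0.0
    for f, (a, b) in enumerate(zip(x, y)):
        if a is None or b is None:
            continue                                  # delta = 0: missing value
        if types[f] == 'asym_binary' and a == 0 and b == 0:
            continue                                  # delta = 0: both absent
        if types[f] in ('nominal', 'binary', 'asym_binary'):
            d = 0.0 if a == b else 1.0
        else:                                         # interval (or pre-scaled ordinal/ratio)
            lo, hi = ranges[f]
            d = abs(a - b) / (hi - lo)
        num += d                                      # delta = 1 for this variable
        den += 1.0
    return num / den

types  = ['nominal', 'asym_binary', 'interval']
ranges = [None, None, (0.0, 100.0)]
print(mixed_dissim(['red', 1, 30.0], ['blue', 0, 70.0], types, ranges))  # 0.8
```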


Major Clustering Techniques

Partitioning algorithms: construct various partitions and then evaluate them by some criterion.

Hierarchy algorithms: create a hierarchical decomposition of the set of data (or objects) using some criterion.

Density-based: based on connectivity and density functions.

Model-based: a model is hypothesized for each of the clusters, and the idea is to find the best fit of the data to the models.


Partitioning Algorithms: Basic Concept

Partitioning method: construct a partition of a database D of n objects into a set of k clusters.

Given k, find a partition of k clusters that optimizes the chosen partitioning criterion.
Global optimal: exhaustively enumerate all partitions.
Heuristic methods: k-means and k-medoids algorithms.
k-means: each cluster is represented by the center of the cluster.
k-medoids or PAM (Partition Around Medoids): each cluster is represented by one of the objects in the cluster.


The K-Means Clustering

Given k, the k-means algorithm is as follows:
1) Choose k cluster centers to coincide with k randomly chosen points.
2) Assign each data point to the closest cluster center.
3) Recompute the cluster centers using the current cluster memberships.
4) If a convergence criterion is not met, go to 2).

Typical convergence criteria are: no (or minimal) reassignment of data points to new cluster centers, or minimal decrease in the squared error
$E = \sum_{i=1}^{k} \sum_{p \in C_i} |p - m_i|^2$
where p is a point and $m_i$ is the mean of cluster $C_i$.


Example

For simplicity, 1-dimensional data and k = 2.
Data: 1, 2, 5, 6, 7

K-means: randomly select 5 and 6 as initial centroids;
=> two clusters {1,2,5} and {6,7}; meanC1 = 8/3, meanC2 = 6.5
=> {1,2}, {5,6,7}; meanC1 = 1.5, meanC2 = 6
=> no change.

Aggregate dissimilarity (squared error) = 0.5^2 + 0.5^2 + 1^2 + 0^2 + 1^2 = 2.5
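A minimal sketch of the k-means procedure above, run on this 1-dimensional example (the function name kmeans and the fixed initial centers are illustrative; empty clusters are not handled):

```python
def kmeans(points, centers, max_iter=100):
    """Plain k-means: assign each point to the nearest center, recompute
    centers as cluster means, repeat until the centers stop changing."""
    clusters = []
    for _ in range(max_iter):
        clusters = [[] for _ in centers]
        for p in points:
            i = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[i].append(p)
        new_centers = [sum(c) / len(c) for c in clusters]
        if new_centers == centers:   # convergence: no change in the centers
            break
        centers = new_centers
    error = sum((p - centers[i]) ** 2 for i, c in enumerate(clusters) for p in c)
    return clusters, centers, error

print(kmeans([1, 2, 5, 6, 7], centers=[5, 6]))
# ([[1, 2], [5, 6, 7]], [1.5, 6.0], 2.5)
```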


Comments on K-Means

Strength: efficient: O(tkn), where n is the number of data points, k is the number of clusters, and t is the number of iterations. Normally, k, t << n.


Variations of the K-Means Method

A few variants of k-means differ in:
Selection of the initial k seeds
Dissimilarity measures
Strategies to calculate cluster means

Handling categorical data: k-modes
Replacing means of clusters with modes
Using new dissimilarity measures to deal with categorical objects
Using a frequency-based method to update modes of clusters


k-Medoids clustering method

The k-means algorithm is sensitive to outliers, since an object with an extremely large value may substantially distort the distribution of the data.

Medoid: the most centrally located point in a cluster, used as a representative point of the cluster. In contrast, a centroid is not necessarily inside a cluster.

(Example figure: initial medoids.)


Partition Around Medoids

PAM:
1. Given k.
2. Randomly pick k instances as initial medoids.
3. Assign each data point to the nearest medoid x.
4. Calculate the objective function: the sum of dissimilarities of all points to their nearest medoids (squared-error criterion).
5. Randomly select a point y.
6. Swap x with y if the swap reduces the objective function.
7. Repeat (3-6) until no change.
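A simplified sketch of PAM for one-dimensional data, using absolute difference as the dissimilarity (the function names and the randomized swap loop are illustrative; the real algorithm examines swaps more systematically):

```python
import random

def cost(points, medoids):
    """Sum of dissimilarities of all points to their nearest medoid."""
    return sum(min(abs(p - m) for m in medoids) for p in points)

def pam(points, k, max_iter=1000, seed=0):
    """Simplified PAM: start from k random medoids and keep any
    medoid/non-medoid swap that reduces the objective function."""
    rng = random.Random(seed)
    medoids = rng.sample(points, k)
    best = cost(points, medoids)
    for _ in range(max_iter):
        x = rng.choice(medoids)                                  # a current medoid
        y = rng.choice([p for p in points if p not in medoids])  # a non-medoid
        candidate = [y if m == x else m for m in medoids]
        c = cost(points, candidate)
        if c < best:                                             # accept improving swap
            medoids, best = candidate, c
    return medoids, best

print(pam([1, 2, 5, 6, 7, 100], k=2))
```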


Comments on PAM

PAM is more robust than k-means in the presence of noise and outliers, because a medoid is less influenced by outliers or other extreme values than a mean. (Why?)

PAM works well for small data sets but does not scale well for large data sets: O(k(n - k)^2) for each change, where n is the number of data points and k is the number of clusters.

(Example figure: an outlier 100 units away.)


CLARA: Clustering Large Applications

CLARA: built into statistical analysis packages, such as S+.

It draws multiple samples of the data set, applies PAM on each sample, and gives the best clustering as the output.

Strength: deals with larger data sets than PAM.

Weaknesses:
Efficiency depends on the sample size.
A good clustering based on samples will not necessarily represent a good clustering of the whole data set if the sample is biased.

There are other scale-up methods, e.g., CLARANS.
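A rough sketch of the CLARA idea, reusing the pam and cost sketches above (the number of samples and the sample size are illustrative parameters):

```python
def clara(points, k, num_samples=5, sample_size=40, seed=0):
    """CLARA sketch: run PAM on several random samples and keep the
    medoids that give the lowest cost on the whole data set."""
    rng = random.Random(seed)
    best_medoids, best_cost = None, float('inf')
    for _ in range(num_samples):
        sample = rng.sample(points, min(sample_size, len(points)))
        medoids, _ = pam(sample, k)
        c = cost(points, medoids)        # evaluate on the full data set
        if c < best_cost:
            best_medoids, best_cost = medoids, c
    return best_medoids, best_cost

print(clara(list(range(100)) + list(range(500, 600)), k=2))
```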


Hierarchical Clustering

Use a distance matrix for clustering. This method does not require the number of clusters k as an input, but needs a termination condition.

(Figure: objects a, b, c, d, e are merged step by step from step 0 to step 4 in agglomerative clustering; divisive clustering proceeds in the reverse direction, from step 4 back to step 0.)


Agglomerative Clustering

At the beginning, each data point forms a cluster (also called a node). Merge the nodes/clusters that have the least dissimilarity. Go on merging; eventually all nodes belong to the same cluster.

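A short sketch of agglomerative clustering using SciPy, assuming it is installed (the example 2-D points and the single-link merge criterion are illustrative choices):

```python
from scipy.cluster.hierarchy import linkage, fcluster

# Five 2-D points: two tight groups
points = [[0, 0], [0, 1], [1, 0], [8, 8], [8, 9]]

# Agglomerative clustering: repeatedly merge the least-dissimilar clusters
Z = linkage(points, method='single')          # single-link (minimum) dissimilarity

# Cut the hierarchy into 2 flat clusters
labels = fcluster(Z, t=2, criterion='maxclust')
print(labels)                                  # e.g., [1 1 1 2 2]
```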


A Dendrogram Shows How the Clusters are Merged Hierarchically

Decompose data objects into several levels of nested partitioning (a tree of clusters), called a dendrogram.

A clustering of the data objects is obtained by cutting the dendrogram at the desired level; then each connected component forms a cluster.


    Divisive Clustering

    Inverse order of agglomerative clustering

    Eventually each node forms a cluster on its own



More on Hierarchical Methods

Major weaknesses of agglomerative clustering methods:
They do not scale well: time complexity is at least O(n^2), where n is the total number of objects.
They can never undo what was done previously.

Integration of hierarchical with distance-based clustering to scale up these clustering methods:
BIRCH (1996): uses a CF-tree and incrementally adjusts the quality of sub-clusters.
CURE (1998): selects well-scattered points from the cluster and then shrinks them towards the center of the cluster by a specified fraction.


Summary

Cluster analysis groups objects based on their similarity and has wide applications.
Measures of similarity can be computed for various types of data.
Clustering algorithms can be categorized into partitioning methods, hierarchical methods, density-based methods, etc.
Clustering can also be used for outlier detection, which is useful for fraud detection.
What is the best clustering algorithm?


    Other Data Mining Methods


Sequence analysis

Market basket analysis analyzes things that happen at the same time. What about things that happen over time? E.g., if a customer buys a bed, he/she is likely to come back to buy a mattress later.

Sequential analysis needs:
a time stamp for each data record
customer identification


Sequence analysis (cont.)

The analysis shows which items come before, after, or at the same time as other items.
Sequential patterns can be used for analyzing cause and effect.

Other applications:
Finding cycles in association rules: some association rules hold strongly in certain periods of time, e.g., every Monday people buy items X and Y together.
Stock market prediction
Predicting possible failures in networks, etc.


Discovering holes in data

Holes are empty (sparse) regions in the data space that contain few or no data points. Holes may represent impossible value combinations in the application domain.

E.g., in a disease database, we may find that certain test values and/or symptoms do not go together, or that when a certain medicine is used, some test value never goes beyond a certain range.

Such information could lead to a significant discovery: a cure for a disease or some biological law.


Data and pattern visualization

Data visualization: use computer graphics effects to reveal the patterns in data: 2-D and 3-D scatter plots, bar charts, pie charts, line plots, animation, etc.

Pattern visualization: use good interfaces and graphics to present the results of data mining: rule visualizers, cluster visualizers, etc.


Scaling up data mining algorithms

Adapt data mining algorithms to work on very large databases, where the data reside on hard disk (too large to fit in main memory):
Make fewer passes over the data.
Quadratic algorithms are too expensive; many data mining algorithms are quadratic, especially clustering algorithms.