Course on Data Mining (581550-4)
14.11.2001 Data mining: Clustering

Page 1: Course schedule

• Intro/Ass. Rules: 24./26.10.
• Episodes: 30.10.
• Text Mining: 7.11.
• Clustering: 14.11.
• KDD Process: 21.11.
• Appl./Summary: 28.11.
• Home Exam

Page 2: Today 14.11.2001

• Today's subject:
o Classification, clustering
• Next week's program:
o Lecture: Data mining process
o Exercise: Classification, clustering
o Seminar: Classification, clustering

Page 3: Classification and clustering

I. Classification and prediction
II. Clustering and similarity

Page 4: Cluster analysis: Overview

• What is cluster analysis?
• Similarity and dissimilarity
• Types of data in cluster analysis
• Major clustering methods
• Partitioning methods
• Hierarchical methods
• Outlier analysis
• Summary

Page 5: What is cluster analysis?

• Cluster: a collection of data objects
o similar to one another within the same cluster
o dissimilar to the objects in the other clusters
• Aim of clustering: to group a set of data objects into clusters

Page 6: Typical uses of clustering

• As a stand-alone tool to get insight into the data distribution
• As a preprocessing step for other algorithms

Page 7: Applications of clustering

• Marketing: discovering distinct customer groups in a purchase database
• Land use: identifying areas of similar land use in an earth observation database
• Insurance: identifying groups of motor insurance policy holders with a high average claim cost
• City planning: identifying groups of houses according to their house type, value, and geographical location

Page 8: What is good clustering?

• A good clustering method will produce high-quality clusters with
o high intra-class similarity
o low inter-class similarity
• The quality of a clustering result depends on
o the similarity measure used
o the implementation of the similarity measure
• The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns

Page 9: Requirements of clustering in data mining (1)

• Scalability
• Ability to deal with different types of attributes
• Discovery of clusters with arbitrary shape
• Minimal requirements for domain knowledge to determine input parameters

Page 10: Requirements of clustering in data mining (2)

• Ability to deal with noise and outliers
• Insensitivity to the order of input records
• High dimensionality
• Incorporation of user-specified constraints
• Interpretability and usability

Page 11: Similarity and dissimilarity between objects (1)

• There is no single definition of similarity or dissimilarity between data objects
• The definition of similarity or dissimilarity between objects depends on
o the type of the data considered
o what kind of similarity we are looking for

Page 12: Similarity and dissimilarity between objects (2)

• Similarity/dissimilarity between objects is often expressed in terms of a distance measure d(x,y)
• Ideally, every distance measure should be a metric, i.e., it should satisfy the following conditions:
1. d(x,y) >= 0
2. d(x,y) = 0 iff x = y
3. d(x,y) = d(y,x)
4. d(x,z) <= d(x,y) + d(y,z)
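The four conditions can be checked numerically; here is a small sketch for the Euclidean distance (the helper names are mine, not from the slides):

```python
import math

def euclidean(x, y):
    # Euclidean distance between two equal-length numeric tuples
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def is_metric_on(points, d, tol=1e-9):
    # Check the four metric axioms on every pair/triple drawn from `points`.
    for x in points:
        for y in points:
            if d(x, y) < -tol:                        # 1. non-negativity
                return False
            if (d(x, y) < tol) != (x == y):           # 2. d(x,y) = 0 iff x = y
                return False
            if abs(d(x, y) - d(y, x)) > tol:          # 3. symmetry
                return False
            for z in points:
                if d(x, z) > d(x, y) + d(y, z) + tol:  # 4. triangle inequality
                    return False
    return True

pts = [(0.0, 0.0), (3.0, 4.0), (1.0, -2.0), (5.0, 5.0)]
print(is_metric_on(pts, euclidean))  # True
```

This only verifies the axioms on a finite sample, of course; it is a sanity check, not a proof.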

Page 13: Types of data in cluster analysis

• Interval-scaled variables
• Binary variables
• Nominal, ordinal, and ratio variables
• Variables of mixed types
• Complex data types

Page 14: Interval-scaled variables (1)

• Continuous measurements on a roughly linear scale
• For example, weight, height, and age
• The measurement unit can affect the cluster analysis
• To avoid dependence on the measurement unit, we should standardize the data

Page 15: Interval-scaled variables (2)

To standardize the measurements:
• calculate the mean absolute deviation
  s_f = (1/n)(|x_1f - m_f| + |x_2f - m_f| + ... + |x_nf - m_f|),
  where m_f = (1/n)(x_1f + x_2f + ... + x_nf)
• calculate the standardized measurement (z-score)
  z_if = (x_if - m_f) / s_f

Page 16: Interval-scaled variables (3)

• One group of popular distance measures for interval-scaled variables are the Minkowski distances
  d(i,j) = (|x_i1 - x_j1|^q + |x_i2 - x_j2|^q + ... + |x_ip - x_jp|^q)^(1/q),
where i = (x_i1, x_i2, ..., x_ip) and j = (x_j1, x_j2, ..., x_jp) are two p-dimensional data objects, and q is a positive integer

Page 17: Interval-scaled variables (4)

• If q = 1, the distance measure is Manhattan (or city block) distance
  d(i,j) = |x_i1 - x_j1| + |x_i2 - x_j2| + ... + |x_ip - x_jp|
• If q = 2, the distance measure is Euclidean distance
  d(i,j) = sqrt(|x_i1 - x_j1|^2 + |x_i2 - x_j2|^2 + ... + |x_ip - x_jp|^2)
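The two special cases above fall out of one small Minkowski function; a sketch (names are mine):

```python
def minkowski(i, j, q):
    # Minkowski distance of order q between two p-dimensional objects
    return sum(abs(a - b) ** q for a, b in zip(i, j)) ** (1.0 / q)

i, j = (1.0, 2.0, 3.0), (4.0, 6.0, 3.0)
print(minkowski(i, j, 1))  # 7.0  -> Manhattan: 3 + 4 + 0
print(minkowski(i, j, 2))  # 5.0  -> Euclidean: sqrt(9 + 16 + 0)
```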

Page 18: Binary variables (1)

• A binary variable has only two states: 0 or 1
• A contingency table for binary data:

                   Object j
                    1     0    sum
  Object i   1      a     b    a+b
             0      c     d    c+d
           sum     a+c   b+d    p

Page 19: Binary variables (2)

• Simple matching coefficient (invariant similarity, if the binary variable is symmetric):
  d(i,j) = (b + c) / (a + b + c + d)
• Jaccard coefficient (noninvariant similarity, if the binary variable is asymmetric):
  d(i,j) = (b + c) / (a + b + c)
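Both coefficients can be computed by tallying the contingency counts a, b, c, d from two 0/1 vectors; a sketch (names are mine):

```python
def binary_counts(x, y):
    # a: both 1; b: x=1, y=0; c: x=0, y=1; d: both 0
    a = sum(1 for u, v in zip(x, y) if (u, v) == (1, 1))
    b = sum(1 for u, v in zip(x, y) if (u, v) == (1, 0))
    c = sum(1 for u, v in zip(x, y) if (u, v) == (0, 1))
    d = sum(1 for u, v in zip(x, y) if (u, v) == (0, 0))
    return a, b, c, d

def simple_matching_dist(x, y):
    a, b, c, d = binary_counts(x, y)
    return (b + c) / (a + b + c + d)   # symmetric case

def jaccard_dist(x, y):
    a, b, c, d = binary_counts(x, y)
    return (b + c) / (a + b + c)       # asymmetric case: joint 0s ignored

x = [1, 1, 0, 0, 1]
y = [1, 0, 0, 1, 1]
print(simple_matching_dist(x, y))  # 0.4  (b + c = 2, p = 5)
print(jaccard_dist(x, y))          # 0.5  (a = 2, b = 1, c = 1)
```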

Page 20: Binary variables (3)

Example: dissimilarity between binary variables:
• a patient record table
• eight attributes, of which
o gender is a symmetric attribute, and
o the remaining attributes are asymmetric binary

  Name  Gender  Fever  Cough  Test-1  Test-2  Test-3  Test-4
  Jack    M       Y      N      P       N       N       N
  Mary    F       Y      N      P       N       P       N
  Jim     M       Y      P      N       N       N       N

Page 21: Binary variables (4)

• Let the values Y and P be set to 1, and the value N be set to 0
• Compute the distances between patients based on the asymmetric variables by using the Jaccard coefficient:
  d(jack, mary) = (0 + 1) / (2 + 0 + 1) = 0.33
  d(jack, jim) = (1 + 1) / (1 + 1 + 1) = 0.67
  d(jim, mary) = (1 + 2) / (1 + 1 + 2) = 0.75
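The three slide values can be reproduced directly from the encoded patient rows; a sketch (variable names are mine):

```python
def jaccard_dist(x, y):
    # d(i,j) = (b + c) / (a + b + c) over asymmetric binary variables
    a = sum(1 for u, v in zip(x, y) if (u, v) == (1, 1))
    b = sum(1 for u, v in zip(x, y) if (u, v) == (1, 0))
    c = sum(1 for u, v in zip(x, y) if (u, v) == (0, 1))
    return (b + c) / (a + b + c)

# Y/P -> 1 and N -> 0 over Fever, Cough, Test-1 .. Test-4 (gender excluded)
jack = [1, 0, 1, 0, 0, 0]
mary = [1, 0, 1, 0, 1, 0]
jim  = [1, 1, 0, 0, 0, 0]

print(round(jaccard_dist(jack, mary), 2))  # 0.33
print(round(jaccard_dist(jack, jim), 2))   # 0.67
print(round(jaccard_dist(jim, mary), 2))   # 0.75
```

By this measure Jack and Mary are the most similar pair, which matches the intuition that they share the same positive test results.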

Page 22: Nominal variables

• A generalization of the binary variable in that it can take more than 2 states, e.g., red, yellow, blue, green
• Method 1: simple matching
  d(i,j) = (p - m) / p
o m: # of matches, p: total # of variables
• Method 2: use a large number of binary variables
o create a new binary variable for each of the M nominal states

Page 23: Ordinal variables

• An ordinal variable can be discrete or continuous
• The order of the values is important, e.g., rank
• Can be treated like interval-scaled variables:
o replace x_if by its rank r_if in {1, ..., M_f}
o map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by
  z_if = (r_if - 1) / (M_f - 1)
o compute the dissimilarity using methods for interval-scaled variables
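The rank-to-[0, 1] mapping is a one-liner; a sketch with a hypothetical encoding of medal ranks:

```python
def ordinal_to_interval(ranks, M):
    # z_if = (r_if - 1) / (M_f - 1) maps ranks {1, ..., M_f} onto [0, 1]
    return [(r - 1) / (M - 1) for r in ranks]

# e.g., bronze = 1, silver = 2, gold = 3, so M_f = 3
print(ordinal_to_interval([1, 2, 3], 3))  # [0.0, 0.5, 1.0]
```

After this mapping, the distances of pages 16 and 17 apply unchanged.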

Page 24: Ratio-scaled variables

• A positive measurement on a nonlinear scale, approximately at exponential scale
o for example, Ae^(Bt) or Ae^(-Bt)
• Methods:
o treat them like interval-scaled variables: not a good choice! (why?)
o apply a logarithmic transformation y_if = log(x_if)
o treat them as continuous ordinal data and treat their rank as interval-scaled

Page 25: Variables of mixed types (1)

• A database may contain all six types of variables
• One may use a weighted formula to combine their effects:
  d(i,j) = (sum_{f=1..p} delta_ij^(f) d_ij^(f)) / (sum_{f=1..p} delta_ij^(f)),
where delta_ij^(f) = 0 if x_if or x_jf is missing, or if x_if = x_jf = 0 and variable f is asymmetric binary; otherwise delta_ij^(f) = 1

Page 26: Variables of mixed types (2)

Contribution of variable f to the distance d(i,j):
• if f is binary or nominal: d_ij^(f) = 0 if x_if = x_jf; otherwise d_ij^(f) = 1
• if f is interval-based: use the normalized distance
• if f is ordinal or ratio-scaled:
o compute the ranks r_if and z_if = (r_if - 1) / (M_f - 1)
o treat z_if as interval-scaled
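The weighted formula and the per-type contributions can be combined in one function; a sketch under my own assumptions (the `types` labels are mine, `None` marks a missing value, and interval variables are assumed to be pre-normalized to [0, 1]):

```python
def mixed_dissimilarity(i, j, types):
    # d(i,j) = sum_f delta^(f) * d^(f) / sum_f delta^(f)
    # types[f] is 'nominal', 'asymmetric_binary', or 'interval'
    num = den = 0.0
    for x, y, t in zip(i, j, types):
        if x is None or y is None:              # delta = 0: missing value
            continue
        if t == 'asymmetric_binary' and x == y == 0:
            continue                            # delta = 0: joint absence ignored
        if t == 'interval':
            d = abs(x - y)                      # normalized interval distance
        else:                                   # binary or nominal: match/mismatch
            d = 0.0 if x == y else 1.0
        num += d
        den += 1.0
    return num / den

i = [1, 'red', 0.2, None]
j = [0, 'red', 0.6, 0.9]
types = ['asymmetric_binary', 'nominal', 'interval', 'interval']
print(round(mixed_dissimilarity(i, j, types), 3))  # 0.467 = (1 + 0 + 0.4) / 3
```

The fourth variable contributes nothing because one of its values is missing.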

Page 27: Complex data types

• Not all objects considered in data mining are relational => complex types of data
o examples of such data are spatial data, multimedia data, genetic data, time-series data, text data, and data collected from the World-Wide Web
• These often require totally different similarity or dissimilarity measures than the ones above
o this can, for example, mean using string and/or sequence matching, or methods of information retrieval

Page 28: Major clustering methods

• Partitioning methods
• Hierarchical methods
• Density-based methods
• Grid-based methods
• Model-based methods (conceptual clustering, neural networks)

Page 29: Partitioning methods

• A partitioning method: construct a partition of a database D of n objects into a set of k clusters such that
o each cluster contains at least one object
o each object belongs to exactly one cluster
• Given a k, find a partition of k clusters that optimizes the chosen partitioning criterion

Page 30: Criteria for judging the quality of partitions

• Global optimum: exhaustively enumerate all partitions
• Heuristic methods:
o k-means (MacQueen'67): each cluster is represented by the center of the cluster (centroid)
o k-medoids (Kaufman & Rousseeuw'87): each cluster is represented by one of the objects in the cluster (medoid)

Page 31: K-means clustering method (1)

• Input to the algorithm: the number of clusters k, and a database of n objects
• The algorithm consists of four steps:
1. partition the objects into k nonempty subsets/clusters
2. compute a seed point as the centroid (the mean of the objects in the cluster) for each cluster in the current partition
3. assign each object to the cluster with the nearest centroid
4. go back to Step 2; stop when there are no more new assignments

Page 32: K-means clustering method (2)

An alternative algorithm also consists of four steps:
1. arbitrarily choose k objects as the initial cluster centers (centroids)
2. (re)assign each object to the cluster with the nearest centroid
3. update the centroids
4. go back to Step 2; stop when there are no more new assignments
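The four steps of this alternative algorithm can be sketched in a few lines of Python (function names, the toy data, and the squared-distance shortcut are mine, not the slides'):

```python
import random

def dist2(p, q):
    # squared Euclidean distance (enough for nearest-centroid comparisons)
    return sum((a - b) ** 2 for a, b in zip(p, q))

def centroid(cluster):
    # mean of the objects in a cluster, dimension by dimension
    n = len(cluster)
    return tuple(sum(p[d] for p in cluster) / n for d in range(len(cluster[0])))

def kmeans(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)                  # step 1: initial centers
    clusters = []
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:                               # step 2: (re)assign
            nearest = min(range(k), key=lambda c: dist2(p, centroids[c]))
            clusters[nearest].append(p)
        new = [centroid(cl) if cl else centroids[c]    # step 3: update centroids
               for c, cl in enumerate(clusters)]
        if new == centroids:                           # step 4: stop when stable
            break
        centroids = new
    return centroids, clusters

points = [(1.0, 1.0), (1.5, 2.0), (1.0, 2.0),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, clusters = kmeans(points, k=2)
print(sorted(len(cl) for cl in clusters))  # [3, 3]
```

On this well-separated toy data the loop converges to the two obvious groups regardless of which objects are drawn as initial centers; on harder data the result depends on the initialization, which is exactly the local-optimum weakness discussed on page 34.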

Page 33: K-means clustering method: Example

[Figure: four scatter plots on 0-10 axes showing the same set of objects through successive k-means iterations.]

Page 34: Strengths of the K-means clustering method

• Relatively scalable in processing large data sets
• Relatively efficient: O(tkn), where n is the number of objects, k the number of clusters, and t the number of iterations; normally, k, t << n
• Often terminates at a local optimum; the global optimum may be found using techniques such as genetic algorithms

Page 35: Weaknesses of the K-means clustering method

• Applicable only when the mean of objects is defined
• Need to specify k, the number of clusters, in advance
• Unable to handle noisy data and outliers
• Not suitable for discovering clusters with non-convex shapes, or clusters of very different size

Page 36: Variations of the K-means clustering method (1)

• A few variants of k-means differ in
o the selection of the initial k centroids
o dissimilarity calculations
o strategies for calculating cluster centroids

Page 37: Variations of the K-means clustering method (2)

• Handling categorical data: k-modes (Huang'98)
o replacing the means of clusters with modes
o using new dissimilarity measures to deal with categorical objects
o using a frequency-based method to update the modes of clusters
• A mixture of categorical and numerical data: the k-prototype method

Page 38: K-medoids clustering method

• Input to the algorithm: the number of clusters k, and a database of n objects
• The algorithm consists of four steps:
1. arbitrarily choose k objects as the initial medoids (representative objects)
2. assign each remaining object to the cluster with the nearest medoid
3. select a nonmedoid and replace one of the medoids with it if this improves the clustering
4. go back to Step 2; stop when there are no more new assignments

Page 39: Hierarchical methods

• A hierarchical method: construct a hierarchy of clusterings, not just a single partition of the objects
• The number of clusters k is not required as an input
• Uses a distance matrix as the clustering criterion
• A termination condition can be used (e.g., a desired number of clusters)

Page 40: A tree of clusterings

• The hierarchy of clusterings is often given as a clustering tree, also called a dendrogram
o leaves of the tree represent the individual objects
o internal nodes of the tree represent the clusters

Page 41: Two types of hierarchical methods (1)

Two main types of hierarchical clustering techniques:
• agglomerative (bottom-up):
o place each object in its own cluster (a singleton)
o merge, in each step, the two most similar clusters until there is only one cluster left or the termination condition is satisfied
• divisive (top-down):
o start with one big cluster containing all the objects
o divide the most distinctive cluster into smaller clusters and proceed until there are n clusters or the termination condition is satisfied

Page 42: Two types of hierarchical methods (2)

[Figure: objects a, b, c, d, e are merged step by step (Step 0 to Step 4) into clusters ab, de, cde, and finally abcde in the agglomerative direction; the divisive direction runs the same tree in reverse, from Step 4 back to Step 0.]

Page 43: Inter-cluster distances

• Three widely used ways of defining the inter-cluster distance, i.e., the distance between two separate clusters C_i and C_j, are
o single linkage method (nearest neighbor):
  d(i,j) = min_{x in C_i, y in C_j} d(x,y)
o complete linkage method (furthest neighbor):
  d(i,j) = max_{x in C_i, y in C_j} d(x,y)
o average linkage method (unweighted pair-group average):
  d(i,j) = avg_{x in C_i, y in C_j} d(x,y)
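The single linkage definition plugs directly into a naive agglomerative loop; a deliberately simple sketch (names are mine, and the brute-force pair search is for illustration, not efficiency):

```python
def single_link(ci, cj, d):
    # nearest-neighbor inter-cluster distance: min over all cross pairs
    return min(d(x, y) for x in ci for y in cj)

def agglomerate(points, d, k):
    # repeatedly merge the two clusters at the smallest single-linkage
    # distance until only k clusters remain
    clusters = [[p] for p in points]           # start from singletons
    while len(clusters) > k:
        i, j = min(
            ((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
            key=lambda ij: single_link(clusters[ij[0]], clusters[ij[1]], d),
        )
        clusters[i].extend(clusters.pop(j))    # merge the closest pair
    return clusters

d = lambda x, y: abs(x - y)
print(agglomerate([1.0, 1.2, 5.0, 5.1, 9.0], d, k=3))
# [[1.0, 1.2], [5.0, 5.1], [9.0]]
```

Swapping `min` for `max` inside `single_link` gives complete linkage; averaging the cross-pair distances gives average linkage.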

Page 44: Strengths of hierarchical methods

• Conceptually simple
• Theoretical properties are well understood
• When clusters are merged/split, the decision is permanent => the number of different alternatives that need to be examined is reduced

Page 45: Weaknesses of hierarchical methods

• Merging/splitting of clusters is permanent => erroneous decisions are impossible to correct later
• Divisive methods can be computationally hard
• The methods are not (necessarily) scalable for large data sets

Page 46: Outlier analysis (1)

• Outliers
o are objects that are considerably dissimilar from the remainder of the data
o can be caused by a measurement or execution error, or
o are the result of inherent data variability
• Many data mining algorithms try
o to minimize the influence of outliers, or
o to eliminate the outliers

Page 47: Outlier analysis (2)

• Minimizing the effect of outliers and/or eliminating the outliers may cause information loss
• Outliers themselves may be of interest => outlier mining
• Applications of outlier mining:
o fraud detection
o customized marketing
o medical treatments
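The slides do not fix a concrete outlier-mining technique; as one illustration, a simple distance-based notion in the spirit of the Knorr & Ng entry on the reading list flags objects that have too few neighbors within a given radius (parameter names and the toy data are mine):

```python
def distance_based_outliers(points, d, radius, min_neighbors):
    # an object is flagged as an outlier if fewer than `min_neighbors`
    # other objects lie within distance `radius` of it
    outliers = []
    for i, p in enumerate(points):
        neighbors = sum(1 for j, q in enumerate(points)
                        if j != i and d(p, q) <= radius)
        if neighbors < min_neighbors:
            outliers.append(p)
    return outliers

d = lambda x, y: abs(x - y)
data = [1.0, 1.1, 1.2, 1.3, 9.5]
print(distance_based_outliers(data, d, radius=0.5, min_neighbors=2))  # [9.5]
```

Note the tension mentioned above: increasing `radius` or lowering `min_neighbors` hides outliers, while overly strict settings flag ordinary objects.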

Page 48: Summary (1)

• Cluster analysis groups objects based on their similarity
• Cluster analysis has wide applications
• A measure of similarity can be computed for various types of data
• The selection of the similarity measure depends on the data used and the type of similarity we are searching for

Page 49: Summary (2)

• Clustering algorithms can be categorized into
o partitioning methods,
o hierarchical methods,
o density-based methods,
o grid-based methods, and
o model-based methods
• There are still lots of research issues in cluster analysis

Page 50: Seminar Presentations / Groups 7-8

Classification of spatial data
K. Koperski, J. Han, N. Stefanovic: "An Efficient Two-Step Method of Classification of Spatial Data", SDH'98

Page 51: Seminar Presentations / Groups 7-8

WEBSOM
K. Lagus, T. Honkela, S. Kaski, T. Kohonen: "Self-organizing Maps of Document Collections: A New Approach to Interactive Exploration", KDD'96
T. Honkela, S. Kaski, K. Lagus, T. Kohonen: "WEBSOM – Self-Organizing Maps of Document Collections", WSOM'97

Page 52: Course on Data Mining

Thanks to Jiawei Han from Simon Fraser University for his slides, which greatly helped in preparing this lecture!

Page 53: References - clustering

• R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic subspace clustering of high dimensional data for data mining applications. SIGMOD'98

• M. R. Anderberg. Cluster Analysis for Applications. Academic Press, 1973.

• M. Ankerst, M. Breunig, H.-P. Kriegel, and J. Sander. OPTICS: ordering points to identify the clustering structure. SIGMOD'99.

• P. Arabie, L. J. Hubert, and G. De Soete. Clustering and Classification. World Scientific, 1996.

• M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases. KDD'96.

• M. Ester, H.-P. Kriegel, and X. Xu. Knowledge discovery in large spatial databases: Focusing techniques for efficient class identification. SSD'95.

• D. Fisher. Knowledge acquisition via incremental conceptual clustering. Machine Learning, 2:139-172, 1987.

• D. Gibson, J. Kleinberg, and P. Raghavan. Clustering categorical data: An approach based on dynamic systems. In Proc. VLDB’98.

Page 54: References - clustering (continued)

• S. Guha, R. Rastogi, and K. Shim. Cure: An efficient clustering algorithm for large databases. SIGMOD'98.

• A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice Hall, 1988.

• L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: an Introduction to Cluster Analysis. John Wiley & Sons, 1990.

• E. Knorr and R. Ng. Algorithms for mining distance-based outliers in large datasets. VLDB’98.

• G. J. McLachlan and K. E. Basford. Mixture Models: Inference and Applications to Clustering. John Wiley and Sons, 1988.

• P. Michaud. Clustering techniques. Future Generation Computer systems, 13, 1997.

• R. Ng and J. Han. Efficient and effective clustering method for spatial data mining. VLDB'94.

• E. Schikuta. Grid clustering: An efficient hierarchical clustering method for very large data sets. Proc. 1996 Int. Conf. on Pattern Recognition, 101-105.

• G. Sheikholeslami, S. Chatterjee, and A. Zhang. WaveCluster: A multi-resolution clustering approach for very large spatial databases. VLDB’98.

• W. Wang, J. Yang, and R. Muntz. STING: a statistical information grid approach to spatial data mining. VLDB'97.

• T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH : an efficient data clustering method for very large databases. SIGMOD'96.