Clustering (Segmentação)
(based on Han's slides)
SAD Tagus 2004/05 H. Galhardas
Non-supervised Learning: Cluster Analysis

- What is Cluster Analysis?
- Types of Data in Cluster Analysis
- A Categorization of Major Clustering Methods
- Partitioning Methods
- Hierarchical Methods
- Summary
What is Cluster Analysis?

Cluster: a collection of data objects
- Similar to one another within the same cluster
- Dissimilar to the objects in other clusters

Cluster analysis: grouping a set of data objects into clusters.

Clustering is unsupervised classification: no predefined classes.

Typical applications:
- As a stand-alone tool to get insight into the data distribution
- As a preprocessing step for other algorithms
General Applications of Clustering

- Pattern Recognition
- Spatial Data Analysis
  - Create thematic maps in GIS by clustering feature spaces
  - Detect spatial clusters and explain them in spatial data mining
- Image Processing
- Economic Science (especially market research)
- WWW
  - Document classification
  - Cluster Weblog data to discover groups of similar access patterns
Examples of Clustering Applications

- Marketing: help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs
- Land use: identification of areas of similar land use in an earth observation database
- Insurance: identifying groups of motor insurance policy holders with a high average claim cost
- City planning: identifying groups of houses according to their house type, value, and geographical location
- Earthquake studies: observed earthquake epicenters should be clustered along continent faults
What Is Good Clustering?

A good clustering method will produce high-quality clusters with
- high intra-class similarity
- low inter-class similarity

The quality of a clustering result depends on both the similarity measure used by the method and its implementation.

The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns.
Requirements of Requirements of Clustering in Data Mining Clustering in Data Mining
ScalabilityScalability Ability to deal with different types of attributesAbility to deal with different types of attributes Discovery of clusters with arbitrary shapeDiscovery of clusters with arbitrary shape Minimal requirements for domain knowledge to determine Minimal requirements for domain knowledge to determine
input parametersinput parameters Able to deal with noise and outliersAble to deal with noise and outliers Insensitive to order of input recordsInsensitive to order of input records High dimensionalityHigh dimensionality Incorporation of user-specified constraintsIncorporation of user-specified constraints Interpretability and usabilityInterpretability and usability
Data Structures

Data matrix (two modes): n objects described by p variables,

$$\begin{bmatrix} x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\ \vdots & & \vdots & & \vdots \\ x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\ \vdots & & \vdots & & \vdots \\ x_{n1} & \cdots & x_{nf} & \cdots & x_{np} \end{bmatrix}$$

Dissimilarity matrix (one mode):

$$\begin{bmatrix} 0 & & & & \\ d(2,1) & 0 & & & \\ d(3,1) & d(3,2) & 0 & & \\ \vdots & \vdots & \vdots & \ddots & \\ d(n,1) & d(n,2) & \cdots & \cdots & 0 \end{bmatrix}$$
Measure the Quality of Clustering

Dissimilarity/similarity metric: similarity is expressed in terms of a distance function, which is typically a metric: d(i, j).

There is a separate "quality" function that measures the "goodness" of a cluster.

The definitions of distance functions are usually very different for interval-scaled, boolean, categorical, ordinal, and ratio variables.

Weights should be associated with different variables based on applications and data semantics.

It is hard to define "similar enough" or "good enough": the answer is typically highly subjective.
Types of data in clustering analysis

- Interval-scaled variables
- Binary variables
- Nominal, ordinal, and ratio variables
- Variables of mixed types
Interval-valued variables

Standardize the data:

Calculate the mean absolute deviation

$$s_f = \frac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f|\right)$$

where

$$m_f = \frac{1}{n}\left(x_{1f} + x_{2f} + \cdots + x_{nf}\right)$$

Calculate the standardized measurement (z-score)

$$z_{if} = \frac{x_{if} - m_f}{s_f}$$

Using the mean absolute deviation is more robust than using the standard deviation.
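As a quick illustration, a minimal Python sketch of this standardization (the helper name is ours, not from the slides):

```python
# Standardize one variable f using the mean absolute deviation, as defined above.
def standardize(values):
    """Return z-scores computed with the mean absolute deviation."""
    n = len(values)
    m_f = sum(values) / n                          # mean m_f
    s_f = sum(abs(x - m_f) for x in values) / n    # mean absolute deviation s_f
    return [(x - m_f) / s_f for x in values]       # z_if = (x_if - m_f) / s_f

print(standardize([2.0, 4.0, 4.0, 10.0]))  # [-1.2, -0.4, -0.4, 2.0]
```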
Similarity and Dissimilarity Between Objects

Distances are normally used to measure the similarity or dissimilarity between two data objects.

Some popular ones include the Minkowski distance:

$$d(i,j) = \sqrt[q]{|x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \cdots + |x_{ip} - x_{jp}|^q}$$

where $i = (x_{i1}, x_{i2}, \ldots, x_{ip})$ and $j = (x_{j1}, x_{j2}, \ldots, x_{jp})$ are two p-dimensional data objects, and q is a positive integer.

If q = 1, d is the Manhattan distance:

$$d(i,j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}|$$
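A small sketch of the Minkowski distance (the function name and example data are illustrative assumptions):

```python
# Minkowski distance between two p-dimensional objects i and j;
# q = 1 gives the Manhattan distance, q = 2 the Euclidean distance.
def minkowski(i, j, q=2):
    return sum(abs(a - b) ** q for a, b in zip(i, j)) ** (1 / q)

i, j = (1, 2, 3), (4, 6, 3)
print(minkowski(i, j, q=1))  # Manhattan: |1-4| + |2-6| + |3-3| = 7.0
print(minkowski(i, j, q=2))  # Euclidean: sqrt(9 + 16 + 0) = 5.0
```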
Similarity and Dissimilarity Between Objects (Cont.)

If q = 2, d is the Euclidean distance:

$$d(i,j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2}$$

Properties:
- d(i,j) ≥ 0
- d(i,i) = 0
- d(i,j) = d(j,i)
- d(i,j) ≤ d(i,k) + d(k,j)

Also, one can use a weighted distance, the parametric Pearson product-moment correlation, or other dissimilarity measures.
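The weighted variant mentioned above might look like this (the weights here are arbitrary placeholders; in practice they come from the application and data semantics):

```python
# Weighted Euclidean distance: each variable f contributes with weight w_f.
def weighted_euclidean(i, j, w):
    return sum(wf * (a - b) ** 2 for wf, a, b in zip(w, i, j)) ** 0.5

print(weighted_euclidean((1, 2), (4, 6), w=(1.0, 0.25)))  # sqrt(9 + 4) ≈ 3.61
```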
Partitioning Algorithms: Basic Concept

Partitioning method: construct a partition of a database D of n objects into a set of k clusters.

Given a k, find a partition of k clusters that optimizes the chosen partitioning criterion:
- Global optimal: exhaustively enumerate all partitions
- Heuristic methods: the k-means and k-medoids algorithms
- k-means (MacQueen'67): each cluster is represented by the center of the cluster
- k-medoids or PAM (Partitioning Around Medoids) (Kaufman & Rousseeuw'87): each cluster is represented by one of the objects in the cluster
The K-Means Clustering Method

Given k, the k-means algorithm is implemented in four steps:

1. Partition the objects into k nonempty subsets
2. Compute the seed points as the centroids of the clusters of the current partition (the centroid is the center, i.e., the mean point, of the cluster)
3. Assign each object to the cluster with the nearest seed point
4. Go back to step 2; stop when no more reassignments occur
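A compact sketch of these four steps (assuming NumPy; function and variable names are ours, this is an illustration rather than a reference implementation, and it assumes no cluster ever becomes empty):

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # step 1: initial seeds
    for _ in range(max_iter):
        # step 3: assign each object to the cluster with the nearest seed point
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # step 2: recompute each centroid as the mean point of its cluster
        new_centroids = np.array([X[labels == c].mean(axis=0) for c in range(k)])
        if np.allclose(new_centroids, centroids):              # step 4: stop when stable
            break
        centroids = new_centroids
    return labels, centroids
```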
The K-Means Clustering Method: Example

[Figure: K = 2. Arbitrarily choose K objects as the initial cluster centers; assign each object to the most similar center; update the cluster means; reassign objects; update the means and reassign again until the clusters stabilize.]
Comments on the K-Means Method

Strength:
- Relatively efficient: O(tkn), where n is the number of objects, k the number of clusters, and t the number of iterations. Normally, k, t << n.
- For comparison: PAM is O(k(n-k)^2) and CLARA is O(ks^2 + k(n-k)).

Comment:
- Often terminates at a local optimum. The global optimum may be found using techniques such as deterministic annealing and genetic algorithms.

Weaknesses:
- Applicable only when the mean is defined; what about categorical data?
- Need to specify k, the number of clusters, in advance
- Unable to handle noisy data and outliers
- Not suitable for discovering clusters with non-convex shapes
Variations of the K-Means Method

A few variants of k-means differ in:
- Selection of the initial k means
- Dissimilarity calculations
- Strategies to calculate cluster means

Handling categorical data: k-modes (Huang'98), sketched below
- Replaces the means of clusters with modes
- Uses new dissimilarity measures to deal with categorical objects
- Uses a frequency-based method to update the modes of clusters

For a mixture of categorical and numerical data: the k-prototype method
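A sketch of the two core ingredients of k-modes, simple-matching dissimilarity and frequency-based modes (names and data are illustrative; Huang'98 defines the full method):

```python
from collections import Counter

def matching_dissimilarity(x, y):
    """Number of attributes on which two categorical objects differ."""
    return sum(a != b for a, b in zip(x, y))

def mode_of(cluster):
    """Per-attribute most frequent category: the cluster's mode."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*cluster))

cluster = [("red", "small"), ("red", "large"), ("blue", "small")]
print(mode_of(cluster))                                              # ('red', 'small')
print(matching_dissimilarity(("red", "small"), ("blue", "small")))   # 1
```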
What is the Problem with the k-Means Method?

The k-means algorithm is sensitive to outliers, since an object with an extremely large value may substantially distort the distribution of the data.

K-medoids: instead of taking the mean value of the objects in a cluster as a reference point, medoids can be used; a medoid is the most centrally located object in a cluster.
The K-Medoids Clustering Method

Find representative objects, called medoids, in clusters.

PAM (Partitioning Around Medoids, 1987):
- Starts from an initial set of medoids and iteratively replaces one of the medoids by one of the non-medoids if it improves the total distance of the resulting clustering
- Works effectively for small data sets, but does not scale well to large data sets
Typical k-medoids algorithm (PAM)

[Figure: K = 2. Arbitrarily choose k objects as the initial medoids; assign each remaining object to the nearest medoid (total cost = 20); randomly select a non-medoid object O_random; compute the total cost of swapping (total cost = 26); swap O and O_random if the quality is improved; loop until no change.]
PAM (Partitioning Around Medoids) (1987)

Uses real objects to represent the clusters:

1. Select k representative objects arbitrarily
2. For each pair of a non-selected object h and a selected object i, calculate the total swapping cost TC_ih
3. For each pair of i and h: if TC_ih < 0, i is replaced by h; then assign each non-selected object to the most similar representative object
4. Repeat steps 2-3 until there is no change
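A compact PAM sketch following steps 1-4 (illustrative only; D is assumed to be a precomputed symmetric distance matrix, and real implementations avoid recomputing the total cost from scratch):

```python
import random

def pam(D, k, seed=0):
    """D: precomputed symmetric distance matrix; returns k medoid indices."""
    random.seed(seed)
    n = len(D)
    medoids = random.sample(range(n), k)                    # step 1
    def total_cost(meds):
        # each object contributes its distance to the nearest medoid
        return sum(min(D[j][m] for m in meds) for j in range(n))
    improved = True
    while improved:                                         # step 4: loop until stable
        improved = False
        for i in list(medoids):                             # step 2: each selected i ...
            for h in range(n):                              # ... against each non-selected h
                if h in medoids:
                    continue
                candidate = [h if m == i else m for m in medoids]
                if total_cost(candidate) < total_cost(medoids):  # i.e. TC_ih < 0
                    medoids = candidate                     # step 3: replace i by h
                    improved = True
                    break                                   # rescan after an improving swap
            if improved:
                break
    return medoids
```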
Cost function for k-medoids
PAM Clustering

The total swapping cost is TC_ih = Σ_j C_jih, where C_jih is the reassignment cost for a non-selected object j when medoid i is swapped with non-medoid h (t denotes another current medoid):

- j is assigned to t and stays with t: C_jih = 0
- j is assigned to i and is reassigned to h: C_jih = d(j, h) - d(j, i)
- j is assigned to i and is reassigned to another medoid t: C_jih = d(j, t) - d(j, i)
- j is assigned to t and is reassigned to h: C_jih = d(j, h) - d(j, t)
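For illustration, TC_ih can be computed in an equivalent "nearest remaining medoid" form that collapses the four cases above into one expression (a sketch in our notation, not the slides'):

```python
def swap_cost(D, medoids, i, h):
    """TC_ih: change in total cost if medoid i is swapped with non-medoid h."""
    new_medoids = [h if m == i else m for m in medoids]
    return sum(min(D[j][m] for m in new_medoids) - min(D[j][m] for m in medoids)
               for j in range(len(D)))
```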
What is the problem with PAM?

PAM is more robust than k-means in the presence of noise and outliers, because a medoid is less influenced by outliers or other extreme values than a mean.

PAM works efficiently for small data sets but does not scale well to large data sets: O(k(n-k)^2) per iteration, where n is the number of data objects and k the number of clusters.
Summary

Cluster analysis groups objects based on their similarity and has wide applications.

Measures of similarity can be computed for various types of data.

Clustering algorithms can be categorized into partitioning methods, hierarchical methods, density-based methods, grid-based methods, and model-based methods.

Outlier detection and analysis are very useful for fraud detection, etc., and can be performed by statistical, distance-based, or deviation-based approaches.

There are still many open research issues in cluster analysis, such as constraint-based clustering.
Other Classification Methods

- k-nearest neighbor classifier
- Case-based reasoning
- Genetic algorithms
- Rough set approach
- Fuzzy set approaches
Instance-Based Methods

Instance-based learning: store the training examples and delay the processing ("lazy evaluation") until a new instance must be classified.

Typical approaches:
- k-nearest neighbor approach: instances represented as points in a Euclidean space
- Locally weighted regression: constructs a local approximation
- Case-based reasoning: uses symbolic representations and knowledge-based inference
The k-Nearest Neighbor Algorithm

All instances correspond to points in the n-dimensional space.

The nearest neighbors are defined in terms of Euclidean distance.

The target function can be discrete- or real-valued.
The The kk-Nearest Neighbor -Nearest Neighbor AlgorithmAlgorithm
For discrete-valued, the For discrete-valued, the kk-NN returns the -NN returns the most common valuemost common value among the k among the k training examples nearest totraining examples nearest to xxqq. .
Vonoroi diagramVonoroi diagram: the decision surface : the decision surface induced by 1-NN for a typical set of induced by 1-NN for a typical set of training examples.training examples.
[Figure: a query point x_q surrounded by + and - training examples, with the Voronoi decision surface induced by 1-NN.]
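A minimal sketch of discrete-valued k-NN (training data and names are assumptions for illustration):

```python
from collections import Counter

def knn_classify(train, query, k=3):
    """train: list of (point, label) pairs; returns the most common label
    among the k training examples nearest (in Euclidean distance) to query."""
    def dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5
    neighbors = sorted(train, key=lambda ex: dist(ex[0], query))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

train = [((0, 0), "-"), ((1, 0), "-"), ((5, 5), "+"), ((6, 5), "+"), ((5, 6), "+")]
print(knn_classify(train, (4, 4), k=3))  # '+'
```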
Discussion (1)

The k-NN algorithm for continuous-valued target functions: calculate the mean value of the k nearest neighbors.

Distance-weighted nearest neighbor algorithm:
- Weight the contribution of each of the k neighbors according to their distance to the query point x_q, giving greater weight to closer neighbors:

$$w \equiv \frac{1}{d(x_q, x_i)^2}$$

- Similarly for real-valued target functions.
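A sketch of the distance-weighted variant (illustrative; the zero-distance guard follows the common convention of returning the coinciding training example's label):

```python
from collections import defaultdict

def weighted_knn(train, query, k=3):
    """Each of the k nearest neighbors votes with weight w = 1 / d(x_q, x_i)^2."""
    def dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5
    neighbors = sorted(train, key=lambda ex: dist(ex[0], query))[:k]
    votes = defaultdict(float)
    for point, label in neighbors:
        d = dist(point, query)
        if d == 0:
            return label                  # query coincides with a training point
        votes[label] += 1.0 / d ** 2      # w = 1 / d(x_q, x_i)^2
    return max(votes, key=votes.get)
```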
Discussion (2)

Robust to noisy data by averaging the k nearest neighbors.

Curse of dimensionality: the distance between neighbors can be dominated by irrelevant attributes. To overcome this, stretch the axes or eliminate the least relevant attributes.
Bibliography

Data Mining: Concepts and Techniques, J. Han & M. Kamber, Morgan Kaufmann, 2001 (Sect. 7.7.1 and Chap. 8)