Classification & Clustering

Transcript of Classification & Clustering

Page 1: Classification & Clustering

Computer Science

Universiteit Maastricht

Institute for Knowledge and Agent Technology

Classification & Clustering

Pieter Spronck
http://www.cs.unimaas.nl/p.spronck

Page 2: Classification & Clustering

19 Apr 2023

Binary Division of Marbles

Page 3: Classification & Clustering

Big vs. Small

Page 4: Classification & Clustering

Transparent vs. Opaque

Page 5: Classification & Clustering

Marble Attributes

Size (big vs. small)
Transparency (transparent vs. opaque)
Shininess (shiny vs. dull)
Colouring (monochrome vs. polychrome)
Colour (blue, green, yellow, …)
…

Page 6: Classification & Clustering

Grouping of Marbles

Page 7: Classification & Clustering

“Marbles”

Page 8: Classification & Clustering

“Honouring All Distinctions”

Page 9: Classification & Clustering

“Colour Coding”

Page 10: Classification & Clustering

“Natural Grouping”

if transparent then
    if coloured glass then group 1
    else group 3
else group 2
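The grouping rule on this slide can be written as a small function. This is only a sketch: the function name `natural_group` and the boolean encoding of the two attributes are assumptions made for illustration, not part of the slides.

```python
def natural_group(transparent: bool, coloured_glass: bool) -> int:
    """Return the group number (1, 2 or 3) for a marble.

    Follows the nested rule from the slide:
    transparent + coloured glass -> group 1
    transparent + clear glass    -> group 3
    opaque                       -> group 2
    """
    if transparent:
        return 1 if coloured_glass else 3
    return 2


# Hypothetical marbles, encoded as (transparent, coloured_glass).
marbles = [(True, True), (True, False), (False, False)]
groups = [natural_group(t, c) for t, c in marbles]
```

For an opaque marble the glass colour is never inspected, which is why the rule needs only three groups rather than four.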

Page 11: Classification & Clustering

Types of Clusters

Uniquely classifying clusters
Overlapping clusters
Probabilistic clusters
Dendrograms

Page 12: Classification & Clustering

Uniquely Classifying Clusters

Page 13: Classification & Clustering

Overlapping Clusters

Page 14: Classification & Clustering

Probabilistic Clustering

Cluster  Green  Blue  Typical samples
1        1.0    0.0   (image)
2        0.0    1.0   (image)
3        0.1    0.9   (image)
4        0.5    0.5   (image)

Page 15: Classification & Clustering

Dendrogram

[Figure: dendrogram of the marbles; branch labels: opaque, transparent, not clear, clear]

Page 16: Classification & Clustering

Classification

Ordering of entities into groups based on their similarity
Minimisation of within-group variance
Maximisation of between-group variance
Exhaustive and exclusive
Principal technique: clustering
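The two variance criteria above can be made concrete with a short calculation. The sketch below assumes a single numeric attribute (for instance a marble diameter); the function name and the sample values are hypothetical. It splits the total variance into a within-group part and a between-group part, so lowering one raises the other.

```python
from statistics import mean, pvariance


def within_between_variance(groups):
    """Split the population variance of all values into two parts.

    groups: list of non-empty lists of numbers (one list per group).
    Returns (within, between); their sum equals the total variance.
    """
    values = [x for g in groups for x in g]
    n = len(values)
    grand_mean = mean(values)
    # Spread of each value around the mean of its own group.
    within = sum((x - mean(g)) ** 2 for g in groups for x in g) / n
    # Spread of the group means around the overall mean, weighted by size.
    between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups) / n
    return within, between


# Hypothetical diameters: the same six values, grouped in two different ways.
good = [[1.0, 2.0, 3.0], [11.0, 12.0, 13.0]]
poor = [[1.0, 12.0, 3.0], [11.0, 2.0, 13.0]]

w_good, b_good = within_between_variance(good)
w_poor, b_poor = within_between_variance(poor)
```

The grouping called `good` has the lower within-group variance and the higher between-group variance, which is what the classification criteria ask for.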

Page 17: Classification & Clustering

Reasons for Classification

Descriptive power
Parsimony
Maintainability
Versatility
Identification of distinctive attributes

Page 18: Classification & Clustering

Typology vs. Taxonomy

Typology – conceptual
Taxonomy – empirical

Page 19: Classification & Clustering

Typology

Define conceptual attributes
Select appropriate attributes
Create typology matrix (substruction)
Insert empirical entities in matrix
Extend matrix if necessary
Reduce matrix if necessary
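Creating the typology matrix amounts to enumerating every combination of attribute values. The sketch below illustrates that step under assumptions: the dictionary layout and the name `typology_matrix` are hypothetical, and the attribute values are taken from the marble example.

```python
from itertools import product


def typology_matrix(attributes):
    """Return one cell per combination of attribute values.

    attributes: dict mapping an attribute name to its list of values.
    Each cell is a dict mapping attribute name -> chosen value.
    """
    names = list(attributes)
    return [dict(zip(names, combo)) for combo in product(*attributes.values())]


# Two conceptual attributes give a 2 x 2 matrix.
attributes = {
    "transparency": ["opaque", "transparent"],
    "colouring": ["monochrome", "polychrome"],
}
cells = typology_matrix(attributes)

# Extending the matrix with a third attribute doubles the number of cells.
extended = typology_matrix({**attributes, "size": ["big", "small"]})
```

Every added attribute multiplies the number of cells, which is why the procedure ends with a reduction step.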

Page 20: Classification & Clustering

Defining Conceptual Attributes

Meaningful
Focus on ideal types
Order of importance
Exhaustive domains

Page 21: Classification & Clustering

Conceptual Marble Attributes

Page 22: Classification & Clustering

Typology Matrix

              Transparency
Colouring     Opaque    Transparent
Monochrome
Polychrome

Page 23: Classification & Clustering

Matrix Extension

            Transparency:  Transparent           Opaque
            Glass:         Clear    Not clear    Clear    Not clear
Colouring   Size
Monochrome  Big
Monochrome  Small
Polychrome  Big
Polychrome  Small

Page 24: Classification & Clustering

Reduction

Functional reduction
Pragmatic reduction
Numerical reduction
Reduction by using criterion types

Page 25: Classification & Clustering

Functional Reduction

            Transparency:  Transparent           Opaque
            Glass:         Clear    Not clear    Clear    Not clear
Colouring   Size
Monochrome  Big
Monochrome  Small
Polychrome  Big
Polychrome  Small

Page 26: Classification & Clustering

Functionally Reduced Matrix

            Transparency:  Transparent           Opaque
            Glass:         Clear    Not clear
Colouring   Size
Monochrome  Big
Monochrome  Small
Polychrome  Big
Polychrome  Small

Page 27: Classification & Clustering

Pragmatic Reduction

            Transparency:  Transparent           Opaque
            Glass:         Clear    Not clear    Clear    Not clear
Colouring   Size
Monochrome  Big
Monochrome  Small
Polychrome  Big
Polychrome  Small

Page 28: Classification & Clustering

Pragmatically Reduced Matrix

        Transparency:  Transparent           Opaque
        Glass:         Clear    Not clear
Size    Colouring
Small   Monochrome
Small   Polychrome
Big

Page 29: Classification & Clustering

Criticising Typological Classification

Reification
Resilience
Problematic attribute selection
Unmanageability

Page 30: Classification & Clustering

Taxonomy

Define empirical attributes
Select appropriate attributes
Create entity matrix
Apply clustering technique
Analyse clusters

Page 31: Classification & Clustering

Empirical Attributes

Big
Single colour
Lots of colours
Green glass
Transparent
Blue
Yellow
White
Dull
Shiny

Page 32: Classification & Clustering

Selecting Attributes

Size (big/small)
Colour (yellow, green, blue, red, white…)
Colouring (monochrome/polychrome)
Shininess (shiny/dull)
Transparency (transparent/opaque)
Glass colour (clear, green, …)

Page 33: Classification & Clustering

Entity Matrix

Big  Monochrome  Shiny  Transparent      Big  Monochrome  Shiny  Transparent
N    Y           Y      Y                N    Y           Y      N
N    Y           Y      Y                N    Y           Y      N
N    Y           Y      Y                N    Y           Y      N
N    Y           Y      Y                N    Y           Y      N
N    Y           Y      Y                N    Y           Y      N
N    N           N      N                N    N           Y      Y
Y    N           N      N                Y    N           Y      Y
Y    Y           Y      N

Page 34: Classification & Clustering

Automatic Clustering Parameters

Agglomerative vs. divisive
Monothetic vs. polythetic
Outliers permitted
Limits to number of clusters
Form of linkage (single, complete, average)
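The three forms of linkage differ only in how pairwise distances between two clusters are combined. The sketch below assumes marbles are written as Y/N strings over the four attributes (big, monochrome, shiny, transparent) and that distance is the number of attributes on which two marbles differ. That distance measure and both function names are assumptions for illustration.

```python
def hamming(a, b):
    """Number of attributes on which two Y/N strings differ."""
    return sum(x != y for x, y in zip(a, b))


def linkage_distance(cluster_a, cluster_b, form="single"):
    """Distance between two clusters under the given form of linkage.

    single:   distance between the closest pair of members
    complete: distance between the farthest pair of members
    average:  mean distance over all pairs of members
    """
    distances = [hamming(a, b) for a in cluster_a for b in cluster_b]
    if form == "single":
        return min(distances)
    if form == "complete":
        return max(distances)
    if form == "average":
        return sum(distances) / len(distances)
    raise ValueError(f"unknown form of linkage: {form}")


# Two hypothetical clusters of marbles.
shiny_marbles = ["NYYY", "NYYN"]
dull_marbles = ["NNNN", "YNNN"]

single = linkage_distance(shiny_marbles, dull_marbles, "single")
complete = linkage_distance(shiny_marbles, dull_marbles, "complete")
average = linkage_distance(shiny_marbles, dull_marbles, "average")
```

Single linkage is always the smallest of the three values and complete linkage the largest, so the chosen form affects which clusters an agglomerative method merges first.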

Page 35: Classification & Clustering

Automatic Clustering

NYYY  small, monochrome, shiny, transparent
*NNN  polychrome, dull, opaque
*NYY  polychrome, shiny, transparent
YYYN  big, monochrome, shiny, opaque
NYYN  small, monochrome, shiny, opaque

Page 36: Classification & Clustering

Polythetic to Monothetic

NYYY  small, monochrome, shiny, transparent
*NNN  polychrome, dull, opaque
*NYY  polychrome, shiny, transparent
NYYN  small, monochrome, shiny, opaque
*YYN  monochrome, shiny, opaque

Page 37: Classification & Clustering

Analysing Clusters

“Vanilla”

“Stone”

“Tiger”

“Classic”

small, monochrome, shiny, transparent

polychrome, dull, opaque

polychrome, shiny, transparent

small, monochrome, shiny, opaque

Page 38: Classification & Clustering

Criticising Taxonomical Classification

Dependent on specimens
Difficult to generalise
Difficult to label
Biased towards academic discipline
Not the “last word”

Page 39: Classification & Clustering

Typology vs. Taxonomy

Typology                                          Taxonomy
Conceptual                                        Empirical
Subjective                                        Objective
Manual                                            (Mostly) automatic
Less discriminative                               More discriminative
Goes awry when there are insufficient insights    Goes awry when there are insufficient specimens

Page 40: Classification & Clustering

Operational Classification

Typology (conceptual)

Taxonomy (empirical)

Operational typology (conceptual + empirical)

Page 41: Classification & Clustering

Automated Clustering Methods

Iterative distance-based clustering: the k-means method
Incremental clustering: the Cobweb method
Probability-based clustering: the EM algorithm

Page 42: Classification & Clustering

k-Means Method

Iterative distance-based clustering
Divisive
Polythetic
Predefined number of clusters (k)
Outliers permitted
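A minimal sketch of the k-means method follows. It assumes that each marble is encoded as a tuple of 0/1 values for (big, monochrome, shiny, transparent) and that squared Euclidean distance is used. The encoding, the function names and the sample data are hypothetical. Each pass assigns every marble to the nearest cluster average and then recomputes the averages, stopping when the assignment no longer changes.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def k_means(points, k, seed=0, max_passes=100):
    """Cluster points into k groups; returns (assignment, centroids)."""
    rng = random.Random(seed)
    # Start from k randomly chosen points as initial cluster averages.
    centroids = rng.sample(points, k)
    assignment = None
    for _ in range(max_passes):
        # Assign each point to its nearest cluster average.
        new_assignment = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        if new_assignment == assignment:
            break  # No point changed cluster: converged.
        assignment = new_assignment
        # Recompute each cluster average from its current members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # An empty cluster keeps its old average.
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return assignment, centroids


# Hypothetical marbles: (big, monochrome, shiny, transparent) as 0/1.
marbles = [(0, 1, 1, 1)] * 3 + [(1, 0, 0, 0)] * 3
assignment, centroids = k_means(marbles, k=2)
```

The result depends on the randomly chosen starting points, so different seeds can give different clusterings on less clearly separated data.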

Page 43: Classification & Clustering

k-Means (pass 1)

k = 2
attributes: size (big/small), colouring (monochrome/polychrome), shininess (shiny/dull), transparency (transparent/opaque)

Page 44: Classification & Clustering

k-Means (pass 2)

Cluster average: small, monochrome, shiny, transparent

Cluster average: small, polychrome, dull, opaque

k = 2
attributes: size (big/small), colouring (monochrome/polychrome), shininess (shiny/dull), transparency (transparent/opaque)

Page 45: Classification & Clustering

k-Means (pass 3)

Cluster average: small, monochrome, shiny, transparent

Cluster average: big, polychrome, dull, opaque

k = 2
attributes: size (big/small), colouring (monochrome/polychrome), shininess (shiny/dull), transparency (transparent/opaque)

Page 46: Classification & Clustering

Cobweb Algorithm

Incremental clustering
Agglomerative
Polythetic
Dynamic number of clusters
Outliers permitted

Page 47: Classification & Clustering

Cobweb Procedure

Builds a tree by adding instances to it
Uses a Category Utility function to determine the quality of the clustering
Changes the tree structure if this positively influences the Category Utility (by merging nodes or splitting nodes)
“Cutoff” value may be used to group sufficiently similar instances together

Page 48: Classification & Clustering

Category Utility

Measure for quality of clustering
The better the predictive value of the average attribute values of the instances in the clusters for the individual attribute values, the higher the CU will be

$$
CU(C_1, \ldots, C_k) = \frac{\sum_{\ell} \Pr[C_\ell] \sum_{i} \sum_{j} \left( \Pr[a_i = v_{ij} \mid C_\ell]^2 - \Pr[a_i = v_{ij}]^2 \right)}{k}
$$

Page 49: Classification & Clustering

Category Utility for “Size” (1)

[Figure: marbles divided into clusters C1 and C2]

a) Pr[size=big|C1] = 1/3
b) Pr[size=big|C2] = 1/3
c) Pr[size=big] = 1/3
d) Pr[C1] = 1/2
e) Pr[size=small|C1] = 2/3
f) Pr[size=small|C2] = 2/3
g) Pr[size=small] = 2/3
h) Pr[C2] = 1/2

CU = (d((a² – c²) + (e² – g²)) + h((b² – c²) + (f² – g²)))/2 = 0

Page 50: Classification & Clustering

Category Utility for “Size” (2)

[Figure: marbles divided into clusters C1 and C2]

a) Pr[size=big|C1] = 2/3
b) Pr[size=big|C2] = 0
c) Pr[size=big] = 1/3
d) Pr[C1] = 1/2
e) Pr[size=small|C1] = 1/3
f) Pr[size=small|C2] = 1
g) Pr[size=small] = 2/3
h) Pr[C2] = 1/2

CU = (d((a² – c²) + (e² – g²)) + h((b² – c²) + (f² – g²)))/2 =
((1/2)((1/3) + (–1/3)) + (1/2)((–1/9) + (5/9)))/2 = 1/9

Page 51: Classification & Clustering

Category Utility for “Size” (3)

[Figure: marbles divided into clusters C1 and C2]

a) Pr[size=big|C1] = 1
b) Pr[size=big|C2] = 0
c) Pr[size=big] = 1/3
d) Pr[C1] = 1/3
e) Pr[size=small|C1] = 0
f) Pr[size=small|C2] = 1
g) Pr[size=small] = 2/3
h) Pr[C2] = 2/3

CU = (d((a² – c²) + (e² – g²)) + h((b² – c²) + (f² – g²)))/2 =
((1/3)((8/9) + (–4/9)) + (2/3)((–1/9) + (5/9)))/2 = 2/9
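The three “Size” calculations can be checked with a short script. This is a sketch under assumptions: the function name `category_utility` and the data layout (each cluster a list of tuples of nominal attribute values) are hypothetical, and the marble sets are reconstructed from the probabilities on the slides (six marbles, two of them big). Exact fractions are used so the results can be compared with the slide values directly.

```python
from collections import Counter
from fractions import Fraction


def category_utility(clusters):
    """Category utility of a clustering of nominal instances.

    clusters: list of non-empty clusters; each cluster is a list of
    instances; each instance is a tuple of attribute values.
    """
    instances = [x for cluster in clusters for x in cluster]
    n = len(instances)
    n_attrs = len(instances[0])
    # Overall value counts per attribute, for Pr[a_i = v_ij].
    overall = [Counter(x[i] for x in instances) for i in range(n_attrs)]
    total = Fraction(0)
    for cluster in clusters:
        p_cluster = Fraction(len(cluster), n)
        gain = Fraction(0)
        for i in range(n_attrs):
            within = Counter(x[i] for x in cluster)
            for value, count in overall[i].items():
                p_cond = Fraction(within[value], len(cluster))
                p_all = Fraction(count, n)
                gain += p_cond ** 2 - p_all ** 2
        total += p_cluster * gain
    return total / len(clusters)


big, small = ("big",), ("small",)
cu_1 = category_utility([[big, small, small], [big, small, small]])
cu_2 = category_utility([[big, big, small], [small, small, small]])
cu_3 = category_utility([[big, big], [small, small, small, small]])
```

With these inputs `cu_1`, `cu_2` and `cu_3` come out as 0, 1/9 and 2/9: the clustering that separates big from small marbles completely has the highest category utility.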

Page 52: Classification & Clustering

Cobweb Example

attributes: size (big/small), colouring (monochrome/polychrome), shininess (shiny/dull), transparency (transparent/opaque)

Page 53: Classification & Clustering

Cobweb Result Example

attributes: size (big/small), colouring (monochrome/polychrome), shininess (shiny/dull), transparency (transparent/opaque)

Page 54: Classification & Clustering

Cobweb Numerical

Probability of values of attributes of instances in a cluster is based on the standard deviation from the estimate for the mean value
Acuity is the presumed variance in attribute values

Page 55: Classification & Clustering

Disadvantages of Previous Methods

Fast and hard to judge
Dependent on initial setup
Ad-hoc limitations
Hard to escape from local minima

Page 56: Classification & Clustering

Probability-based Clustering

Finite mixture models
Each cluster is defined by a vector of probabilities for instances to have certain values for their attributes, and a probability for instances to reside in the cluster
Clustering equals searching for optimal sets of probabilities for a sample set

Page 57: Classification & Clustering

Expectation-Maximisation (EM)

Probability-based clustering
Divisive
Polythetic
Predefined number of clusters (k)
Outliers permitted

Page 58: Classification & Clustering

EM Procedure

Select k cluster vectors randomly
Calculate cluster probabilities for each instance (under the assumption that the instance attributes are independent)
Use calculations to re-estimate values
Repeat until increase in quality becomes negligible
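The four steps above can be sketched for binary attributes as follows. Several details are assumptions rather than part of the slides: marbles are encoded as 0/1 tuples, “quality” is measured as the log-likelihood of the data, the starting probabilities are drawn from the range 0.25 to 0.75, and estimates are clamped away from 0 and 1 to avoid zero probabilities. The name `em_bernoulli` and the sample data are hypothetical.

```python
import math
import random


def em_bernoulli(data, k, max_passes=100, tol=1e-6, seed=0):
    """EM for a mixture of k clusters over independent binary attributes.

    Returns (weights, probs, resp): cluster probabilities, per-cluster
    attribute probabilities, and per-instance cluster memberships.
    """
    rng = random.Random(seed)
    n, n_attrs = len(data), len(data[0])
    # Step 1: select k cluster vectors randomly.
    weights = [1.0 / k] * k
    probs = [[rng.uniform(0.25, 0.75) for _ in range(n_attrs)] for _ in range(k)]
    previous = -math.inf
    resp = []
    for _ in range(max_passes):
        # Step 2: cluster probabilities for each instance,
        # treating the attributes as independent.
        resp, loglik = [], 0.0
        for x in data:
            joint = []
            for c in range(k):
                p = weights[c]
                for xi, pi in zip(x, probs[c]):
                    p *= pi if xi else 1.0 - pi
                joint.append(p)
            total = sum(joint)
            loglik += math.log(total)
            resp.append([j / total for j in joint])
        # Step 4: stop when the increase in quality becomes negligible.
        if loglik - previous < tol:
            break
        previous = loglik
        # Step 3: re-estimate the cluster vectors from the memberships.
        for c in range(k):
            mass = sum(r[c] for r in resp)
            weights[c] = mass / n
            for i in range(n_attrs):
                est = sum(r[c] * x[i] for r, x in zip(resp, data)) / mass
                probs[c][i] = min(max(est, 1e-6), 1.0 - 1e-6)
    return weights, probs, resp


# Hypothetical marbles: (big, monochrome, shiny, transparent) as 0/1.
marbles = [(0, 1, 1, 1)] * 5 + [(1, 0, 0, 0)] * 3
weights, probs, resp = em_bernoulli(marbles, k=2)
```

Unlike k-means, each marble receives a probability of membership in every cluster rather than a single hard assignment.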

Page 59: Classification & Clustering

EM Result Example

Cluster  p(C)  p(big)  p(monochrome)  p(shiny)  p(transparent)
C1       0.2   0.6     0.3            0.4       0.4
C2       0.8   0.2     0.8            0.9       0.5

C1                            C2
.2*.4*.3*.4*.6 = 0.0058       .8*.8*.8*.9*.5 = 0.2304
.2*.4*.7*.6*.6 = 0.0202       .8*.8*.2*.1*.5 = 0.0064
.2*.6*.7*.4*.4 = 0.0134       .8*.2*.2*.9*.5 = 0.0144
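The products on this slide multiply the cluster probability by one factor per attribute: p if the marble has the attribute, 1 − p if it does not. The sketch below reproduces them. The three marbles are inferred from the factors in the products (for example, .4*.3*.4*.6 under C1 corresponds to small, monochrome, shiny, opaque), so the marble descriptions and all names in the code are assumptions.

```python
# Cluster parameters taken from the slide.
clusters = {
    "C1": {"prior": 0.2, "big": 0.6, "monochrome": 0.3, "shiny": 0.4, "transparent": 0.4},
    "C2": {"prior": 0.8, "big": 0.2, "monochrome": 0.8, "shiny": 0.9, "transparent": 0.5},
}


def joint_probability(marble, cluster):
    """Pr[cluster] times the probability of each attribute value,
    assuming the attributes are independent within the cluster."""
    p = cluster["prior"]
    for attribute, present in marble.items():
        p *= cluster[attribute] if present else 1.0 - cluster[attribute]
    return p


# Marbles inferred from the factors in the slide's three products.
marbles = [
    {"big": False, "monochrome": True, "shiny": True, "transparent": False},
    {"big": False, "monochrome": False, "shiny": False, "transparent": False},
    {"big": True, "monochrome": False, "shiny": True, "transparent": True},
]

table = [
    {name: round(joint_probability(m, c), 4) for name, c in clusters.items()}
    for m in marbles
]
# Each marble goes to the cluster with the larger joint probability.
best = [max(row, key=row.get) for row in table]
```

Under these parameters the first and third marbles fit C2 better and the second fits C1 better. The margin for the third marble is small (0.0144 against 0.0134).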

Page 60: Classification & Clustering

The Essence of Classification

A successful classification defines fundamental characteristics
A classification can never be better than the attributes it is based upon
There is no magic formula