Classification & Clustering

Post on 31-Dec-2015


Computer Science

Universiteit Maastricht

Institute for Knowledge and Agent Technology

Classification & Clustering

Pieter Spronck
http://www.cs.unimaas.nl/p.spronck

19 Apr 2023

Binary Division of Marbles


Big vs. Small


Transparent vs. Opaque


Marble Attributes

Size (big vs. small)
Transparency (transparent vs. opaque)
Shininess (shiny vs. dull)
Colouring (monochrome vs. polychrome)
Colour (blue, green, yellow, …)
…


Grouping of Marbles


“Marbles”


“Honouring All Distinctions”


“Colour Coding”


“Natural Grouping”

if transparent
then if coloured glass
     then group 1
     else group 3
else group 2

[Figure: marbles divided into groups 1, 2 and 3]
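The grouping rule on this slide can be written as a small function. This is a minimal sketch; the function name and the two boolean parameters are illustrative assumptions, not part of the slides.

```python
def natural_group(transparent: bool, coloured_glass: bool) -> int:
    """Return the group number (1, 2 or 3) assigned by the slide's rule."""
    if transparent:
        # Transparent marbles are split by whether the glass is coloured.
        return 1 if coloured_glass else 3
    # All opaque marbles share one group.
    return 2


# Hypothetical marbles, given as (transparent, coloured_glass).
for marble in [(True, True), (True, False), (False, False)]:
    print(marble, "-> group", natural_group(*marble))
```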


Types of Clusters

Uniquely classifying clusters
Overlapping clusters
Probabilistic clusters
Dendrograms


Uniquely Classifying Clusters


Overlapping Clusters


Probabilistic Clustering

Cluster  Green  Blue  Typical samples
1        1.0    0.0   [image]
2        0.0    1.0   [image]
3        0.1    0.9   [image]
4        0.5    0.5   [image]


Dendrogram

[Figure: dendrogram with branches labelled opaque, transparent, not clear, clear]


Classification

Ordering of entities into groups based on their similarity
Minimisation of within-group variance
Maximisation of between-group variance
Exhaustive and exclusive
Principal technique: clustering
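The two variance criteria can be made concrete with a small calculation. The sketch below assumes a single numeric attribute and splits the total variance into a within-group and a between-group part; the marble diameters and both groupings are hypothetical values chosen for illustration.

```python
def mean(values):
    return sum(values) / len(values)


def within_and_between_variance(groups):
    """Split the total variance of all values into within-group and between-group parts."""
    all_values = [v for group in groups for v in group]
    n = len(all_values)
    grand_mean = mean(all_values)
    # Spread of each value around the mean of its own group.
    within = sum((v - mean(group)) ** 2 for group in groups for v in group) / n
    # Spread of the group means around the overall mean, weighted by group size.
    between = sum(len(group) * (mean(group) - grand_mean) ** 2 for group in groups) / n
    return within, between


# Hypothetical marble diameters in millimetres, grouped in two different ways.
good_grouping = [[1.0, 2.0, 3.0], [11.0, 12.0, 13.0]]
poor_grouping = [[1.0, 2.0, 11.0], [3.0, 12.0, 13.0]]

for name, grouping in [("good", good_grouping), ("poor", poor_grouping)]:
    within, between = within_and_between_variance(grouping)
    print(f"{name}: within={within:.3f} between={between:.3f}")
```

Both groupings have the same total variance, so lowering the within-group part necessarily raises the between-group part.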


Reasons for Classification

Descriptive power
Parsimony
Maintainability
Versatility
Identification of distinctive attributes


Typology vs. Taxonomy

Typology – conceptual
Taxonomy – empirical


Typology

Define conceptual attributes
Select appropriate attributes
Create typology matrix (substruction)
Insert empirical entities in matrix
Extend matrix if necessary
Reduce matrix if necessary
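The matrix-building steps can be sketched in a few lines. This is an illustration under assumptions: the function names, the dictionary layout and the two sample marbles are hypothetical, and the matrix is simply the cross product of the attribute domains.

```python
from itertools import product


def build_matrix(attributes):
    """Create one empty cell for every combination of attribute values."""
    return {cell: [] for cell in product(*attributes.values())}


def insert_entity(matrix, attributes, entity):
    """Place an entity in the cell that matches its attribute values."""
    cell = tuple(entity[name] for name in attributes)
    matrix[cell].append(entity["label"])


attributes = {
    "transparency": ["opaque", "transparent"],
    "colouring": ["monochrome", "polychrome"],
}
matrix = build_matrix(attributes)

# Hypothetical marbles, for illustration only.
marbles = [
    {"label": "m1", "transparency": "opaque", "colouring": "monochrome", "size": "big"},
    {"label": "m2", "transparency": "transparent", "colouring": "polychrome", "size": "small"},
]
for marble in marbles:
    insert_entity(matrix, attributes, marble)

# Extending the matrix: add an attribute and rebuild the cells.
extended_attributes = {**attributes, "size": ["big", "small"]}
extended = build_matrix(extended_attributes)
for marble in marbles:
    insert_entity(extended, extended_attributes, marble)
```

Every added attribute multiplies the number of cells, which is why the later reduction steps matter.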


Defining Conceptual Attributes

Meaningful
Focus on ideal types
Order of importance
Exhaustive domains


Conceptual Marble Attributes


Typology Matrix

Columns: Transparency (Opaque, Transparent)
Rows: Colouring (Monochrome, Polychrome)


Matrix Extension

Columns: Transparency (Transparent, Opaque), each split by Glass (Clear, Not clear)
Rows: Colouring (Monochrome, Polychrome), each split by Size (Big, Small)


Reduction

Functional reduction
Pragmatic reduction
Numerical reduction
Reduction by using criterion types


Functional Reduction

Columns: Transparency (Transparent, Opaque), each split by Glass (Clear, Not clear)
Rows: Colouring (Monochrome, Polychrome), each split by Size (Big, Small)


Functionally Reduced Matrix

Columns: Transparency (Transparent, Opaque), with Transparent split by Glass (Clear, Not clear)
Rows: Colouring (Monochrome, Polychrome), each split by Size (Big, Small)


Pragmatic Reduction

Columns: Transparency (Transparent, Opaque), each split by Glass (Clear, Not clear)
Rows: Colouring (Monochrome, Polychrome), each split by Size (Big, Small)


Pragmatically Reduced Matrix

Columns: Transparency (Transparent, Opaque), with Transparent split by Glass (Clear, Not clear)
Rows: Size (Small, Big) and Colouring (Monochrome, Polychrome)


Criticising Typological Classification

Reification
Resilience
Problematic attribute selection
Unmanageability


Taxonomy

Define empirical attributes
Select appropriate attributes
Create entity matrix
Apply clustering technique
Analyse clusters


Empirical Attributes

Big

Single colour

Lots of colours

Green glass

Transparent

Blue

Yellow

White

Dull

Shiny


Selecting Attributes

Size (big/small)
Colour (yellow, green, blue, red, white…)
Colouring (monochrome/polychrome)
Shininess (shiny/dull)
Transparency (transparent/opaque)
Glass colour (clear, green, …)


Entity Matrix

Big  Monochrome  Shiny  Transparent
N    Y           Y      Y
N    Y           Y      Y
N    Y           Y      Y
N    Y           Y      Y
N    Y           Y      Y
N    N           N      N
Y    N           N      N
N    Y           Y      N
N    Y           Y      N
N    Y           Y      N
N    Y           Y      N
N    Y           Y      N
N    N           Y      Y
Y    N           Y      Y
Y    Y           Y      N


Automatic Clustering Parameters

Agglomerative vs. divisive
Monothetic vs. polythetic
Outliers permitted
Limits to number of clusters
Form of linkage (single, complete, average)
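Several of these parameters show up directly in code. The following is a minimal sketch of agglomerative, polythetic clustering with a selectable form of linkage and a limit on the number of clusters. It assumes entities written as Y/N strings and uses the Hamming distance (number of differing attributes); both are illustrative choices, not taken from the slides.

```python
def hamming(a, b):
    """Number of attributes on which two entities differ."""
    return sum(x != y for x, y in zip(a, b))


def linkage_distance(cluster_a, cluster_b, linkage):
    """Distance between two clusters under the chosen form of linkage."""
    distances = [hamming(a, b) for a in cluster_a for b in cluster_b]
    if linkage == "single":
        return min(distances)
    if linkage == "complete":
        return max(distances)
    if linkage == "average":
        return sum(distances) / len(distances)
    raise ValueError(f"unknown linkage: {linkage}")


def agglomerate(items, n_clusters, linkage="average"):
    """Start with one cluster per item and merge the closest pair until n_clusters remain."""
    clusters = [[item] for item in items]
    while len(clusters) > n_clusters:
        pairs = [(i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))]
        i, j = min(pairs, key=lambda ij: linkage_distance(clusters[ij[0]], clusters[ij[1]], linkage))
        merged = clusters.pop(j)  # j > i, so index i stays valid
        clusters[i].extend(merged)
    return clusters


# Hypothetical entities: four attributes each, Y = yes, N = no.
print(agglomerate(["NNNN", "NNNY", "YYYY", "YYYN"], n_clusters=2, linkage="single"))
```

A divisive method would run in the opposite direction, starting from one cluster and splitting it.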


Automatic Clustering

NYYY: small, monochrome, shiny, transparent

*NNN: polychrome, dull, opaque

*NYY: polychrome, shiny, transparent

YYYN: big, monochrome, shiny, opaque

NYYN: small, monochrome, shiny, opaque


Polythetic to Monothetic

NYYY: small, monochrome, shiny, transparent

*NNN: polychrome, dull, opaque

*NYY: polychrome, shiny, transparent

NYYN: small, monochrome, shiny, opaque

*YYN: monochrome, shiny, opaque


Analysing Clusters

“Vanilla”

“Stone”

“Tiger”

“Classic”

small, monochrome, shiny, transparent

polychrome, dull, opaque

polychrome, shiny, transparent

small, monochrome, shiny, opaque


Criticising Taxonomical Classification

Dependent on specimens
Difficult to generalise
Difficult to label
Biased towards academic discipline
Not the “last word”


Typology vs. Taxonomy

Typology                                          Taxonomy
Conceptual                                        Empirical
Subjective                                        Objective
Manual                                            (Mostly) automatic
Less discriminative                               More discriminative
Goes awry when there are insufficient insights    Goes awry when there are insufficient specimens


Operational Classification

Typology (conceptual)

Taxonomy (empirical)

Operational typology (conceptual + empirical)


Automated Clustering Methods

Iterative distance-based clustering: the k-means method
Incremental clustering: the Cobweb method
Probability-based clustering: the EM algorithm


k-Means Method

Iterative distance-based clustering
Divisive
Polythetic
Predefined number of clusters (k)
Outliers permitted


k-Means (pass 1)

k = 2
attributes: size (big/small), colouring (monochrome/polychrome), shininess (shiny/dull), transparency (transparent/opaque)

[Figure: marbles, with two “?” marks]


k-Means (pass 2)

Cluster average: small, monochrome, shiny, transparent

Cluster average: small, polychrome, dull, opaque

k = 2
attributes: size (big/small), colouring (monochrome/polychrome), shininess (shiny/dull), transparency (transparent/opaque)


k-Means (pass 3)

Cluster average: small, monochrome, shiny, transparent

Cluster average: big, polychrome, dull, opaque

k = 2
attributes: size (big/small), colouring (monochrome/polychrome), shininess (shiny/dull), transparency (transparent/opaque)

[Figure: marbles, with one “?” mark]
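The passes shown on these slides follow the usual k-means loop: assign every marble to the nearest cluster average, recompute the averages, and repeat until nothing changes. The sketch below is an illustration under assumptions: the four attributes are coded as 1 (yes) and 0 (no), the 15 marbles are the rows of the entity matrix as read from the transcript, and the two starting averages are picked by hand rather than at random.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def average(points):
    """Component-wise mean of a non-empty list of vectors."""
    return tuple(sum(column) / len(points) for column in zip(*points))


def k_means(points, centres, max_passes=100):
    """Alternate assignment and averaging until the assignments stop changing."""
    centres = list(centres)
    assignment = None
    for _ in range(max_passes):
        new_assignment = [
            min(range(len(centres)), key=lambda c: squared_distance(p, centres[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break
        assignment = new_assignment
        for c in range(len(centres)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old average if a cluster is empty
                centres[c] = average(members)
    return assignment, centres


# (big, monochrome, shiny, transparent), coded as 1 = yes, 0 = no
marbles = (
    [(0, 1, 1, 1)] * 5 + [(0, 0, 0, 0), (1, 0, 0, 0)]
    + [(0, 1, 1, 0)] * 5 + [(0, 0, 1, 1), (1, 0, 1, 1), (1, 1, 1, 0)]
)
# Assumed starting averages: "small, monochrome, shiny, transparent"
# and "small, polychrome, dull, opaque".
initial_centres = [(0, 1, 1, 1), (0, 0, 0, 0)]
assignment, centres = k_means(marbles, initial_centres)
print(assignment)
print(centres)
```

A different choice of starting averages can lead to a different final clustering, which is one reason the result depends on the initial setup.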


Cobweb Algorithm

Incremental clustering
Agglomerative
Polythetic
Dynamic number of clusters
Outliers permitted


Cobweb Procedure

Builds a tree by adding instances to it
Uses a Category Utility function to determine the quality of the clustering
Changes the tree structure if this positively influences the Category Utility (by merging nodes or splitting nodes)
“Cutoff” value may be used to group sufficiently similar instances together


Category Utility

Measure for quality of clustering
The better the predictive value of the average attribute values of the instances in the clusters for the individual attribute values, the higher the CU will be

\[
CU(C_1, \ldots, C_k) = \frac{\sum_{\ell} \Pr[C_\ell] \sum_{i} \sum_{j} \left( \Pr[a_i = v_{ij} \mid C_\ell]^2 - \Pr[a_i = v_{ij}]^2 \right)}{k}
\]


Category Utility for “Size” (1)

C1 C2

a) Pr[size=big|C1] = 1/3
b) Pr[size=big|C2] = 1/3
c) Pr[size=big] = 1/3
d) Pr[C1] = 1/2
e) Pr[size=small|C1] = 2/3
f) Pr[size=small|C2] = 2/3
g) Pr[size=small] = 2/3
h) Pr[C2] = 1/2

CU = (d((a²–c²)+(e²–g²)) + h((b²–c²)+(f²–g²)))/2 = 0


Category Utility for “Size” (2)

C1 C2

a) Pr[size=big|C1] = 2/3
b) Pr[size=big|C2] = 0
c) Pr[size=big] = 1/3
d) Pr[C1] = 1/2
e) Pr[size=small|C1] = 1/3
f) Pr[size=small|C2] = 1
g) Pr[size=small] = 2/3
h) Pr[C2] = 1/2

CU = (d((a²–c²)+(e²–g²)) + h((b²–c²)+(f²–g²)))/2
   = ((1/2)((1/3)+(–1/3)) + (1/2)((–1/9)+(5/9)))/2 = 1/9


Category Utility for “Size” (3)

C1 C2

a) Pr[size=big|C1] = 1
b) Pr[size=big|C2] = 0
c) Pr[size=big] = 1/3
d) Pr[C1] = 1/3
e) Pr[size=small|C1] = 0
f) Pr[size=small|C2] = 1
g) Pr[size=small] = 2/3
h) Pr[C2] = 2/3

CU = (d((a²–c²)+(e²–g²)) + h((b²–c²)+(f²–g²)))/2
   = ((1/3)((8/9)+(–4/9)) + (2/3)((–1/9)+(5/9)))/2 = 2/9
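The three worked examples for “Size” can be checked with exact fractions. This is a minimal sketch for a single attribute; the function name and argument layout are assumptions made for illustration, and the third case uses Pr[C2] = 2/3, the value that appears in the slide's arithmetic.

```python
from fractions import Fraction as F


def category_utility(cluster_probs, conditional_probs, overall_probs):
    """Category utility for one attribute.

    cluster_probs[l]        = Pr[C_l]
    conditional_probs[l][v] = Pr[a = v | C_l]
    overall_probs[v]        = Pr[a = v]
    """
    k = len(cluster_probs)
    total = sum(
        p_cluster * sum(cond[v] ** 2 - overall_probs[v] ** 2 for v in overall_probs)
        for p_cluster, cond in zip(cluster_probs, conditional_probs)
    )
    return total / k


overall = {"big": F(1, 3), "small": F(2, 3)}

cu_1 = category_utility(
    [F(1, 2), F(1, 2)],
    [{"big": F(1, 3), "small": F(2, 3)}, {"big": F(1, 3), "small": F(2, 3)}],
    overall,
)
cu_2 = category_utility(
    [F(1, 2), F(1, 2)],
    [{"big": F(2, 3), "small": F(1, 3)}, {"big": F(0), "small": F(1)}],
    overall,
)
cu_3 = category_utility(
    [F(1, 3), F(2, 3)],
    [{"big": F(1), "small": F(0)}, {"big": F(0), "small": F(1)}],
    overall,
)
print(cu_1, cu_2, cu_3)
```

The value rises from the first case to the third as size becomes more predictable from cluster membership.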


Cobweb Example


attributes: size (big/small), colouring (monochrome/polychrome), shininess (shiny/dull), transparency (transparent/opaque)


Cobweb Result Example

attributes: size (big/small), colouring (monochrome/polychrome), shininess (shiny/dull), transparency (transparent/opaque)


Cobweb Numerical

Probability of values of attributes of instances in a cluster is based on the standard deviation from the estimate for the mean value
Acuity is presumed variance in attribute values
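For numeric attributes, the sum of squared probabilities in the category utility is commonly replaced by an integral over a normal density. The block below sketches that standard textbook form; it is not taken from the slides. Here σ_i is assumed to be the standard deviation of attribute a_i over all instances and σ_{iℓ} its standard deviation within cluster C_ℓ. Acuity then acts as a lower bound on these standard deviations, so that a cluster with a single instance does not give an infinite term.

```latex
\sum_j \Pr[a_i = v_{ij}]^2
  \;\longrightarrow\;
  \int f(a_i)^2 \, da_i = \frac{1}{2\sqrt{\pi}\,\sigma_i}

CU(C_1, \ldots, C_k) =
  \frac{1}{k} \sum_{\ell} \Pr[C_\ell] \,
  \frac{1}{2\sqrt{\pi}} \sum_i \left( \frac{1}{\sigma_{i\ell}} - \frac{1}{\sigma_i} \right)
```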


Disadvantages of Previous Methods

Fast and hard to judge
Dependent on initial setup
Ad-hoc limitations
Hard to escape from local minima


Probability-based Clustering

Finite mixture models
Each cluster is defined by a vector of probabilities for instances to have certain values for their attributes, and a probability for instances to reside in the cluster
Clustering equals searching for optimal sets of probabilities for a sample set


Expectation-Maximisation (EM)

Probability-based clustering
Divisive
Polythetic
Predefined number of clusters (k)
Outliers permitted


EM Procedure

Select k cluster vectors randomly
Calculate cluster probabilities for each instance (under the assumption that the instance attributes are independent)
Use calculations to re-estimate values
Repeat until increase in quality becomes negligible
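The four steps can be sketched for yes/no attributes as follows. This is an illustration under assumptions, not the implementation behind the slides: quality is measured as the log-likelihood of the data, the function and parameter names are hypothetical, the marbles are the 0/1-coded rows of the entity matrix as read from the transcript, and probabilities are clamped slightly away from 0 and 1 to keep the arithmetic safe.

```python
import math
import random


def em_bernoulli(data, k, max_iterations=200, tolerance=1e-6, seed=0):
    """Fit k clusters of independent yes/no attributes with EM."""
    rng = random.Random(seed)
    n_attrs = len(data[0])
    eps = 1e-6  # keeps probabilities away from exactly 0 and 1
    priors = [1.0 / k] * k
    # Step 1: random starting vectors.
    probs = [[rng.uniform(0.25, 0.75) for _ in range(n_attrs)] for _ in range(k)]
    previous = -math.inf
    resp = []
    for _ in range(max_iterations):
        # Step 2: cluster probabilities per instance, attributes assumed independent.
        resp = []
        log_likelihood = 0.0
        for x in data:
            joint = []
            for c in range(k):
                p = priors[c]
                for value, p_yes in zip(x, probs[c]):
                    p *= p_yes if value else 1.0 - p_yes
                joint.append(p)
            total = sum(joint)
            log_likelihood += math.log(total)
            resp.append([j / total for j in joint])
        # Step 4: stop when the increase in quality becomes negligible.
        if log_likelihood - previous < tolerance:
            break
        previous = log_likelihood
        # Step 3: re-estimate the values from the weighted instances.
        for c in range(k):
            weight = sum(r[c] for r in resp)
            if weight == 0.0:  # a cluster that no instance supports
                priors[c] = 0.0
                continue
            priors[c] = weight / len(data)
            probs[c] = [
                min(max(sum(r[c] * x[i] for r, x in zip(resp, data)) / weight, eps), 1 - eps)
                for i in range(n_attrs)
            ]
    return priors, probs, resp


# (big, monochrome, shiny, transparent), coded as 1 = yes, 0 = no
marbles = (
    [(0, 1, 1, 1)] * 5 + [(0, 0, 0, 0), (1, 0, 0, 0)]
    + [(0, 1, 1, 0)] * 5 + [(0, 0, 1, 1), (1, 0, 1, 1), (1, 1, 1, 0)]
)
priors, probs, resp = em_bernoulli(marbles, k=2)
print(priors)
print(probs)
```

Unlike k-means, every instance ends up with a probability of membership in each cluster rather than a single assignment.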


EM Result Example

Cluster 1: pC1 = 0.2, pbig = 0.6, pmonochrome = 0.3, pshiny = 0.4, ptransparent = 0.4
Cluster 2: pC2 = 0.8, pbig = 0.2, pmonochrome = 0.8, pshiny = 0.9, ptransparent = 0.5

Cluster 1                    Cluster 2
.2*.4*.3*.4*.6 = 0.0058      .8*.8*.8*.9*.5 = 0.2304
.2*.4*.7*.6*.6 = 0.0202      .8*.8*.2*.1*.5 = 0.0064
.2*.6*.7*.4*.4 = 0.0134      .8*.2*.2*.9*.5 = 0.0144
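The products on this slide can be reproduced by multiplying the cluster probability with one factor per attribute: p for “yes” and 1 − p for “no”. The three instances below are not stated on the slide; they are inferred from the factors (small/monochrome/shiny/opaque, small/polychrome/dull/opaque, big/polychrome/shiny/transparent) and should be read as an assumption. The function names are illustrative.

```python
clusters = {
    "C1": {"prior": 0.2, "big": 0.6, "monochrome": 0.3, "shiny": 0.4, "transparent": 0.4},
    "C2": {"prior": 0.8, "big": 0.2, "monochrome": 0.8, "shiny": 0.9, "transparent": 0.5},
}

# Instances inferred from the factors on the slide (an assumption).
instances = [
    {"big": False, "monochrome": True, "shiny": True, "transparent": False},
    {"big": False, "monochrome": False, "shiny": False, "transparent": False},
    {"big": True, "monochrome": False, "shiny": True, "transparent": True},
]


def joint_probability(cluster, instance):
    """Pr[cluster] times the probability of each attribute value, assumed independent."""
    p = cluster["prior"]
    for attribute, present in instance.items():
        p_yes = cluster[attribute]
        p *= p_yes if present else 1.0 - p_yes
    return p


def memberships(instance):
    """Normalise the joint probabilities so they sum to 1 over the clusters."""
    joint = {name: joint_probability(c, instance) for name, c in clusters.items()}
    total = sum(joint.values())
    return {name: value / total for name, value in joint.items()}


for instance in instances:
    joint = {name: round(joint_probability(c, instance), 4) for name, c in clusters.items()}
    print(joint, memberships(instance))
```

Normalising each pair of products gives the probability that the instance belongs to either cluster.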


The Essence of Classification

A successful classification defines fundamental characteristics
A classification can never be better than the attributes it is based upon

There is no magic formula