Clustering Algorithms Based on Cost Function Optimization
CSCE 970 Lecture 10: Clustering Algorithms Based on Cost Function Optimization
Stephen D. Scott
April 9, 2001
1
Introduction
• A very popular family of clustering algorithms
• General procedure: Define a cost function J
measuring the goodness of a clustering and
search for a parameter vector θ to
minimize/maximize J
• Search is subject to certain constraints, e.g.
that the result fits the definition of a clustering
(slide 7.8)
• E.g. θ = [m_1^T, m_2^T, …, m_m^T]^T = the point representatives of the m clusters
• θ can also be a vector of parameters for m
hyperplanar or hyperspherical representatives
• Typically algorithms assume m is known, so
may have to try several values in practice
• Skipping Bayesian-style approaches (Sec. 14.2)
and focusing on hard c-means (Isodata) and
fuzzy c-means algorithms
2
Hard Clustering Algorithms
Definitions
• Let U be an N ×m matrix whose (i, j)th entry
uij is 1 if f.v. xi is in cluster Cj and 0 otherwise
• To meet definition of hard clustering, ∀i ∈ {1, …, N} need
\sum_{j=1}^{m} u_{ij} = 1 and u_{ij} ∈ {0, 1}
• Let θ = [θ_1, …, θ_m] be an m-vector of representatives of the m clusters
• Use cost function
J(θ, U) = \sum_{i=1}^{N} \sum_{j=1}^{m} u_{ij} \, d(x_i, θ_j)
(a short code sketch of this cost follows the slide)
• Globally minimizing J given a set of f.v.’s means finding both a set of representatives θ and an assignment to clusters U that minimizes the dissimilarity measure (DM) between each f.v. and its representative
3
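A minimal NumPy sketch of evaluating this cost, assuming the dissimilarity measure d is squared Euclidean distance (the slide leaves d generic); the names X (an N×ℓ data matrix), U (the N×m hard membership matrix), and theta (an m×ℓ array of point representatives) are illustrative, not from the slides.

import numpy as np

def hard_cost(X, U, theta):
    """J(theta, U) = sum_i sum_j u_ij * d(x_i, theta_j),
    with d taken here as squared Euclidean distance."""
    # D[i, j] = ||x_i - theta_j||^2 for every (point, representative) pair
    D = ((X[:, None, :] - theta[None, :, :]) ** 2).sum(axis=2)
    return float((U * D).sum())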
Hard Clustering Algorithms
Minimizing J
• For a fixed θ, J is minimized and constraints
met iff
u_{ij} = 1 if d(x_i, θ_j) = \min_{1≤k≤m} d(x_i, θ_k), and u_{ij} = 0 otherwise
• For a fixed U , minimize J by minimizing w.r.t.
each θj independently, so for j ∈ {1, . . . , m},
take gradient and set to 0 vector:
\sum_{i=1}^{N} u_{ij} \frac{∂ d(x_i, θ_j)}{∂ θ_j} = 0
• Can alternate between the above two steps until a termination condition is met (a one-round sketch follows below)
4
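To make the two alternating steps concrete, here is a hedged sketch of one round of the alternation for the squared-Euclidean case, where the gradient condition gives the cluster mean; the array conventions follow the earlier sketch, and the empty-cluster handling is an assumption.

import numpy as np

def one_round(X, theta):
    """One alternation: fix theta and pick the cost-minimizing hard assignment U,
    then fix U and re-solve each theta_j (the mean of cluster j for squared Euclidean d)."""
    N, m = X.shape[0], theta.shape[0]
    D = ((X[:, None, :] - theta[None, :, :]) ** 2).sum(axis=2)
    U = np.zeros((N, m))
    U[np.arange(N), D.argmin(axis=1)] = 1.0      # u_ij = 1 only for the closest theta_j
    new_theta = np.array(theta, dtype=float)
    for j in range(m):
        members = X[U[:, j] == 1.0]
        if len(members) > 0:                     # keep theta_j unchanged if C_j is empty
            new_theta[j] = members.mean(axis=0)
    return U, new_theta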
Hard Clustering Algorithms
Generalized Hard Algorithmic Scheme (GHAS)
• Initialize t = 0, θj(t) for j = 1, . . . , m
• Repeat until termination condition met
– For i = 1 to N (assign f.v.’s to clusters)
∗ For j = 1 to m
u_{ij}(t) = 1 if d(x_i, θ_j) = \min_{1≤k≤m} d(x_i, θ_k), and u_{ij}(t) = 0 otherwise
– t = t + 1
– For j = 1 to m, solve
\sum_{i=1}^{N} u_{ij}(t−1) \frac{∂ d(x_i, θ_j)}{∂ θ_j} = 0    (1)
for θj and set θj(t) equal to it
• Example termination condition: Stop when
‖θ(t) − θ(t − 1)‖ < ε, where ‖ · ‖ is any vector
norm and ε is user-specified
5
Isodata (a.k.a. Hard k-Means
a.k.a. Hard c-Means) Algorithm
• Special case of GHAS: Each θj is point rep. of
Cj and DM is squared Euclidean distance
• So (1) becomes
\sum_{i=1}^{N} u_{ij}(t−1) \frac{∂}{∂ θ_j} \left( \sum_{k=1}^{ℓ} (x_{ik} − θ_{jk})^2 \right) = 0
⇓
\sum_{i=1}^{N} u_{ij}(t−1) \, [\, 2(θ_{j1} − x_{i1}), …, 2(θ_{jℓ} − x_{iℓ}) \,]^T = 0
⇓
2 θ_j \sum_{i=1}^{N} u_{ij}(t−1) = 2 \sum_{i=1}^{N} u_{ij}(t−1) \, x_i
⇓
θ_j = \frac{1}{|C_j(t−1)|} \sum_{x_i ∈ C_j(t−1)} x_i = C_j(t−1)’s mean vector
6
Isodata Algorithm
Pseudocode
• Initialize t = 0, θj(t) for j = 1, . . . , m
• Repeat until ‖θ(t)− θ(t − 1)‖ = 0
– For i = 1 to N
∗ Find closest rep. for xi, say θj, and set
b(i) = j
– For j = 1 to m
∗ Set θj = mean of {xi ∈ X : b(i) = j}
• Guaranteed to converge (to a local minimum of J) if squared Euclidean distance is used
• If e.g. Euclidean distance is used, cannot guarantee this (a runnable Python sketch of this pseudocode follows the slide)
7
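Below is a hedged Python sketch of this pseudocode; the initialization (sampling m data points), the iteration cap, and the empty-cluster handling are my own assumptions, not specified on the slide.

import numpy as np

def isodata(X, m, max_iter=100, seed=0):
    """Hard c-means (Isodata): alternate nearest-representative assignment
    and cluster-mean updates until theta stops changing."""
    rng = np.random.default_rng(seed)
    theta = X[rng.choice(len(X), size=m, replace=False)].astype(float)  # assumed init
    b = np.zeros(len(X), dtype=int)
    for _ in range(max_iter):
        D = ((X[:, None, :] - theta[None, :, :]) ** 2).sum(axis=2)
        b = D.argmin(axis=1)                     # b(i) = index of closest representative
        new_theta = theta.copy()
        for j in range(m):
            members = X[b == j]
            if len(members) > 0:                 # keep old theta_j if cluster j is empty
                new_theta[j] = members.mean(axis=0)
        if np.allclose(new_theta, theta):        # stand-in for ||theta(t) - theta(t-1)|| = 0
            theta = new_theta
            break
        theta = new_theta
    return theta, b

With squared Euclidean distance, neither step can increase J, which is what drives the convergence claim above.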
Fuzzy Clustering Algorithms
Definitions
• Let U be an N ×m matrix whose (i, j)th entry
uij quantifies the fuzzy membership of xi in
cluster Cj
• To meet definition of fuzzy clustering, ∀i ∈ {1, …, N} need
\sum_{j=1}^{m} u_{ij} = 1, ∀j ∈ {1, …, m} need 0 < \sum_{i=1}^{N} u_{ij} < N, and ∀i, j need u_{ij} ∈ [0, 1]
• Let θ = [θ_1, …, θ_m] be an m-vector of representatives of the m clusters
• Use cost function
J_q(θ, U) = \sum_{i=1}^{N} \sum_{j=1}^{m} u_{ij}^q \, d(x_i, θ_j),
where q is a fuzzifier that, when > 1, allows fuzzy clusterings to have lower cost than hard clusterings (Example 14.4, pp. 453–454); a short code sketch of J_q follows the slide
8
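For comparison with the hard cost on slide 3, a one-function sketch of J_q under the same squared-Euclidean assumption; U here holds fuzzy memberships in [0, 1] and q is the fuzzifier.

import numpy as np

def fuzzy_cost(X, U, theta, q):
    """J_q(theta, U) = sum_i sum_j u_ij^q * d(x_i, theta_j)."""
    D = ((X[:, None, :] - theta[None, :, :]) ** 2).sum(axis=2)
    return float(((U ** q) * D).sum())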
Fuzzy Clustering Algorithms
Minimizing Jq
• When fixing θ and minimizing w.r.t. U while
satisfying constraints, cannot use simple method
from slide 4
• Instead, use Lagrange multipliers (pp. 610–611)
to enforce the constraint \sum_{j=1}^{m} u_{ij} = 1 ∀i:
J(θ, U) = \underbrace{\sum_{i=1}^{N} \sum_{j=1}^{m} u_{ij}^q \, d(x_i, θ_j)}_{J_q} − \underbrace{\sum_{i=1}^{N} λ_i \left( \sum_{j=1}^{m} u_{ij} − 1 \right)}_{= 0 \ \text{when} \ \sum_{j=1}^{m} u_{ij} = 1 \ ∀i}
• Now minimize J w.r.t. U, yielding an update equation in terms of the λ_i’s, solve for the λ_i’s using the constraint, and end up with the final update equation for U
9
Fuzzy Clustering Algorithms
Minimizing Jq (cont’d)
∂J(θ, U)/∂u_{rs} = q \, u_{rs}^{q−1} \, d(x_r, θ_s) − λ_r = 0
⇓
u_{rs} = \left( \frac{λ_r}{q \, d(x_r, θ_s)} \right)^{1/(q−1)},  s = 1, …, m
⇓  (use constraint \sum_{j=1}^{m} u_{rj} = 1)
\sum_{j=1}^{m} \left( \frac{λ_r}{q \, d(x_r, θ_j)} \right)^{1/(q−1)} = 1
⇓
λ_r = \frac{q}{\left( \sum_{j=1}^{m} \left( \frac{1}{d(x_r, θ_j)} \right)^{1/(q−1)} \right)^{q−1}}
⇓
u_{rs} = \frac{1}{\sum_{j=1}^{m} \left( \frac{d(x_r, θ_s)}{d(x_r, θ_j)} \right)^{1/(q−1)}}
(a vectorized code sketch of this final membership update follows the slide)
10
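A small vectorized sketch of this final membership update, computed for all points and clusters at once; D is assumed to hold the d(x_i, θ_j) values, q > 1 is the fuzzifier, and the eps guard (for a point sitting exactly on a representative) is an added assumption the derivation does not address.

import numpy as np

def fuzzy_memberships(D, q, eps=1e-12):
    """u_rs = 1 / sum_j (d(x_r, theta_s) / d(x_r, theta_j))^(1/(q-1))."""
    D = np.maximum(D, eps)                              # avoid dividing by a zero distance
    ratio = (D[:, :, None] / D[:, None, :]) ** (1.0 / (q - 1.0))   # ratio[r, s, j]
    return 1.0 / ratio.sum(axis=2)                      # rows sum to 1, as the constraint requires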
Fuzzy Clustering Algorithms
Minimizing Jq (cont’d)
• For a fixed U , minimize Jq by minimizing w.r.t.
each θj independently, so for j ∈ {1, . . . , m},
take gradient and set to 0 vector:
∂J(θ, U)/∂θ_j = \sum_{i=1}^{N} u_{ij}^q \frac{∂ d(x_i, θ_j)}{∂ θ_j} = 0
• As with the hard clustering scheme, alternate between U and θ until a termination condition is met
11
Fuzzy Clustering Algorithms
Generalized Fuzzy Algorithmic Scheme (GFAS)
• Initialize t = 0, θj(t) for j = 1, . . . , m
• Repeat until termination condition met
– For i = 1 to N (assign memb. values to f.v.’s)
∗ For j = 1 to m
u_{ij}(t) = \frac{1}{\sum_{k=1}^{m} \left( \frac{d(x_i, θ_j)}{d(x_i, θ_k)} \right)^{1/(q−1)}}
– t = t + 1
– For j = 1 to m, solve
\sum_{i=1}^{N} u_{ij}^q(t−1) \frac{∂ d(x_i, θ_j)}{∂ θ_j} = 0    (2)
for θj and set θj(t) equal to it
• Example termination condition: Stop when
‖θ(t) − θ(t − 1)‖ < ε, where ‖ · ‖ is any vector
norm and ε is user-specified
12
Fuzzy c-Means (a.k.a. Fuzzy k-Means)
Algorithm
• If we use GFAS with θj = point rep. of Cj and
DM = squared Euclidean dist., (2) becomes
2 \sum_{i=1}^{N} u_{ij}^q(t−1) (θ_j − x_i) = 0
⇓
θ_j(t) = \frac{\sum_{i=1}^{N} u_{ij}^q(t−1) \, x_i}{\sum_{i=1}^{N} u_{ij}^q(t−1)}    (3)
(a full fuzzy c-means sketch using (3) follows the slide)
• Convergence guarantees?
• Instead of squared Euclidean distance, can use
d(x_i, θ_j) = (x_i − θ_j)^T A (x_i − θ_j),    (4)
where A is symmetric and positive definite
• Use of (4) doesn’t change (3)
13
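Putting the GFAS membership update together with update (3) gives this hedged sketch of fuzzy c-means; the random initialization, tolerance, iteration cap, and eps guard are assumptions, and d is squared Euclidean as on this slide.

import numpy as np

def fuzzy_c_means(X, m, q=2.0, max_iter=100, tol=1e-6, seed=0):
    """Fuzzy c-means with point representatives and squared Euclidean distance."""
    rng = np.random.default_rng(seed)
    theta = X[rng.choice(len(X), size=m, replace=False)].astype(float)
    U = np.full((len(X), m), 1.0 / m)
    for _ in range(max_iter):
        D = ((X[:, None, :] - theta[None, :, :]) ** 2).sum(axis=2)
        D = np.maximum(D, 1e-12)
        ratio = (D[:, :, None] / D[:, None, :]) ** (1.0 / (q - 1.0))
        U = 1.0 / ratio.sum(axis=2)                       # GFAS membership update
        W = U ** q
        new_theta = (W.T @ X) / W.sum(axis=0)[:, None]    # update (3): u^q-weighted means
        if np.linalg.norm(new_theta - theta) < tol:       # ||theta(t) - theta(t-1)|| < eps
            theta = new_theta
            break
        theta = new_theta
    return theta, U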
Possibilistic Clustering Algorithms
• Similar to fuzzy clustering algorithms, but don’t require \sum_{j=1}^{m} u_{ij} = 1 ∀i ∈ {1, …, N}
• Still need u_{ij} ∈ [0, 1] ∀i, j and 0 < \sum_{i=1}^{N} u_{ij} ≤ N ∀j ∈ {1, …, m}
• In addition, require some uij > 0 ∀i ∈ {1, . . . , N}
• Instead of measuring degree of membership of x_i in C_j, now u_{ij} measures degree of compatibility, i.e. the possibility that x_i belongs in C_j
• Cannot directly use fuzzy cost function of slide
8 since it’s trivially minimized with U = 0,
violating our new constraint, so use:
J(θ, U) = \underbrace{\sum_{i=1}^{N} \sum_{j=1}^{m} u_{ij}^q \, d(x_i, θ_j)}_{\text{original cost func.}} + \underbrace{\sum_{j=1}^{m} η_j \sum_{i=1}^{N} (1 − u_{ij})^q}_{\text{decreases as the } u_{ij}\text{'s increase}}
where η_j > 0 ∀j (a short code sketch of this cost follows the slide)
14
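A sketch of this possibilistic cost, again with squared Euclidean d assumed; eta is the vector of the m penalty weights η_j > 0.

import numpy as np

def possibilistic_cost(X, U, theta, eta, q):
    """Original fuzzy-style term plus the penalty that discourages U = 0."""
    D = ((X[:, None, :] - theta[None, :, :]) ** 2).sum(axis=2)
    penalty = (eta * ((1.0 - U) ** q).sum(axis=0)).sum()  # sum_j eta_j sum_i (1 - u_ij)^q
    return float(((U ** q) * D).sum() + penalty)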
Possibilistic Clustering Algorithms
Minimizing J
∂J(θ, U)/∂u_{ij} = q \, u_{ij}^{q−1} \, d(x_i, θ_j) − q η_j (1 − u_{ij})^{q−1} = 0
⇓
u_{ij} = \frac{1}{1 + \left( \frac{d(x_i, θ_j)}{η_j} \right)^{1/(q−1)}}    (5)
• Updating for θ is same as for GFAS:
\sum_{i=1}^{N} u_{ij}^q(t−1) \frac{∂ d(x_i, θ_j)}{∂ θ_j} = 0
• Setting ηj’s can be done by running GFAS then
taking a weighted average of DM between xi’s
and θj:
η_j = \frac{\sum_{i=1}^{N} u_{ij}^q \, d(x_i, θ_j)}{\sum_{i=1}^{N} u_{ij}^q}
[then run GPAS after setting the η_j’s; a code sketch of (5) and this η_j estimate follows the slide]
15
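A sketch of the two pieces above: the membership update (5) and the η_j estimate from a GFAS run; D is assumed to hold the d(x_i, θ_j) values, U the GFAS memberships, and eta the vector of η_j's.

import numpy as np

def possibilistic_memberships(D, eta, q):
    """Equation (5): u_ij = 1 / (1 + (d(x_i, theta_j) / eta_j)^(1/(q-1)))."""
    return 1.0 / (1.0 + (D / eta[None, :]) ** (1.0 / (q - 1.0)))

def estimate_eta(U, D, q):
    """eta_j = sum_i u_ij^q d(x_i, theta_j) / sum_i u_ij^q, using a GFAS run's U and D."""
    W = U ** q
    return (W * D).sum(axis=0) / W.sum(axis=0)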
Possibilistic Clustering Algorithms
Benefit: Less Sensitivity to Outliers
• Since u_{ij} is inversely related to d(x_i, θ_j), updates to θ are less sensitive to outliers
[Figure: a single outlier x_i at distance d from each of the four cluster representatives C_1–C_4; the distances within the "natural clusters" are << d, the distance to the outlier]
• For GHAS (slide 5), uij = 1 for one cluster, 0
for others
• For GFAS (slide 12), uij = 1/m = 1/4 for each
cluster
• For GPAS, u_{ij} gets arbitrarily small as d grows, and is independent of distances to the other θ_k’s (see the numeric check below)
16
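A quick numeric check of these three bullets, with assumed values: an outlier equidistant (d = 100) from m = 4 representatives, η_j = 1 (the within-cluster scale), and q = 2.

d, m, q, eta = 100.0, 4, 2.0, 1.0
ghas_u = 1.0                                           # GHAS: 1 for one cluster, 0 for the rest
gfas_u = 1.0 / m                                       # GFAS: equal distances give u_ij = 1/4
gpas_u = 1.0 / (1.0 + (d / eta) ** (1.0 / (q - 1.0)))  # GPAS: about 0.0099, shrinking with d
print(ghas_u, gfas_u, gpas_u)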
Possibilistic Clustering Algorithms
Benefit: Mode-Seeking Property
• Can write cost function as J(θ, U) = \sum_{j=1}^{m} J_j for
J_j = \sum_{i=1}^{N} u_{ij}^q \, d(x_i, θ_j) + η_j \sum_{i=1}^{N} (1 − u_{ij})^q    (6)
and get same uij updates by minimizing Jj’s
individually
• Rewriting (5) as d(x_i, θ_j) = η_j \left( \frac{1 − u_{ij}}{u_{ij}} \right)^{q−1} and substituting into (6) gives
J_j = \sum_{i=1}^{N} u_{ij}^q \, η_j \left( \frac{1 − u_{ij}}{u_{ij}} \right)^{q−1} + η_j \sum_{i=1}^{N} (1 − u_{ij})^q
 = η_j \sum_{i=1}^{N} (1 − u_{ij})^{q−1} (u_{ij} + 1 − u_{ij})
 = η_j \sum_{i=1}^{N} (1 − u_{ij})^{q−1}
• So minimizing J ⇒ minimizing the J_j’s ⇒ maximizing the u_{ij}’s ⇒ minimizing d(x_i, θ_j) for each j (a quick numeric check of this simplification follows the slide)
17
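A quick numeric check of the simplification above, using assumed distances and memberships that satisfy (5); both expressions for J_j should agree.

import numpy as np

rng = np.random.default_rng(0)
q, eta_j = 2.5, 0.7                                   # assumed fuzzifier and eta_j
d = rng.uniform(0.1, 5.0, size=8)                     # assumed d(x_i, theta_j) values
u = 1.0 / (1.0 + (d / eta_j) ** (1.0 / (q - 1.0)))    # memberships from (5)
J_j = (u ** q * d).sum() + eta_j * ((1.0 - u) ** q).sum()   # J_j as in (6)
J_j_simplified = eta_j * ((1.0 - u) ** (q - 1.0)).sum()
print(np.isclose(J_j, J_j_simplified))                # True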
Possibilistic Clustering Algorithms
Benefit: Mode-Seeking Property (cont’d)
• Thus GPAS seeks out regions dense in f.v.’s,
i.e. if m > k = number of natural clusters, then
properly initialized GPAS will have coincident
clusters
• So m need not be known a priori, only an upper bound and proper initialization
18