Clustering Algorithms Based on Cost Function Optimization


CSCE 970 Lecture 10: Clustering Algorithms Based on Cost Function Optimization

Stephen D. Scott

April 9, 2001

1

Introduction

• A very popular family of clustering algorithms

• General procedure: Define a cost function J measuring the goodness of a clustering and search for a parameter vector θ to minimize/maximize J

• Search is subject to certain constraints, e.g. that the result fits the definition of a clustering (slide 7.8)

• E.g. θ = [m_1^T, m_2^T, …, m_m^T]^T = the point representatives of the m clusters

• θ can also be a vector of parameters for m hyperplanar or hyperspherical representatives

• Typically algorithms assume m is known, so may have to try several values in practice

• Skipping Bayesian-style approaches (Sec. 14.2) and focusing on hard c-means (Isodata) and fuzzy c-means algorithms

2

Hard Clustering Algorithms

Definitions

• Let U be an N ×m matrix whose (i, j)th entry

uij is 1 if f.v. xi is in cluster Cj and 0 otherwise

• To meet definition of hard clustering,

∀i ∈ {1, . . . , N} needm∑

j=1

uij = 1 and uij ∈ {0,1}

• Let θ = [θ1, . . . , θm] be an m-vector of repre-

sentatives of the m clusters

• Use cost function

J (θ, U) =N∑

i=1

m∑

j=1

uij d(

xi, θj

)

• Globally minimizing J given a set of f.v.’s means

finding both a set of representatives θ and as-

signment to clusters U that minimizes DM be-

tween each f.v. and its representative

3
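
As a concrete illustration (not part of the original slides), here is a minimal NumPy sketch of this cost function for a hard assignment, assuming the DM d is squared Euclidean distance; the function name hard_cost is mine:

import numpy as np

def hard_cost(X, theta, U):
    # J(theta, U) = sum_i sum_j u_ij * d(x_i, theta_j), with d taken here
    # to be squared Euclidean distance (an assumption, not required by J).
    # X: (N, l) feature vectors; theta: (m, l) representatives;
    # U: (N, m) 0/1 assignment matrix whose rows each sum to 1.
    d = ((X[:, None, :] - theta[None, :, :]) ** 2).sum(axis=2)  # (N, m)
    return float((U * d).sum())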

Hard Clustering Algorithms

Minimizing J

• For a fixed θ, J is minimized and the constraints are met iff

  u_ij = 1 if d(x_i, θ_j) = min_{1≤k≤m} {d(x_i, θ_k)}, and 0 otherwise

• For a fixed U, minimize J by minimizing w.r.t. each θ_j independently, so for j ∈ {1, …, m}, take the gradient and set it to the 0 vector:

  ∑_{i=1}^{N} u_ij ∂d(x_i, θ_j)/∂θ_j = 0

• Can alternate between the above steps until a termination condition is met

4


Hard Clustering Algorithms

Generalized Hard Algorithmic Scheme (GHAS)

• Initialize t = 0, θ_j(t) for j = 1, …, m

• Repeat until termination condition met

  – For i = 1 to N (assign f.v.'s to clusters)
    ∗ For j = 1 to m
      u_ij(t) = 1 if d(x_i, θ_j) = min_{1≤k≤m} {d(x_i, θ_k)}, and 0 otherwise

  – t = t + 1

  – For j = 1 to m, solve

      ∑_{i=1}^{N} u_ij(t−1) ∂d(x_i, θ_j)/∂θ_j = 0    (1)

    for θ_j and set θ_j(t) equal to it

• Example termination condition: Stop when ‖θ(t) − θ(t−1)‖ < ε, where ‖·‖ is any vector norm and ε is user-specified

5

Isodata (a.k.a. Hard k-Means, a.k.a. Hard c-Means) Algorithm

• Special case of GHAS: Each θ_j is the point representative of C_j and the DM is squared Euclidean distance

• So (1) becomes

  ∑_{i=1}^{N} u_ij(t−1) ∂(∑_{k=1}^{ℓ} (x_ik − θ_jk)^2)/∂θ_j = 0

  ∑_{i=1}^{N} u_ij(t−1) [2(θ_j1 − x_i1), …, 2(θ_jℓ − x_iℓ)]^T = 0

  2 θ_j ∑_{i=1}^{N} u_ij(t−1) = 2 ∑_{i=1}^{N} u_ij(t−1) x_i

  θ_j = (1 / |C_j(t−1)|) ∑_{x_i ∈ C_j(t−1)} x_i = C_j(t−1)'s mean vector

6

Isodata Algorithm

Pseudocode

• Initialize t = 0, θ_j(t) for j = 1, …, m

• Repeat until ‖θ(t) − θ(t−1)‖ = 0

  – For i = 1 to N
    ∗ Find the closest representative for x_i, say θ_j, and set b(i) = j

  – For j = 1 to m
    ∗ Set θ_j = mean of {x_i ∈ X : b(i) = j}

• Guaranteed to converge to a (possibly local) minimum of J if squared Euclidean distance is used

• If e.g. the unsquared Euclidean distance is used, even this cannot be guaranteed

7
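
Below is a minimal runnable NumPy sketch of this pseudocode, assuming point representatives and squared Euclidean distance; the name isodata and the empty-cluster handling are my own choices, not the textbook's:

import numpy as np

def isodata(X, theta_init, max_iter=100):
    # Hard c-means (Isodata): alternate the assignment step and the
    # mean-update step until the representatives stop changing.
    theta = theta_init.copy()                                   # (m, l)
    b = None
    for _ in range(max_iter):
        # Assignment: b[i] = index of the representative closest to x_i
        d = ((X[:, None, :] - theta[None, :, :]) ** 2).sum(axis=2)
        b = d.argmin(axis=1)
        # Update: theta_j becomes the mean vector of its assigned f.v.'s
        new_theta = theta.copy()
        for j in range(theta.shape[0]):
            members = X[b == j]
            if len(members) > 0:     # keep old theta_j if cluster j is empty
                new_theta[j] = members.mean(axis=0)
        if np.allclose(new_theta, theta):   # ||theta(t) - theta(t-1)|| = 0
            theta = new_theta
            break
        theta = new_theta
    return theta, b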

Fuzzy Clustering Algorithms

Definitions

• Let U be an N × m matrix whose (i, j)th entry u_ij quantifies the fuzzy membership of x_i in cluster C_j

• To meet the definition of fuzzy clustering, ∀i ∈ {1, …, N} need ∑_{j=1}^{m} u_ij = 1, ∀j ∈ {1, …, m} need 0 < ∑_{i=1}^{N} u_ij < N, and ∀i, j need u_ij ∈ [0, 1]

• Let θ = [θ_1, …, θ_m] be an m-vector of representatives of the m clusters

• Use cost function

  J_q(θ, U) = ∑_{i=1}^{N} ∑_{j=1}^{m} u_ij^q d(x_i, θ_j),

  where q is a fuzzifier that, when > 1, allows fuzzy clusterings to have lower cost than hard clusterings (Example 14.4, pp. 453–454)

8
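
As a quick sanity check of this claim (my own toy numbers, not Example 14.4), the sketch below compares J_q for a hard assignment and an even fuzzy split of one f.v. that is equidistant from two representatives, with q = 2:

import numpy as np

def J_q(U, D, q):
    # J_q(theta, U) = sum_i sum_j u_ij^q * d(x_i, theta_j),
    # with D[i, j] = d(x_i, theta_j) precomputed.
    return float(((U ** q) * D).sum())

D = np.array([[4.0, 4.0]])      # one f.v., distance 4 to each of two reps
hard = np.array([[1.0, 0.0]])   # hard assignment
even = np.array([[0.5, 0.5]])   # even fuzzy split
print(J_q(hard, D, q=2))        # 4.0
print(J_q(even, D, q=2))        # 2.0 -- lower than the hard cost when q > 1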


Fuzzy Clustering Algorithms

Minimizing J_q

• When fixing θ and minimizing w.r.t. U while satisfying the constraints, cannot use the simple method from slide 4

• Instead, use Lagrange multipliers (pp. 610–611) to enforce the constraint ∑_{j=1}^{m} u_ij = 1 ∀i:

  J(θ, U) = ∑_{i=1}^{N} ∑_{j=1}^{m} u_ij^q d(x_i, θ_j) − ∑_{i=1}^{N} λ_i (∑_{j=1}^{m} u_ij − 1)

  [the first term is J_q; the second term is 0 when ∑_{j=1}^{m} u_ij = 1 ∀i]

• Now minimize J w.r.t. U, yielding an update equation in terms of the λ_i's, solve for the λ_i's using the constraint, and end up with the final update equation for U

9

Fuzzy Clustering Algorithms

Minimizing J_q (cont'd)

  ∂J(θ, U)/∂u_rs = q u_rs^{q−1} d(x_r, θ_s) − λ_r = 0

  u_rs = (λ_r / (q d(x_r, θ_s)))^{1/(q−1)},   s = 1, …, m

  Use the constraint ∑_{j=1}^{m} u_rj = 1:

  ∑_{j=1}^{m} (λ_r / (q d(x_r, θ_j)))^{1/(q−1)} = 1

  λ_r = q / (∑_{j=1}^{m} (1 / d(x_r, θ_j))^{1/(q−1)})^{q−1}

  u_rs = 1 / ∑_{j=1}^{m} (d(x_r, θ_s) / d(x_r, θ_j))^{1/(q−1)}

10
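
The closed-form update for u_rs translates directly into code; a minimal sketch, assuming all d(x_r, θ_j) > 0 so the ratios are well defined (the function name is illustrative):

import numpy as np

def fuzzy_memberships(D, q):
    # u_rs = 1 / sum_j (d(x_r, theta_s) / d(x_r, theta_j))^(1/(q-1))
    # D: (N, m) with D[r, j] = d(x_r, theta_j), assumed strictly positive.
    ratios = (D[:, :, None] / D[:, None, :]) ** (1.0 / (q - 1.0))
    return 1.0 / ratios.sum(axis=2)     # each row sums to 1 by construction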

Fuzzy Clustering Algorithms

Minimizing J_q (cont'd)

• For a fixed U, minimize J_q by minimizing w.r.t. each θ_j independently, so for j ∈ {1, …, m}, take the gradient and set it to the 0 vector:

  ∂J(θ, U)/∂θ_j = ∑_{i=1}^{N} u_ij^q ∂d(x_i, θ_j)/∂θ_j = 0

• As with the hard clustering scheme, alternate between U and θ until a termination condition is met

11

Fuzzy Clustering Algorithms

Generalized Fuzzy Algorithmic Scheme (GFAS)

• Initialize t = 0, θ_j(t) for j = 1, …, m

• Repeat until termination condition met

  – For i = 1 to N (assign membership values to f.v.'s)
    ∗ For j = 1 to m
      u_ij(t) = 1 / ∑_{k=1}^{m} (d(x_i, θ_j) / d(x_i, θ_k))^{1/(q−1)}

  – t = t + 1

  – For j = 1 to m, solve

      ∑_{i=1}^{N} u_ij^q(t−1) ∂d(x_i, θ_j)/∂θ_j = 0    (2)

    for θ_j and set θ_j(t) equal to it

• Example termination condition: Stop when ‖θ(t) − θ(t−1)‖ < ε, where ‖·‖ is any vector norm and ε is user-specified

12


Fuzzy c-Means (a.k.a. Fuzzy k-Means) Algorithm

• If we use GFAS with θ_j = point representative of C_j and DM = squared Euclidean distance, (2) becomes

  2 ∑_{i=1}^{N} u_ij^q(t−1) (θ_j − x_i) = 0

  θ_j(t) = ∑_{i=1}^{N} u_ij^q(t−1) x_i / ∑_{i=1}^{N} u_ij^q(t−1)    (3)

• Convergence guarantees?

• Instead of squared Euclidean distance, can use

  d(x_i, θ_j) = (x_i − θ_j)^T A (x_i − θ_j),    (4)

  where A is symmetric and positive definite

• Use of (4) doesn't change (3)

13
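
Combining the GFAS membership update (slide 12) with the representative update (3) gives the fuzzy c-means loop; a minimal sketch, assuming squared Euclidean distance and clipping distances away from zero (function name and defaults are my own):

import numpy as np

def fuzzy_c_means(X, theta_init, q=2.0, eps=1e-6, max_iter=300):
    # Fuzzy c-means: alternate the membership update (slide 12) with the
    # weighted-mean representative update (3).
    theta = theta_init.copy()                                   # (m, l)
    U = None
    for _ in range(max_iter):
        D = ((X[:, None, :] - theta[None, :, :]) ** 2).sum(axis=2)
        D = np.maximum(D, 1e-12)        # keep distances strictly positive
        U = 1.0 / ((D[:, :, None] / D[:, None, :]) ** (1.0 / (q - 1.0))).sum(axis=2)
        W = U ** q
        new_theta = (W.T @ X) / W.sum(axis=0)[:, None]          # equation (3)
        if np.linalg.norm(new_theta - theta) < eps:             # termination
            theta = new_theta
            break
        theta = new_theta
    return theta, U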

Possibilistic Clustering Algorithms

• Similar to fuzzy clustering algorithms, but don't require ∑_{j=1}^{m} u_ij = 1 ∀i ∈ {1, …, N}

• Still need u_ij ∈ [0, 1] ∀i, j and 0 < ∑_{i=1}^{N} u_ij ≤ N ∀j ∈ {1, …, m}

• In addition, for each i ∈ {1, …, N} require u_ij > 0 for some j

• Instead of measuring the degree of membership of x_i in C_j, now u_ij measures the degree of compatibility, i.e. the possibility that x_i belongs in C_j

• Cannot directly use the fuzzy cost function of slide 8, since it is trivially minimized with U = 0, violating our new constraint, so use

  J(θ, U) = ∑_{i=1}^{N} ∑_{j=1}^{m} u_ij^q d(x_i, θ_j) + ∑_{j=1}^{m} η_j ∑_{i=1}^{N} (1 − u_ij)^q,

  where η_j > 0 ∀j [the first term is the original cost function; the second term decreases as the u_ij's increase]

14

Possibilistic Clustering Algorithms

Minimizing J

  ∂J(θ, U)/∂u_ij = q u_ij^{q−1} d(x_i, θ_j) − q η_j (1 − u_ij)^{q−1} = 0

  u_ij = 1 / (1 + (d(x_i, θ_j) / η_j)^{1/(q−1)})    (5)

• Updating for θ is the same as for GFAS:

  ∑_{i=1}^{N} u_ij^q(t−1) ∂d(x_i, θ_j)/∂θ_j = 0

• Setting the η_j's can be done by running GFAS and then taking a weighted average of the DM between the x_i's and θ_j:

  η_j = ∑_{i=1}^{N} u_ij^q d(x_i, θ_j) / ∑_{i=1}^{N} u_ij^q

  [then run GPAS, the Generalized Possibilistic Algorithmic Scheme, after setting the η_j's]

15
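
A minimal sketch of update (5) and the η_j estimate above, assuming the memberships U and distances D come from a prior GFAS run (function names are illustrative):

import numpy as np

def estimate_eta(U, D, q=2.0):
    # eta_j = weighted average of d(x_i, theta_j) with weights u_ij^q,
    # where U and D come from a prior GFAS run.
    W = U ** q
    return (W * D).sum(axis=0) / W.sum(axis=0)

def possibilistic_memberships(D, eta, q=2.0):
    # u_ij = 1 / (1 + (d(x_i, theta_j) / eta_j)^(1/(q-1)))   -- equation (5)
    return 1.0 / (1.0 + (D / eta[None, :]) ** (1.0 / (q - 1.0)))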

Possibilistic Clustering Algorithms

Benefit: Less Sensitivity to Outliers

• Since u_ij is inversely proportional to d(x_i, θ_j), updates to θ are less sensitive to outliers

  [Figure: an outlier x_i lies at roughly the same distance d from each of four natural clusters C_1, C_2, C_3, C_4; distances within the "natural clusters" are << d, the distance to the outlier]

• For GHAS (slide 5), u_ij = 1 for one cluster and 0 for the others

• For GFAS (slide 12), u_ij = 1/m = 1/4 for each cluster

• For GPAS, u_ij gets arbitrarily small as d grows, and is independent of the distances to the other θ_k's

16


Possibilistic Clustering Algorithms

Benefit: Mode-Seeking Property

• Can write the cost function as J(θ, U) = ∑_{j=1}^{m} J_j for

  J_j = ∑_{i=1}^{N} u_ij^q d(x_i, θ_j) + η_j ∑_{i=1}^{N} (1 − u_ij)^q    (6)

  and get the same u_ij updates by minimizing the J_j's individually

• Rewriting (5) as d(x_i, θ_j) = η_j ((1 − u_ij) / u_ij)^{q−1} and substituting into (6) gives

  J_j = ∑_{i=1}^{N} u_ij^q η_j ((1 − u_ij) / u_ij)^{q−1} + η_j ∑_{i=1}^{N} (1 − u_ij)^q
      = η_j ∑_{i=1}^{N} (1 − u_ij)^{q−1} (u_ij + 1 − u_ij)
      = η_j ∑_{i=1}^{N} (1 − u_ij)^{q−1}

• So minimizing J ⇒ minimizing the J_j's ⇒ maximizing the u_ij's ⇒ minimizing d(x_i, θ_j) for each j

17

Possibilistic Clustering Algorithms

Benefit: Mode-Seeking Property (cont’d)

• Thus GPAS seeks out regions dense in f.v.'s, i.e. if m > k = the number of natural clusters, then a properly initialized GPAS will have coincident clusters

• So m need not be known a priori, only an upper bound and a proper initialization

18
