Clustering Algorithms Based on Cost Function Optimization
CSCE 970 Lecture 10: Clustering Algorithms Based on Cost Function Optimization
Stephen D. Scott
April 9, 2001
1
Introduction
• A very popular family of clustering algorithms
• General procedure: Define a cost function J
measuring the goodness of a clustering and
search for a parameter vector θ to
minimize/maximize J
• Search is subject to certain constraints, e.g.
that the result fits the definition of a clustering
(slide 7.8)
• E.g. θ = [m_1^T, m_2^T, …, m_m^T]^T = the point representatives of the m clusters
• θ can also be a vector of parameters for m
hyperplanar or hyperspherical representatives
• Typically algorithms assume m is known, so
may have to try several values in practice
• Skipping Bayesian-style approaches (Sec. 14.2)
and focusing on hard c-means (Isodata) and
fuzzy c-means algorithms
2
Hard Clustering Algorithms
Definitions
• Let U be an N ×m matrix whose (i, j)th entry
uij is 1 if f.v. xi is in cluster Cj and 0 otherwise
• To meet definition of hard clustering, ∀i ∈ {1, …, N} need
\sum_{j=1}^{m} u_{ij} = 1 and u_{ij} ∈ {0, 1}
• Let θ = [θ_1, …, θ_m] be an m-vector of representatives of the m clusters
• Use cost function
J(θ, U) = \sum_{i=1}^{N} \sum_{j=1}^{m} u_{ij} \, d(x_i, θ_j)
(a short code sketch of this cost follows the slide)
• Globally minimizing J given a set of f.v.’s means finding both a set of representatives θ and an assignment to clusters U that minimizes the dissimilarity measure (DM) between each f.v. and its representative
3
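A minimal NumPy sketch of evaluating this cost, assuming the dissimilarity measure d is squared Euclidean distance (the slide leaves d generic); the names X (an N×ℓ data matrix), U (the N×m hard membership matrix), and theta (an m×ℓ array of point representatives) are illustrative, not from the slides.

import numpy as np

def hard_cost(X, U, theta):
    """J(theta, U) = sum_i sum_j u_ij * d(x_i, theta_j),
    with d taken here as squared Euclidean distance."""
    # D[i, j] = ||x_i - theta_j||^2 for every (point, representative) pair
    D = ((X[:, None, :] - theta[None, :, :]) ** 2).sum(axis=2)
    return float((U * D).sum())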
Hard Clustering Algorithms
Minimizing J
• For a fixed θ, J is minimized and constraints
met iff
u_{ij} = 1 if d(x_i, θ_j) = \min_{1≤k≤m} d(x_i, θ_k), and u_{ij} = 0 otherwise
• For a fixed U , minimize J by minimizing w.r.t.
each θj independently, so for j ∈ {1, . . . , m},
take gradient and set to 0 vector:
\sum_{i=1}^{N} u_{ij} \frac{∂ d(x_i, θ_j)}{∂ θ_j} = 0
• Can alternate between the above two steps until a termination condition is met (a one-round sketch follows below)
4
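To make the two alternating steps concrete, here is a hedged sketch of one round of the alternation for the squared-Euclidean case, where the gradient condition gives the cluster mean; the array conventions follow the earlier sketch, and the empty-cluster handling is an assumption.

import numpy as np

def one_round(X, theta):
    """One alternation: fix theta and pick the cost-minimizing hard assignment U,
    then fix U and re-solve each theta_j (the mean of cluster j for squared Euclidean d)."""
    N, m = X.shape[0], theta.shape[0]
    D = ((X[:, None, :] - theta[None, :, :]) ** 2).sum(axis=2)
    U = np.zeros((N, m))
    U[np.arange(N), D.argmin(axis=1)] = 1.0      # u_ij = 1 only for the closest theta_j
    new_theta = np.array(theta, dtype=float)
    for j in range(m):
        members = X[U[:, j] == 1.0]
        if len(members) > 0:                     # keep theta_j unchanged if C_j is empty
            new_theta[j] = members.mean(axis=0)
    return U, new_theta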
Hard Clustering Algorithms
Generalized Hard Algorithmic Scheme (GHAS)
• Initialize t = 0, θj(t) for j = 1, . . . , m
• Repeat until termination condition met
– For i = 1 to N (assign f.v.’s to clusters)
∗ For j = 1 to m
u_{ij}(t) = 1 if d(x_i, θ_j) = \min_{1≤k≤m} d(x_i, θ_k), and u_{ij}(t) = 0 otherwise
– t = t + 1
– For j = 1 to m, solve
\sum_{i=1}^{N} u_{ij}(t−1) \frac{∂ d(x_i, θ_j)}{∂ θ_j} = 0    (1)
for θj and set θj(t) equal to it
• Example termination condition: Stop when
‖θ(t) − θ(t − 1)‖ < ε, where ‖ · ‖ is any vector
norm and ε is user-specified
5
Isodata (a.k.a. Hard k-Means
a.k.a. Hard c-Means) Algorithm
• Special case of GHAS: Each θj is point rep. of
Cj and DM is squared Euclidean distance
• So (1) becomes
\sum_{i=1}^{N} u_{ij}(t−1) \frac{∂}{∂ θ_j} \left( \sum_{k=1}^{ℓ} (x_{ik} − θ_{jk})^2 \right) = 0
⇓
\sum_{i=1}^{N} u_{ij}(t−1) \, [\, 2(θ_{j1} − x_{i1}), …, 2(θ_{jℓ} − x_{iℓ}) \,]^T = 0
⇓
2 θ_j \sum_{i=1}^{N} u_{ij}(t−1) = 2 \sum_{i=1}^{N} u_{ij}(t−1) \, x_i
⇓
θ_j = \frac{1}{|C_j(t−1)|} \sum_{x_i ∈ C_j(t−1)} x_i = C_j(t−1)’s mean vector
6
Isodata Algorithm
Pseudocode
• Initialize t = 0, θj(t) for j = 1, . . . , m
• Repeat until ‖θ(t)− θ(t − 1)‖ = 0
– For i = 1 to N
∗ Find closest rep. for xi, say θj, and set
b(i) = j
– For j = 1 to m
∗ Set θj = mean of {xi ∈ X : b(i) = j}
• Guaranteed to converge (to a local minimum of J) if squared Euclidean distance is used
• If e.g. Euclidean distance is used, cannot guarantee this (a runnable Python sketch of this pseudocode follows the slide)
7
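Below is a hedged Python sketch of this pseudocode; the initialization (sampling m data points), the iteration cap, and the empty-cluster handling are my own assumptions, not specified on the slide.

import numpy as np

def isodata(X, m, max_iter=100, seed=0):
    """Hard c-means (Isodata): alternate nearest-representative assignment
    and cluster-mean updates until theta stops changing."""
    rng = np.random.default_rng(seed)
    theta = X[rng.choice(len(X), size=m, replace=False)].astype(float)  # assumed init
    b = np.zeros(len(X), dtype=int)
    for _ in range(max_iter):
        D = ((X[:, None, :] - theta[None, :, :]) ** 2).sum(axis=2)
        b = D.argmin(axis=1)                     # b(i) = index of closest representative
        new_theta = theta.copy()
        for j in range(m):
            members = X[b == j]
            if len(members) > 0:                 # keep old theta_j if cluster j is empty
                new_theta[j] = members.mean(axis=0)
        if np.allclose(new_theta, theta):        # stand-in for ||theta(t) - theta(t-1)|| = 0
            theta = new_theta
            break
        theta = new_theta
    return theta, b

With squared Euclidean distance, neither step can increase J, which is what drives the convergence claim above.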
Fuzzy Clustering Algorithms
Definitions
• Let U be an N ×m matrix whose (i, j)th entry
uij quantifies the fuzzy membership of xi in
cluster Cj
• To meet definition of fuzzy clustering, ∀i ∈ {1, …, N} need
\sum_{j=1}^{m} u_{ij} = 1, ∀j ∈ {1, …, m} need 0 < \sum_{i=1}^{N} u_{ij} < N, and ∀i, j need u_{ij} ∈ [0, 1]
• Let θ = [θ_1, …, θ_m] be an m-vector of representatives of the m clusters
• Use cost function
J_q(θ, U) = \sum_{i=1}^{N} \sum_{j=1}^{m} u_{ij}^q \, d(x_i, θ_j),
where q is a fuzzifier that, when > 1, allows fuzzy clusterings to have lower cost than hard clusterings (Example 14.4, pp. 453–454); a short code sketch of J_q follows the slide
8
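For comparison with the hard cost on slide 3, a one-function sketch of J_q under the same squared-Euclidean assumption; U here holds fuzzy memberships in [0, 1] and q is the fuzzifier.

import numpy as np

def fuzzy_cost(X, U, theta, q):
    """J_q(theta, U) = sum_i sum_j u_ij^q * d(x_i, theta_j)."""
    D = ((X[:, None, :] - theta[None, :, :]) ** 2).sum(axis=2)
    return float(((U ** q) * D).sum())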
Fuzzy Clustering Algorithms
Minimizing Jq
• When fixing θ and minimizing w.r.t. U while
satisfying constraints, cannot use simple method
from slide 4
• Instead, use Lagrange multipliers (pp. 610–611)
to enforce the constraint \sum_{j=1}^{m} u_{ij} = 1 ∀i:
J(θ, U) = \underbrace{\sum_{i=1}^{N} \sum_{j=1}^{m} u_{ij}^q \, d(x_i, θ_j)}_{J_q} − \underbrace{\sum_{i=1}^{N} λ_i \left( \sum_{j=1}^{m} u_{ij} − 1 \right)}_{= 0 \ \text{when} \ \sum_{j=1}^{m} u_{ij} = 1 \ ∀i}
• Now minimize J w.r.t. U, yielding an update equation in terms of the λ_i’s, solve for the λ_i’s using the constraint, and end up with the final update equation for U
9
Fuzzy Clustering Algorithms
Minimizing Jq (cont’d)
∂J(θ, U)/∂u_{rs} = q \, u_{rs}^{q−1} \, d(x_r, θ_s) − λ_r = 0
⇓
u_{rs} = \left( \frac{λ_r}{q \, d(x_r, θ_s)} \right)^{1/(q−1)},  s = 1, …, m
⇓  (use constraint \sum_{j=1}^{m} u_{rj} = 1)
\sum_{j=1}^{m} \left( \frac{λ_r}{q \, d(x_r, θ_j)} \right)^{1/(q−1)} = 1
⇓
λ_r = \frac{q}{\left( \sum_{j=1}^{m} \left( \frac{1}{d(x_r, θ_j)} \right)^{1/(q−1)} \right)^{q−1}}
⇓
u_{rs} = \frac{1}{\sum_{j=1}^{m} \left( \frac{d(x_r, θ_s)}{d(x_r, θ_j)} \right)^{1/(q−1)}}
(a vectorized code sketch of this final membership update follows the slide)
10
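A small vectorized sketch of this final membership update, computed for all points and clusters at once; D is assumed to hold the d(x_i, θ_j) values, q > 1 is the fuzzifier, and the eps guard (for a point sitting exactly on a representative) is an added assumption the derivation does not address.

import numpy as np

def fuzzy_memberships(D, q, eps=1e-12):
    """u_rs = 1 / sum_j (d(x_r, theta_s) / d(x_r, theta_j))^(1/(q-1))."""
    D = np.maximum(D, eps)                              # avoid dividing by a zero distance
    ratio = (D[:, :, None] / D[:, None, :]) ** (1.0 / (q - 1.0))   # ratio[r, s, j]
    return 1.0 / ratio.sum(axis=2)                      # rows sum to 1, as the constraint requires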
Fuzzy Clustering Algorithms
Minimizing Jq (cont’d)
• For a fixed U , minimize Jq by minimizing w.r.t.
each θj independently, so for j ∈ {1, . . . , m},
take gradient and set to 0 vector:
∂J(θ, U)/∂θ_j = \sum_{i=1}^{N} u_{ij}^q \frac{∂ d(x_i, θ_j)}{∂ θ_j} = 0
• As with the hard clustering scheme, alternate between U and θ until a termination condition is met
11
Fuzzy Clustering Algorithms
Generalized Fuzzy Algorithmic Scheme (GFAS)
• Initialize t = 0, θj(t) for j = 1, . . . , m
• Repeat until termination condition met
– For i = 1 to N (assign memb. values to f.v.’s)
∗ For j = 1 to m
u_{ij}(t) = \frac{1}{\sum_{k=1}^{m} \left( \frac{d(x_i, θ_j)}{d(x_i, θ_k)} \right)^{1/(q−1)}}
– t = t + 1
– For j = 1 to m, solve
\sum_{i=1}^{N} u_{ij}^q(t−1) \frac{∂ d(x_i, θ_j)}{∂ θ_j} = 0    (2)
for θj and set θj(t) equal to it
• Example termination condition: Stop when
‖θ(t) − θ(t − 1)‖ < ε, where ‖ · ‖ is any vector
norm and ε is user-specified
12
Fuzzy c-Means (a.k.a. Fuzzy k-Means)
Algorithm
• If we use GFAS with θj = point rep. of Cj and
DM = squared Euclidean dist., (2) becomes
2 \sum_{i=1}^{N} u_{ij}^q(t−1) (θ_j − x_i) = 0
⇓
θ_j(t) = \frac{\sum_{i=1}^{N} u_{ij}^q(t−1) \, x_i}{\sum_{i=1}^{N} u_{ij}^q(t−1)}    (3)
(a full fuzzy c-means sketch using (3) follows the slide)
• Convergence guarantees?
• Instead of squared Euclidean distance, can use
d(x_i, θ_j) = (x_i − θ_j)^T A (x_i − θ_j),    (4)
where A is symmetric and positive definite
• Use of (4) doesn’t change (3)
13
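Putting the GFAS membership update together with update (3) gives this hedged sketch of fuzzy c-means; the random initialization, tolerance, iteration cap, and eps guard are assumptions, and d is squared Euclidean as on this slide.

import numpy as np

def fuzzy_c_means(X, m, q=2.0, max_iter=100, tol=1e-6, seed=0):
    """Fuzzy c-means with point representatives and squared Euclidean distance."""
    rng = np.random.default_rng(seed)
    theta = X[rng.choice(len(X), size=m, replace=False)].astype(float)
    U = np.full((len(X), m), 1.0 / m)
    for _ in range(max_iter):
        D = ((X[:, None, :] - theta[None, :, :]) ** 2).sum(axis=2)
        D = np.maximum(D, 1e-12)
        ratio = (D[:, :, None] / D[:, None, :]) ** (1.0 / (q - 1.0))
        U = 1.0 / ratio.sum(axis=2)                       # GFAS membership update
        W = U ** q
        new_theta = (W.T @ X) / W.sum(axis=0)[:, None]    # update (3): u^q-weighted means
        if np.linalg.norm(new_theta - theta) < tol:       # ||theta(t) - theta(t-1)|| < eps
            theta = new_theta
            break
        theta = new_theta
    return theta, U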
Possibilistic Clustering Algorithms
• Similar to fuzzy clustering algorithms, but don’t require \sum_{j=1}^{m} u_{ij} = 1 ∀i ∈ {1, …, N}
• Still need u_{ij} ∈ [0, 1] ∀i, j and 0 < \sum_{i=1}^{N} u_{ij} ≤ N ∀j ∈ {1, …, m}
• In addition, require some uij > 0 ∀i ∈ {1, . . . , N}
• Instead of measuring degree of membership of x_i in C_j, now u_{ij} measures degree of compatibility, i.e. the possibility that x_i belongs in C_j
• Cannot directly use fuzzy cost function of slide
8 since it’s trivially minimized with U = 0,
violating our new constraint, so use:
J(θ, U) = \underbrace{\sum_{i=1}^{N} \sum_{j=1}^{m} u_{ij}^q \, d(x_i, θ_j)}_{\text{original cost func.}} + \underbrace{\sum_{j=1}^{m} η_j \sum_{i=1}^{N} (1 − u_{ij})^q}_{\text{decreases as the } u_{ij}\text{'s increase}}
where η_j > 0 ∀j (a short code sketch of this cost follows the slide)
14
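A sketch of this possibilistic cost, again with squared Euclidean d assumed; eta is the vector of the m penalty weights η_j > 0.

import numpy as np

def possibilistic_cost(X, U, theta, eta, q):
    """Original fuzzy-style term plus the penalty that discourages U = 0."""
    D = ((X[:, None, :] - theta[None, :, :]) ** 2).sum(axis=2)
    penalty = (eta * ((1.0 - U) ** q).sum(axis=0)).sum()  # sum_j eta_j sum_i (1 - u_ij)^q
    return float(((U ** q) * D).sum() + penalty)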
Possibilistic Clustering Algorithms
Minimizing J
∂J(θ, U)/∂u_{ij} = q \, u_{ij}^{q−1} \, d(x_i, θ_j) − q η_j (1 − u_{ij})^{q−1} = 0
⇓
u_{ij} = \frac{1}{1 + \left( \frac{d(x_i, θ_j)}{η_j} \right)^{1/(q−1)}}    (5)
• Updating for θ is same as for GFAS:
\sum_{i=1}^{N} u_{ij}^q(t−1) \frac{∂ d(x_i, θ_j)}{∂ θ_j} = 0
• Setting ηj’s can be done by running GFAS then
taking a weighted average of DM between xi’s
and θj:
η_j = \frac{\sum_{i=1}^{N} u_{ij}^q \, d(x_i, θ_j)}{\sum_{i=1}^{N} u_{ij}^q}
[then run GPAS after setting the η_j’s; a code sketch of (5) and this η_j estimate follows the slide]
15
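A sketch of the two pieces above: the membership update (5) and the η_j estimate from a GFAS run; D is assumed to hold the d(x_i, θ_j) values, U the GFAS memberships, and eta the vector of η_j's.

import numpy as np

def possibilistic_memberships(D, eta, q):
    """Equation (5): u_ij = 1 / (1 + (d(x_i, theta_j) / eta_j)^(1/(q-1)))."""
    return 1.0 / (1.0 + (D / eta[None, :]) ** (1.0 / (q - 1.0)))

def estimate_eta(U, D, q):
    """eta_j = sum_i u_ij^q d(x_i, theta_j) / sum_i u_ij^q, using a GFAS run's U and D."""
    W = U ** q
    return (W * D).sum(axis=0) / W.sum(axis=0)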
Possibilistic Clustering Algorithms
Benefit: Less Sensitivity to Outliers
• Since u_{ij} is inversely related to d(x_i, θ_j), updates to θ are less sensitive to outliers
[Figure: a single outlier x_i at distance d from each of the four cluster representatives C_1–C_4; the distances within the "natural clusters" are << d, the distance to the outlier]
• For GHAS (slide 5), uij = 1 for one cluster, 0
for others
• For GFAS (slide 12), uij = 1/m = 1/4 for each
cluster
• For GPAS, u_{ij} gets arbitrarily small as d grows, and is independent of distances to the other θ_k’s (see the numeric check below)
16
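A quick numeric check of these three bullets, with assumed values: an outlier equidistant (d = 100) from m = 4 representatives, η_j = 1 (the within-cluster scale), and q = 2.

d, m, q, eta = 100.0, 4, 2.0, 1.0
ghas_u = 1.0                                           # GHAS: 1 for one cluster, 0 for the rest
gfas_u = 1.0 / m                                       # GFAS: equal distances give u_ij = 1/4
gpas_u = 1.0 / (1.0 + (d / eta) ** (1.0 / (q - 1.0)))  # GPAS: about 0.0099, shrinking with d
print(ghas_u, gfas_u, gpas_u)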
Possibilistic Clustering Algorithms
Benefit: Mode-Seeking Property
• Can write cost function as J(θ, U) = \sum_{j=1}^{m} J_j for
J_j = \sum_{i=1}^{N} u_{ij}^q \, d(x_i, θ_j) + η_j \sum_{i=1}^{N} (1 − u_{ij})^q    (6)
and get same uij updates by minimizing Jj’s
individually
• Rewriting (5) as d(x_i, θ_j) = η_j \left( \frac{1 − u_{ij}}{u_{ij}} \right)^{q−1} and substituting into (6) gives
J_j = \sum_{i=1}^{N} u_{ij}^q \, η_j \left( \frac{1 − u_{ij}}{u_{ij}} \right)^{q−1} + η_j \sum_{i=1}^{N} (1 − u_{ij})^q
 = η_j \sum_{i=1}^{N} (1 − u_{ij})^{q−1} (u_{ij} + 1 − u_{ij})
 = η_j \sum_{i=1}^{N} (1 − u_{ij})^{q−1}
• So minimizing J ⇒ minimizing the J_j’s ⇒ maximizing the u_{ij}’s ⇒ minimizing d(x_i, θ_j) for each j (a quick numeric check of this simplification follows the slide)
17
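A quick numeric check of the simplification above, using assumed distances and memberships that satisfy (5); both expressions for J_j should agree.

import numpy as np

rng = np.random.default_rng(0)
q, eta_j = 2.5, 0.7                                   # assumed fuzzifier and eta_j
d = rng.uniform(0.1, 5.0, size=8)                     # assumed d(x_i, theta_j) values
u = 1.0 / (1.0 + (d / eta_j) ** (1.0 / (q - 1.0)))    # memberships from (5)
J_j = (u ** q * d).sum() + eta_j * ((1.0 - u) ** q).sum()   # J_j as in (6)
J_j_simplified = eta_j * ((1.0 - u) ** (q - 1.0)).sum()
print(np.isclose(J_j, J_j_simplified))                # True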
Possibilistic Clustering Algorithms
Benefit: Mode-Seeking Property (cont’d)
• Thus GPAS seeks out regions dense in f.v.’s,
i.e. if m > k = number of natural clusters, then
properly initialized GPAS will have coincident
clusters
• So m need not be known a priori, only an upper bound and proper initialization
18