Delineating Metropolitan Housing Submarkets with Fuzzy Clustering Methods
Julie Sungsoon Hwang, Department of Geography, University of Washington
Jean-Claude Thill, Department of Geography, State University of New York at Buffalo
November 10, 2005, North American Meetings of Regional Science Association International
Outline
• Research objectives
• Methodology: specification
• Methodology: illustration
• Evaluating the performance of fuzzy clustering
• Conclusions
Research objectives
• Demonstrate the use of the fuzzy c-means (FCM) algorithm for delineating housing submarkets
  – Comparison to K-means
• Discuss empirical characteristics of FCM applied to given applications, in particular the choice of parameters
  – Cluster validity index
Challenges
• Are the boundaries of clusters crisp?
[Figure: Clusters A, B, and C plotted against X1 and X2, once for the housing market in metropolitan area q and once for the housing market in metropolitan area p]
Methodology: specification
• Our task is to group census tracts into homogeneous housing submarkets within a metropolitan area
• Using the fuzzy c-means algorithm
• In order to examine whether fuzzy set-based clustering can do a better job
• Implemented in 85 metropolitan areas
• Most of the data sets are public (e.g. 2000 Census)
• The whole procedure is automated in GIS
Methodology: flow chart
[Flow chart, for each metropolitan area:
National, Regional, Local, … layers → Census Tract Layer (rows 1 … n, candidate variables x1 x2 x3 … xm)
→ Stepwise regression (k ≤ m) → significant variables y1 y2 … yk (rows 1 … n)
→ Cluster Analysis (c ≤ n) → K-means: Hard Cluster Layer; Fuzzy c-means: Fuzzy Cluster Layer (clusters 1, 2, …, c)]

K-means membership matrix:
#    U1    U2    …   Uc
1    1     0     …   0
2    0     1     …   0
…    0     1     …   0
n    0     0     …   1

Fuzzy c-means membership matrix:
#    U1    U2    …   Uc
1    0.85  0.05  …   0.10
2    0.12  0.80  …   0.05
…    0.02  0.74  …   0.12
n    0.40  0.03  …   0.50

k: # selected variables
c: # submarkets
Uj: membership to cluster j
Explanatory variables for house price

Var_Name   Variable Definition               Data     Year  Spatial Unit
Socioeconomic/demographic Characteristics of Residents
pcincome   per capita income                 Census   2000  Census Tract
college    % college degree                  Census   2000  Census Tract
managep    % management workers              Census   2000  Census Tract
prodp      % production workers              Census   2000  Census Tract
famcpchl   % family with children            Census   2000  Census Tract
nfmalone   % nonfamily living alone          Census   2000  Census Tract
black_p    % black                           Census   2000  Census Tract
nhwht_p    % non-hispanic white              Census   2000  Census Tract
nativebr   % native born                     Census   2000  Census Tract
Structural Characteristics of Housing Units
medroom    median number of rooms            Census   2000  Census Tract
hudetp     % detached housing units          Census   2000  Census Tract
yrhublt    median year structure built       Census   2000  Census Tract
Locational Characteristics (Amenities) of Neighborhoods
ptratio    pupil to teacher ratio            NCES*    2002  School District
schexp     school expenditure per student    NCES     2002  School District
vrlcrime   violent crime rate                FBI**    2003  Designated Place
prpcrime   property crime rate               FBI      2003  Designated Place
jobacm     job accessibility (Hansen 1959)   CTPP***  2000  Census Tract

*NCES: National Center for Education Statistics; **FBI annual report “Crime in the U.S. 2003”; ***CTPP: Census Transportation Planning Package
Dependent variable: median home value of owner-occupied housing units
Study set: 85 metropolitan areas
[Map: metropolitan areas (CMSA, MSA) and states, shown with and without the study set. Source: TIGER/Line 1999]
What is fuzzy c-means (FCM)?
• Clustering method that minimizes the following objective function:

$$\sum_{k=1}^{n} \sum_{i=1}^{c} (u_{ik})^{m} \, \| x_k - v_i \|_A^2$$

  x_k: vector of data point k, 1 ≤ k ≤ n
  v_i: center of cluster i, 1 ≤ i ≤ c
  u_ik: membership degree of data point k with cluster i; u_ik ∈ [0,1]
  m: fuzziness amount associated with assigning data point k to cluster i, 1 ≤ m < ∞

• Updates cluster means v_i and membership degree u_ik until the algorithm converges:

$$v_i = \frac{\sum_{k=1}^{n} (u_{ik})^{m} x_k}{\sum_{k=1}^{n} (u_{ik})^{m}} \qquad \text{(III-3a)}$$

$$u_{ik} = \left[ \sum_{j=1}^{c} \left( \frac{\| x_k - v_i \|_A}{\| x_k - v_j \|_A} \right)^{2/(m-1)} \right]^{-1} \qquad \text{(III-3b)}$$

Source: Bezdek 1981
[Figure: scatter plot of data points on axes x1 and x2]
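As a small worked example of the membership update (III-3b), the sketch below computes the membership degrees of a single data point from its distances to the cluster centers. The function name `memberships` and the distances 1 and 3 are hypothetical, chosen only so that the arithmetic is easy to follow; the formula requires m > 1.

```python
def memberships(distances, m):
    """Membership degrees of one data point in each cluster, per (III-3b).

    distances: distance from the point to every cluster center (all > 0)
    m: fuzziness amount; this formula needs m > 1
    """
    exponent = 2.0 / (m - 1.0)
    return [
        1.0 / sum((d_i / d_j) ** exponent for d_j in distances)
        for d_i in distances
    ]


# Hypothetical point: distance 1 from center 1 and distance 3 from center 2.
u = memberships([1.0, 3.0], m=2.0)
print([round(value, 3) for value in u])  # [0.9, 0.1]
```

With m = 2 the exponent is 2, so the first membership is 1 / (1 + (1/3)^2) = 0.9 and the second is 1 / (3^2 + 1) = 0.1. The memberships of a point always sum to 1, and they move toward 0 or 1 as m approaches 1.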
FCM: missing elements
• Optimal number of clusters c*
• Optimal fuzziness amount m*
Extended fuzzy c-means algorithm
• Step 1: Initialize the parameters related to fuzzy partitioning: c = 2 (2 ≤ c ≤ cmax), m = 1 (1 ≤ m ≤ mmax), where c is an integer and m is a real number; fix minc, the increment of m (0 < minc ≤ 0.1); fix the cut-off threshold L; choose a validity index v
• Step 2: Given c and m, initialize U(0) so that it is a fuzzy matrix. Then at step l, l = 0, 1, 2, …:
• Step 3: Calculate the c fuzzy cluster centers {vi(l)} with (III-3a) and U(l)
• Step 4: Update U(l+1) using (III-3b) and {vi(l)}
• Step 5: Compare U(l) to U(l+1) in a convenient matrix norm; if || U(l+1) – U(l) || ≤ L, go to Step 6; otherwise return to Step 3
• Step 6: Compute the validity index for the given c and m
• Step 7: If c < cmax, then increase c ← c + 1 and go to Step 3; otherwise go to Step 8
• Step 8: If m < mmax, then increase m ← m + minc and go to Step 3; otherwise go to Step 9
• Step 9: Obtain the optimal validity index from the values computed in Step 6, the optimal number of clusters c*, and the optimal fuzziness exponent m*; the optimal fuzzy partition U is obtained given c* and m*
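The steps above can be sketched in plain Python as a grid search over c and m. This is a minimal illustration under stated assumptions, not the authors' GIS implementation: the distance is Euclidean (A taken as the identity), the Xie-Beni index is used as the validity index and minimized, the grid starts at m = 1 + minc because the membership update is undefined at m = 1, U is re-initialized for every (c, m) pair, and the toy data and all function names are hypothetical.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance (assumes the norm matrix A is the identity)."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def fcm(X, c, m, tol, max_iter=300, seed=0):
    """Steps 2-5 for fixed c and m > 1; returns centers V and memberships U[i][k]."""
    rng = random.Random(seed)
    n, dim = len(X), len(X[0])
    # Step 2: random fuzzy partition; each data point's memberships sum to 1.
    U = [[rng.random() + 1e-9 for _ in range(n)] for _ in range(c)]
    for k in range(n):
        total = sum(U[i][k] for i in range(c))
        for i in range(c):
            U[i][k] /= total
    V = []
    for _ in range(max_iter):
        # Step 3: cluster centers from (III-3a).
        V = []
        for i in range(c):
            w = [U[i][k] ** m for k in range(n)]
            w_sum = sum(w)
            V.append([sum(w[k] * X[k][d] for k in range(n)) / w_sum for d in range(dim)])
        # Step 4: memberships from (III-3b); the floor avoids division by zero.
        new_U = [[0.0] * n for _ in range(c)]
        for k in range(n):
            d2 = [max(sq_dist(X[k], V[i]), 1e-12) for i in range(c)]
            for i in range(c):
                new_U[i][k] = 1.0 / sum((d2[i] / d2[j]) ** (1.0 / (m - 1.0)) for j in range(c))
        # Step 5: compare U(l) and U(l+1) in the max norm.
        change = max(abs(new_U[i][k] - U[i][k]) for i in range(c) for k in range(n))
        U = new_U
        if change <= tol:
            break
    return V, U


def xie_beni(X, V, U):
    """Xie-Beni index: fuzzy compactness over n times the minimum center separation."""
    n, c = len(X), len(V)
    compact = sum(U[i][k] ** 2 * sq_dist(X[k], V[i]) for i in range(c) for k in range(n))
    separation = min(sq_dist(V[i], V[j]) for i in range(c) for j in range(c) if i != j)
    return float("inf") if separation == 0 else compact / (n * separation)


def extended_fcm(X, c_max, m_max, m_inc=0.1, tol=1e-4):
    """Steps 6-9: evaluate every (c, m) on the grid and keep the smallest index."""
    best = None
    for s in range(1, int(round((m_max - 1.0) / m_inc)) + 1):  # Step 8 loop over m
        m = round(1.0 + s * m_inc, 10)
        for c in range(2, c_max + 1):                           # Step 7 loop over c
            V, U = fcm(X, c, m, tol)
            index = xie_beni(X, V, U)                           # Step 6
            if best is None or index < best[0]:
                best = (index, c, m, V, U)
    return best                                                 # Step 9


# Hypothetical toy data: three small blobs of ten points each.
rng = random.Random(1)
blobs = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
X = [[cx + rng.gauss(0, 0.5), cy + rng.gauss(0, 0.5)] for cx, cy in blobs for _ in range(10)]
index, c_star, m_star, V, U = extended_fcm(X, c_max=4, m_max=1.5)
print(c_star, m_star, index)
```

Because FCM can stop in a local optimum, results depend on the random initialization; repeating each (c, m) run with several seeds is a common safeguard that this sketch omits.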
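Cluster validity indices

Partition coefficient:

$$PC(U) = \frac{1}{n} \sum_{i=1}^{c} \sum_{k=1}^{n} (u_{ik})^{2}$$

Partition entropy:

$$PE(U) = -\frac{1}{n} \sum_{i=1}^{c} \sum_{k=1}^{n} \left[ u_{ik} \log_2 (u_{ik}) \right]$$

Xie-Beni index:

$$U_{XB} = \frac{\sum_{k=1}^{n} \sum_{i=1}^{c} (u_{ik})^{2} \, \| x_k - v_i \|_A^2}{n \, \min_{i \neq j} \| v_i - v_j \|^2}$$

SVi index, where w is set to 2 in this study:
[Formula only partly legible in the transcript. S_VI is a ratio whose numerator is

$$\sum_{i=1}^{c} \frac{\sum_{k=1}^{n} (u_{ik})^{m} \, \| x_k - v_i \|_A^2}{\sum_{k=1}^{n} u_{ik}}$$

and whose denominator is a double sum over i and j, each running from 1 to c+1 with j ≠ i, of weights indexed by ij that depend on the exponent w, multiplied by the distances ‖z_j − z_i‖_A, where

$$z = [z_1, z_2, \ldots, z_c, z_{c+1}]^T = [v_1, v_2, \ldots, v_c, \bar{x}]^T$$

The exact exponents in the denominator and in the weights are not recoverable.]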
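To make the first three indices concrete, here is a minimal sketch that evaluates the partition coefficient, partition entropy, and Xie-Beni index on tiny hand-made partitions. It assumes Euclidean distance, a base-2 logarithm in the entropy, and the convention that a term with u = 0 contributes 0; the function names and the four data points are hypothetical.

```python
import math


def partition_coefficient(U):
    """PC(U): sum of squared memberships divided by n; ranges from 1/c to 1."""
    n = len(U[0])
    return sum(u ** 2 for row in U for u in row) / n


def partition_entropy(U):
    """PE(U) with a base-2 logarithm; terms with u = 0 are skipped (count as 0)."""
    n = len(U[0])
    return -sum(u * math.log2(u) for row in U for u in row if u > 0) / n


def xie_beni(X, V, U):
    """Xie-Beni index: fuzzy compactness over n times the minimum center separation."""
    n, c = len(X), len(V)
    sq = lambda a, b: sum((p - q) ** 2 for p, q in zip(a, b))
    compact = sum(U[i][k] ** 2 * sq(X[k], V[i]) for i in range(c) for k in range(n))
    separation = min(sq(V[i], V[j]) for i in range(c) for j in range(c) if i != j)
    return compact / (n * separation)


# U[i][k] is the membership of data point k in cluster i.
crisp = [[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]]
vague = [[0.5, 0.5, 0.5, 0.5], [0.5, 0.5, 0.5, 0.5]]
X = [[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]]
V = [[0.0, 1.0], [10.0, 1.0]]

print(partition_coefficient(crisp), partition_coefficient(vague))  # 1.0 0.5
print(partition_entropy(vague))                                    # 1.0
print(xie_beni(X, V, crisp))                                       # 0.01
```

A crisp partition gives the largest PC and the smallest PE, which is why both indices tend to favor m close to 1; the Xie-Beni index also rewards well-separated centers, and smaller values are better.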
Determining c* and m*
• Selected validity indices are calibrated over the study set
Xie-Beni index is recommended as a validity index
Average m* is 1.38
[Figure: index value versus number of clusters c (2 to 15) for UXB, PC, PE, and SVI/100]
[Figure: index value versus fuzziness amount m (1.0 to 1.9) for UXB and SVI/100]
[Figure: histogram of m* for FCM]
Methodology: illustration
Median home value of Buffalo, NY
Dimensionality of Buffalo housing market
Hedonic regression equation of median home value in Buffalo, NY

Predictor                        Coefficient   Standard Error   t-statistic   p-value
Constant                         -1455768      164417           -8.85         0.000
Per capita income                2.3667        0.2791           8.48          0.000
% college degree                 88221         11346            7.78          0.000
% family: couple with children   65735         18775            3.50          0.001
% detached housing unit          -31260        5527             -5.66         0.000
Housing age (year)               692.88        80.26            8.63          0.000
% non-hispanic white             11186         3914             2.86          0.005
% native born status             130039        31111            4.18          0.000
Job accessibility                -0.05266      0.02227          -2.36         0.019

Adjusted R² = 84.3%
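As a quick arithmetic check on the regression table, each t-statistic should equal the coefficient divided by its standard error. The sketch below recomputes that ratio from the reported figures; the list name `rows` is only an illustrative container for values copied from the table.

```python
# (predictor, coefficient, standard error, reported t) copied from the table above.
rows = [
    ("Constant", -1455768, 164417, -8.85),
    ("Per capita income", 2.3667, 0.2791, 8.48),
    ("% college degree", 88221, 11346, 7.78),
    ("% family: couple with children", 65735, 18775, 3.50),
    ("% detached housing unit", -31260, 5527, -5.66),
    ("Housing age (year)", 692.88, 80.26, 8.63),
    ("% non-hispanic white", 11186, 3914, 2.86),
    ("% native born status", 130039, 31111, 4.18),
    ("Job accessibility", -0.05266, 0.02227, -2.36),
]

for name, coefficient, std_error, reported_t in rows:
    t = coefficient / std_error  # t-statistic = coefficient / standard error
    print(f"{name:32s} t = {t:6.2f} (reported {reported_t:6.2f})")
```

Every recomputed ratio agrees with the reported value to within rounding, so the table is internally consistent.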
Optimal number of housing submarkets c*, Optimal fuzziness amount m*, Buffalo, NY

c \ m   1.0      1.1      1.2      1.3      1.4       1.5       1.6       1.7       1.8       1.9
2       0.4735   0.4570   0.4380   8.0983   10.4115   12.5478   14.4334   16.0634   17.4645   18.6721
3       0.4136   0.3889   0.3460   0.3385   10.7864   12.9137   14.7939   16.4217   17.8290   19.0553
4       0.7802   0.7116   0.6080   0.5241   1.3154    6.8837    7.4807    8.0441    8.5632    9.0391
5       0.5560   0.5622   0.5940   0.6121   0.4683    0.3404    0.6489    0.6850    0.7206    0.7555
6       0.6223   0.7578   1.0187   0.8173   0.6907    1.3393    1.4074    1.4819    1.5595    1.6382
7       0.8836   0.6903   0.6881   0.6016   0.6148    0.9515    2.4397    2.6306    2.8317    3.0383
8       0.5981   0.5888   0.5703   0.5232   0.3992    0.7381    0.8910    1.2388    1.2926    1.3538
9       0.9645   0.6160   0.4836   0.4866   0.8449    1.4020    1.4198    1.8317    1.8639    1.9161
10      0.7053   0.6004   0.6619   0.5873   0.5868    1.3465    1.5081    1.6875    1.8215    1.8591
c*      3        3        3        3        8         5         5         5         5         5

Values in the cells represent the Xie-Beni index given c and m
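The c* row and the overall optimum can be reproduced from the table by taking the smallest Xie-Beni value in each column and then over the whole grid. The sketch below does this with the table's values; the variable names are illustrative.

```python
m_values = [1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9]

# Xie-Beni index for Buffalo, NY: one row per c, one column per m.
xb = {
    2: [0.4735, 0.4570, 0.4380, 8.0983, 10.4115, 12.5478, 14.4334, 16.0634, 17.4645, 18.6721],
    3: [0.4136, 0.3889, 0.3460, 0.3385, 10.7864, 12.9137, 14.7939, 16.4217, 17.8290, 19.0553],
    4: [0.7802, 0.7116, 0.6080, 0.5241, 1.3154, 6.8837, 7.4807, 8.0441, 8.5632, 9.0391],
    5: [0.5560, 0.5622, 0.5940, 0.6121, 0.4683, 0.3404, 0.6489, 0.6850, 0.7206, 0.7555],
    6: [0.6223, 0.7578, 1.0187, 0.8173, 0.6907, 1.3393, 1.4074, 1.4819, 1.5595, 1.6382],
    7: [0.8836, 0.6903, 0.6881, 0.6016, 0.6148, 0.9515, 2.4397, 2.6306, 2.8317, 3.0383],
    8: [0.5981, 0.5888, 0.5703, 0.5232, 0.3992, 0.7381, 0.8910, 1.2388, 1.2926, 1.3538],
    9: [0.9645, 0.6160, 0.4836, 0.4866, 0.8449, 1.4020, 1.4198, 1.8317, 1.8639, 1.9161],
    10: [0.7053, 0.6004, 0.6619, 0.5873, 0.5868, 1.3465, 1.5081, 1.6875, 1.8215, 1.8591],
}

# Best c for each m: the row with the smallest index in that column.
c_star_by_m = [min(xb, key=lambda c: xb[c][col]) for col in range(len(m_values))]
print(c_star_by_m)  # [3, 3, 3, 3, 8, 5, 5, 5, 5, 5]

# Overall optimum: the smallest index over the whole grid.
best_value, c_star, m_star = min(
    (xb[c][col], c, m_values[col]) for c in xb for col in range(len(m_values))
)
print(c_star, m_star, best_value)  # 3 1.3 0.3385
```

The global minimum, 0.3385, sits at c = 3 and m = 1.3, matching the c* = 3 and m* = 1.3 used for the Buffalo illustration.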
[Figure: cluster mean (vertical axis, -1.5 to 1.5) by attribute vector (ZPCINCOME, ZCOLLEGE, ZFAMCPCHL, ZHUDETP, ZYRHUBLT, ZNHWHT_P, ZNATIVEBR, ZJOBACM) for Cluster 1, Cluster 2, and Cluster 3; c* = 3; m* = 1.3]
Buffalo housing submarkets
[Maps, each showing interstate highways and tracts with no data; membership degrees are shaded in ten classes:
(A) Membership to Cluster 1 (0 to 1)
(B) Membership to Cluster 2 (0 to 1)
(C) Membership to Cluster 3 (0 to 0.986)
(D) Defuzzified Clusters 1, 2, and 3]
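The slides do not state how the defuzzified clusters in panel (D) were derived. A common choice, assumed here, is the maximum-membership rule: each tract is assigned to the cluster in which its membership degree is highest. The membership values below are hypothetical.

```python
def defuzzify(memberships):
    """Assign each tract to the 1-based index of its highest-membership cluster."""
    return [row.index(max(row)) + 1 for row in memberships]


# Hypothetical membership degrees of three tracts in three clusters.
tracts = [
    [0.85, 0.05, 0.10],
    [0.12, 0.80, 0.08],
    [0.40, 0.10, 0.50],
]
print(defuzzify(tracts))  # [1, 2, 3]
```

The third tract shows what is lost in the process: it lands in Cluster 3 although it belongs to Cluster 1 almost as strongly, which is the information the membership maps (A) to (C) preserve.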
Evaluating the performance of fuzzy clustering
Compare FCM with K-means (KM)
• Compare the sum of squared error derived from KM (m=1) and FCM (m=m*) given c*

$$\sum_{k=1}^{n} \sum_{i=1}^{c} (u_{ik})^{2} \, \| x_k - v_i \|_A^2$$

Paired Samples Statistics
Pair 1    Mean       N    Std. Deviation   Std. Error Mean
j2_hcm    1026.546   85   3848.268377      417.4033
j2_fcm    745.7332   85   3022.266891      327.8109

Paired Samples Test (Pair 1: j2_hcm - j2_fcm)
Paired Differences
  Mean                                         280.8133
  Std. Deviation                               915.57126275
  Std. Error Mean                              99.30765
  95% Confidence Interval of the Difference    Lower 83.32912, Upper 478.2974
t                                              2.828
df                                             84
Sig. (2-tailed)                                .006

Fuzzy clustering outperforms crisp clustering
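The paired-samples t statistic can be recomputed from the reported summary figures: the standard error of the mean difference is the standard deviation of the differences divided by the square root of n, and t is the mean difference divided by that standard error. A minimal check, using only values from the output above:

```python
import math

n = 85                   # metropolitan areas in the study set
mean_diff = 280.8133     # mean of j2_hcm - j2_fcm
sd_diff = 915.57126275   # standard deviation of the paired differences

std_error = sd_diff / math.sqrt(n)
t = mean_diff / std_error
df = n - 1

print(round(std_error, 3), round(t, 3), df)  # 99.308 2.828 84
```

A positive mean difference means the K-means sum of squared error is larger than the FCM one on average, and the reported two-tailed significance of .006 indicates the gap is unlikely to be due to chance.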
Conclusions
• Fuzzy set theory provides a mechanism for handling the uncertainty involved in classification tasks
• The fuzzy c-means algorithm is of practical use in delineating housing submarkets
• Fuzzy set theory needs further attention in social science fields
• More work on the choice of parameters is needed