Examining Activity Patterns Using Fuzzy Clustering by D De Silva, University of Calgary JD Hunt,...
Examining Activity Patterns Using Fuzzy Clustering
by
D De Silva, University of Calgary
JD Hunt, University of Calgary
PROCESSUS Second International Colloquium
Toronto ON, Canada
June 2005
Overview
• Introduction
• Data
• Method
• Preliminary Results
• Conclusions
Introduction
• Context
  • Activity-based transport models increasing
  • Need for grouping into segments
  • At present seems largely based on received wisdom
• Motivations
  • Opportunity in Calgary
    • Large Household Activity Diary Survey
    • Interest in Activity-based model development
    • Willingness to explore issue of grouping
  • Increase understanding of activity patterns resulting from behavioral processes
Introduction
• Previous work
• Fair amount of work drawing in essence on three basic elements
• Data interpretation
• Similarity or Dissimilarity Measures
• Pattern Recognition Algorithms
Introduction
• Previous work (Contd.)
  • Data Interpretation
    • Some used time slices in 5 to 15 minute intervals (Recker et al; Wilson)
    • Others disagreed and used the number of stops made (Pas)
  • Similarity or Dissimilarity Measures
    • Similarity Matrix (Pas; Wilson; Ma)
    • Sequential Alignment Method (Wilson; Jun Ma)
    • Walsh-Hadamard transformation, a Fourier-type analysis (Recker et al)
  • Pattern Recognition Algorithms
    • All have used crisp clustering methods
Introduction
• Previous work (Contd.)
  • Groups with similar activities
    • Pas: 12 groups based on the number of non-home stops
    • Recker: 7 groups based on socio-economic data
    • Wilson: 8 groups similar to Recker
  • Applications
    • To model inter-shopping duration (Bhat)
    • Micro simulation of activity patterns (Kitamura et al; Kulkarni et al)
  • Extension: the work described here
    • Time Slices
    • Sequential Alignment Method
    • Fuzzy Clustering
Data
Household Activity Survey (HAS)
• 24-hour diary
• Fall of 2001
• Sample size
  • 8,400 households overall
  • 5,900 on weekdays
• 15-minute intervals
  • activity
  • location
• Activities in 19 categories
• Locations
  • X,Y
  • Home, Work, Travel, Other
• All household members
Activities Covered in HAS
• Travel (A)
• Pick Up Someone (B)
• Drop Off Someone (C)
• Work (D)
• School / Homework (E)
• Shopping (F)
• Daycare (G)
• Social (H)
• Eating (J)
• Entertainment / Leisure (K)
• Medical / Financial (L)
• Exercise (M)
• Religious / Civic (N)
• Sleeping (O)
• Household Chores (P)
• Park / Un-park Vehicle (X)
• Work-Travel (e.g. Taxi Driver) (Y)
• Out-of-Town (Z)
Example Sequence
• Activity sequence of
  • 30 min Sleep
  • 15 min Eat
  • 30 min Travel
  • 1 hr Work
• O O J A A D D D D
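The encoding on this slide can be sketched in a few lines of Python. This is only an illustration of the time-slice idea, assuming one letter per 15-minute interval and durations that are whole multiples of 15 minutes; the function name `encode_time_slices` is hypothetical and is not part of the survey tooling.

```python
# Illustrative sketch: expand diary episodes into one activity letter per
# 15-minute slice. Letters follow the HAS activity codes listed in the slides.
SLICE_MINUTES = 15  # diary resolution stated on the Data slide


def encode_time_slices(episodes):
    """Turn (activity_code, duration_minutes) pairs into a spaced letter string."""
    letters = []
    for code, minutes in episodes:
        if minutes % SLICE_MINUTES != 0:
            # Assumption of this sketch: durations are multiples of 15 minutes.
            raise ValueError("duration must be a multiple of 15 minutes")
        letters.extend(code * (minutes // SLICE_MINUTES))
    return " ".join(letters)


# Sleep 30 min, Eat 15 min, Travel 30 min, Work 1 hr
example = [("O", 30), ("J", 15), ("A", 30), ("D", 60)]
print(encode_time_slices(example))
```

For the example episodes this yields the same nine-letter string shown on the slide.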
Initial Sample for Testing
• Covered in this presentation
  • 75 persons
  • 50 households
  • Just activity type and weekdays (not location and weekends)
• Later consider:
  • Full sample
  • Weekends and weekdays
  • Location types as a further dimension
Method
[Flow chart, boxes in apparent processing order]
Data Set (Time Slices)
→ Sequential Alignment Method (ClustalG Software)
→ Dissimilarity Matrix
→ Fuzzy Clustering (S-Plus Software)
→ Fuzzy Cluster Memberships
→ Groups of Similar Activity Patterns
→ Cluster Center Interpretation
  • Socio-economic variable distribution
  • Fuzzy weighted frequency distributions
Sequential Alignment Method (SAM)
• Alignment methods first used in the field of molecular biology for DNA matching
• Activity travel patterns are intrinsically sequential
• SAM evaluation of a sequence of characters
  • Global alignment (whole sequence)
  • Local alignment (short sequence within entire sequence)
• Simplest case is pairwise alignment
Sequential Alignment Method
• Pairwise Alignment
  • Two character sequences
    • ID 1: O O J A A D D D D
    • ID 2: O O O J A D D D O
  • Elementary operations until equal
    • Insertions and deletions (indels)
    • Gaps
  • Gap insertion and extension penalties
  • Global alignment: Needleman and Wunsch algorithm, minimizing the distance or maximizing the similarity
    • ID 1: - O O J A A D D D D -
    • ID 2: O O O J A - D D D - O
    • Similarity Score = 70
  • Fewer operations indicate a more similar pair
• Gap Opening and Extension Penalties
  • Role of gap penalty
  • High value
    • Alignment compressed
    • Literally to matches, avoiding gapping
    • Resembles main activities at their relative times
    • Recommended values 8 and 3 (Wilson)
  • Low value
    • Identification of similar activities displaced during the day
    • Better pairwise comparison
    • Little similarity to the actual activity pattern
    • Recommended values 1 and 0.1 (Wilson)
  • Tested and accepted recommendation of low value for transportation research (Wilson)
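A minimal sketch of global pairwise alignment in the Needleman and Wunsch style may help make the role of the gap penalty concrete. It is not the ClustalG scoring: it assumes a single linear gap penalty rather than separate opening and extension penalties, and the `match`, `mismatch` and `gap` values are hypothetical, so it will not reproduce the similarity score of 70 quoted on the slide.

```python
# Sketch of Needleman-Wunsch global alignment with a linear gap penalty.
# Scoring values are illustrative assumptions, not those used by ClustalG.
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    n, m = len(a), len(b)
    # score[i][j] = best score aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align two characters
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


# The two sequences from the slide, written without spaces.
total, row1, row2 = needleman_wunsch("OOJAADDDD", "OOOJADDDO")
print(total)
print(row1)
print(row2)
```

Making `gap` more negative discourages gaps and compresses the alignment toward position-by-position matches, which is the high-penalty behaviour described above; a milder `gap` lets similar activities displaced in time line up.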
Sequential Alignment Method
• Multiple Alignment
• Extension of pairwise alignment to N dimensions
• Computational cost becomes enormous beyond about 10 sequences of reasonable length
• Approximation method based on data of pairwise alignment
• Use of ClustalG software by Wilson
Sequential Alignment Method
• Output is a Dissimilarity Matrix

          HH4104  HH4904  HH503   HH401   HH2103  HH2401
HH4104    0
HH4904    0.122   0
HH503     0.148   0.165   0
HH401     0.574   0.523   0.533   0
HH2103    0.553   0.5     0.511   0.224   0
HH2401    0.419   0.393   0.407   0.153   0.123   0
Fuzzy Clustering
• Partition clustering method
• Number of clusters k specified in advance
• The objects (activity patterns) are not assigned to a particular cluster but are assigned a membership ranging between 0 and 1 for all clusters
• Uses S-Plus software (Kaufman procedure)
• Dissimilarity matrix is input
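Fuzzy Clustering
• Minimize Objective Function (Kaufman)

$$\min \; \sum_{v=1}^{k} \frac{\sum_{i,j=1}^{n} u_{iv}^{2}\, u_{jv}^{2}\, d(i,j)}{2 \sum_{j=1}^{n} u_{jv}^{2}}$$

where
  $d(i,j)$ = dissimilarity matrix element
  $u_{iv}$ = membership of the $i$th object to the $v$th cluster

subject to
  $u_{iv} \ge 0$ for $i = 1, \dots, n$ and $v = 1, \dots, k$
  $\sum_{v=1}^{k} u_{iv} = 1$ for $i = 1, \dots, n$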
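To make the objective on this slide concrete, the following sketch evaluates it for a given dissimilarity matrix and membership matrix. It only computes the value; it does not perform the minimization that the S-Plus procedure carries out. The function name `fuzzy_objective` and the two-object example are hypothetical.

```python
# Sketch: evaluate the fuzzy clustering objective for given memberships.
# d is an n x n dissimilarity matrix, u is an n x k membership matrix.
def fuzzy_objective(d, u):
    n = len(d)
    k = len(u[0])
    total = 0.0
    for v in range(k):
        numerator = sum(
            (u[i][v] ** 2) * (u[j][v] ** 2) * d[i][j]
            for i in range(n)
            for j in range(n)
        )
        denominator = 2.0 * sum(u[j][v] ** 2 for j in range(n))
        if denominator > 0.0:  # skip a cluster that nobody belongs to
            total += numerator / denominator
    return total


# Two objects at dissimilarity 1, two clusters (illustrative values only).
d = [[0.0, 1.0], [1.0, 0.0]]
crisp = [[1.0, 0.0], [0.0, 1.0]]   # each object alone in its own cluster
fuzzy = [[0.5, 0.5], [0.5, 0.5]]   # memberships split evenly
print(fuzzy_objective(d, crisp))
print(fuzzy_objective(d, fuzzy))
```

In this toy case the crisp split has the lower objective value, because neither cluster contains a dissimilar pair.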
Fuzzy Clustering
• Number of clusters ?
• An Open question – To be determined as part of research
• Two quality indices from S-Plus
• Dunn’s Coefficient • Average Silhouette Value with Shadow plot
Fuzzy Clustering
• Dunn's Coefficient

$$F_k = \frac{1}{n} \sum_{i=1}^{n} \sum_{v=1}^{k} u_{iv}^{2}$$

where $F_k$ always lies in the range $[1/k, 1]$.
• Entirely fuzzy clustering: $u_{iv} = 1/k$, so $F_k = 1/k$
• Crisp clustering: $u_{iv} = 0$ or $1$, so $F_k = 1$
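Dunn's coefficient is simple enough to check by hand. The sketch below computes it from a membership matrix and confirms the two limiting cases named on the slide; the function name `dunn_coefficient` and the small matrices are illustrative.

```python
# Sketch: Dunn's partition coefficient F_k = (1/n) * sum_i sum_v u_iv^2.
def dunn_coefficient(u):
    n = len(u)
    return sum(m * m for row in u for m in row) / n


k = 3
entirely_fuzzy = [[1.0 / k] * k for _ in range(4)]   # every membership is 1/k
crisp = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

print(dunn_coefficient(entirely_fuzzy))  # lower bound, 1/k
print(dunn_coefficient(crisp))           # upper bound, 1
```

Values near 1/k therefore signal memberships spread almost evenly across clusters, and values near 1 signal an almost crisp partition.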
Fuzzy Clustering
• Average Silhouette Value (ASV) with Shadow plot
  • Strength of classification to the nearest crisp cluster compared to the next best cluster
  • Width of bar
    • 1: well classified
    • 0: between two clusters
    • < 0: badly classified (lies nearer the next best cluster)
  • Average value gives an approximation to the best number of clusters
  • ASV must be higher than 0.25
[Figure: shadow plot; horizontal axis: silhouette width, -0.2 to 1.0; average silhouette width: 0.4]
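The silhouette width can be sketched directly from a dissimilarity matrix and a crisp labelling. The example below uses the standard definition s(i) = (b - a) / max(a, b), where a is the mean dissimilarity to the object's own cluster and b the mean dissimilarity to the nearest other cluster. It reuses the six-person dissimilarity matrix shown earlier in the slides; the two-group labelling is an assumption made for illustration and is not the clustering reported in the results. At least two clusters are assumed.

```python
# Sketch: silhouette widths from a dissimilarity matrix and crisp labels.
def full_matrix(lower):
    """Expand a lower-triangular list of rows into a symmetric matrix."""
    n = len(lower)
    d = [[0.0] * n for _ in range(n)]
    for i, row in enumerate(lower):
        for j, value in enumerate(row):
            d[i][j] = d[j][i] = value
    return d


def silhouette_widths(d, labels):
    n = len(d)
    clusters = sorted(set(labels))
    widths = []
    for i in range(n):
        own = [j for j in range(n) if labels[j] == labels[i] and j != i]
        if not own:               # singleton cluster: width taken as 0
            widths.append(0.0)
            continue
        a = sum(d[i][j] for j in own) / len(own)
        b = min(
            sum(d[i][j] for j in range(n) if labels[j] == c) / labels.count(c)
            for c in clusters
            if c != labels[i]
        )
        widths.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return widths


# Lower triangle from the slide: HH4104, HH4904, HH503, HH401, HH2103, HH2401
lower = [
    [0.0],
    [0.122, 0.0],
    [0.148, 0.165, 0.0],
    [0.574, 0.523, 0.533, 0.0],
    [0.553, 0.5, 0.511, 0.224, 0.0],
    [0.419, 0.393, 0.407, 0.153, 0.123, 0.0],
]
labels = [0, 0, 0, 1, 1, 1]  # hypothetical two-group split for illustration
widths = silhouette_widths(full_matrix(lower), labels)
print(widths)
print(sum(widths) / len(widths))  # average silhouette value for this toy case
```

Each bar in a shadow plot corresponds to one such width, and the average is compared against the 0.25 threshold given above.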
Cluster Center Interpretation
• Distributions of socio-economic variables
  • Basis for grouping in subsequent modeling
  • Person characteristics:
    • Age
    • Gender
    • Person type category from survey
    • Employment status
  • Household characteristics: attributed to persons
    • Only income so far
    • Household structure later
• Fuzzy weighted frequency distributions
• Need for eventual crisp assignment
  • Potentially use logit to assign cluster membership values
  • Calibrate 'utility functions' for clusters with person characteristics
  • Use Monte Carlo to select specific cluster in each case
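The logit and Monte Carlo step is only proposed on this slide, so the following is a sketch under stated assumptions rather than the authors' procedure. It assumes a multinomial logit form, takes the cluster utilities as given (the numbers are hypothetical, since no utility functions have been calibrated), and draws one crisp cluster per person.

```python
import math
import random


# Sketch: multinomial logit probabilities from cluster utilities, followed by
# a Monte Carlo draw of one crisp cluster. Utilities here are made up.
def logit_probabilities(utilities):
    top = max(utilities)                       # subtract max for stability
    exps = [math.exp(u - top) for u in utilities]
    total = sum(exps)
    return [e / total for e in exps]


def draw_cluster(probabilities, rng):
    """Return a 1-based cluster index sampled with the given probabilities."""
    r = rng.random()
    cumulative = 0.0
    for index, p in enumerate(probabilities, start=1):
        cumulative += p
        if r < cumulative:
            return index
    return len(probabilities)                  # guard against rounding


rng = random.Random(2005)                      # fixed seed for repeatability
utilities = [0.4, -0.1, 1.2]                   # hypothetical utilities, clusters 1 to 3
probabilities = logit_probabilities(utilities)
print(probabilities)
print(draw_cluster(probabilities, rng))
```

In a calibrated model each utility would be a function of person characteristics such as age or employment status, as the slide suggests.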
Cluster Center Interpretation
• Fuzzy Weighted Frequency Distributions
  • The bar for a category in a cluster's histogram is the sum of that cluster's memberships over the people in the category, expressed as a percentage of the cluster's total membership over the entire sample

Fuzzy Gender Distribution

Individual  Cluster Membership        Gender  Cluster 1        Cluster 2        Cluster 3
ID          C1      C2      C3                M       F        M       F        M       F
HH1503      0.4665  0.3907  0.1428    F       0       0.4665   0       0.3907   0       0.1428
HH1504      0.4618  0.3587  0.1795    F       0       0.4618   0       0.3587   0       0.1795
HH1801      0.4511  0.3094  0.2395    M       0.4511  0        0.3094  0        0.2395  0
HH2102      0.4197  0.3927  0.1876    M       0.4197  0        0.3927  0        0.1876  0
HH2503      0.5391  0.3234  0.1375    M       0.5391  0        0.3234  0        0.1375  0
HH2504      0.5208  0.3346  0.1447    M       0.5208  0        0.3346  0        0.1447  0
Sum         2.8590  2.1095  1.0315            1.9307  0.9283   1.3601  0.7494   0.7092  0.3223
Share                                         68%     32%      64%     36%      69%     31%

[Figure: Fuzzy Weighted Frequency Distribution; bars for M and F by Cluster 1, Cluster 2, Cluster 3; vertical axis 0% to 80%]
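The fuzzy gender distribution can be recomputed from the six rows shown on this slide. The sketch below sums each cluster's memberships by category and divides by the cluster total; the function name `fuzzy_weighted_distribution` is hypothetical.

```python
# Sketch: fuzzy weighted frequency distribution of a categorical variable.
# Each row is (person ID, (C1, C2, C3) memberships, category), from the slide.
rows = [
    ("HH1503", (0.4665, 0.3907, 0.1428), "F"),
    ("HH1504", (0.4618, 0.3587, 0.1795), "F"),
    ("HH1801", (0.4511, 0.3094, 0.2395), "M"),
    ("HH2102", (0.4197, 0.3927, 0.1876), "M"),
    ("HH2503", (0.5391, 0.3234, 0.1375), "M"),
    ("HH2504", (0.5208, 0.3346, 0.1447), "M"),
]


def fuzzy_weighted_distribution(rows, n_clusters=3):
    """Return, per cluster, the membership-weighted share of each category."""
    totals = [dict() for _ in range(n_clusters)]
    for _person, memberships, category in rows:
        for v, u in enumerate(memberships):
            totals[v][category] = totals[v].get(category, 0.0) + u
    shares = []
    for cluster_totals in totals:
        cluster_sum = sum(cluster_totals.values())
        shares.append({c: w / cluster_sum for c, w in cluster_totals.items()})
    return shares


shares = fuzzy_weighted_distribution(rows)
for v, share in enumerate(shares, start=1):
    print(v, {c: round(100 * p) for c, p in sorted(share.items())})
```

Rounded to whole percentages, this reproduces the 68% / 32%, 64% / 36% and 69% / 31% male and female shares in the table for Clusters 1, 2 and 3.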
Results
• Sequential Alignment
  • Low vs High Gap Penalty Results
• Cluster plot for 3 clusters
[Figure: two cluster plots, titled Low Gap and High Gap; axes: Component 1, Component 2. Captions, in order: "These two components explain 40.11 % of the point variability." and "These two components explain 33.75 % of the point variability."]
Results
• Use low gap penalty, consistent with recommendation (1 and 0.1)
• Shadow Plot
[Figure: two shadow plots, titled Low Gap and High Gap; horizontal axis: silhouette width; average silhouette width 0.4 (Low Gap) and 0.3 (High Gap)]

Coefficient                Low Gap  High Gap
Dunn's Coefficient         0.4      0.33
Average Silhouette Value   0.4      0.3
Results
• Number of Clusters
• Clustal plot helps to see the potential range of the number of clusters
[Figure: Clustal plot with one label per person ID (HH2701 through HH2101); scale bar 0.1]
Results
• Number of Clusters
• Potential range 2 to 5
[Figure: the same Clustal plot, one label per person ID; scale bar 0.1]
Results
• Number of Clusters (k)
  • k = 2
  • Fk = 0.60, ASV = 0.42
[Figure: cluster plot; axes: Component 1, Component 2; "These two components explain 33.75 % of the point variability."]
[Figure: shadow plot; horizontal axis: silhouette width; average silhouette width: 0.42]
Results
• Number of Clusters (k)
  • k = 3
  • Fk = 0.43, ASV = 0.40
[Figure: cluster plot; axes: Component 1, Component 2; "These two components explain 33.75 % of the point variability."]
[Figure: shadow plot; horizontal axis: silhouette width; average silhouette width: 0.4]
Results
• Number of Clusters (k)
  • k = 4
  • Fk = 0.34, ASV = 0.32
[Figure: cluster plot; axes: Component 1, Component 2; "These two components explain 33.75 % of the point variability."]
[Figure: shadow plot; horizontal axis: silhouette width; average silhouette width: 0.32]
Results
• Number of Clusters (k)
  • k = 5
  • Fk = 0.28, ASV = 0.20
[Figure: cluster plot; axes: Component 1, Component 2; "These two components explain 33.75 % of the point variability."]
[Figure: shadow plot; horizontal axis: silhouette width; average silhouette width: 0.2]
Results
• Number of Clusters (k)?
  • Use 3 clusters for testing
  • Expect different results for total sample

       2 Clusters  3 Clusters  4 Clusters  5 Clusters
Fk     0.60        0.43        0.34        0.28
ASV    0.42        0.40        0.32        0.20
Fuzzy Cluster Memberships
• Output of S-Plus software
• HH2701 has almost equal memberships to all three clusters

Person ID  Crisp Cluster  C1      C2      C3        Person ID  Crisp Cluster  C1      C2      C3
HH1501     3              0.1118  0.1145  0.7737    HH3902     2              0.3065  0.5194  0.1741
HH1502     2              0.3406  0.4965  0.1628    HH3903     1              0.4534  0.4175  0.1291
HH1503     1              0.4665  0.3907  0.1428    HH4001     3              0.1598  0.1669  0.6732
HH1504     1              0.4618  0.3587  0.1795    HH4002     3              0.1311  0.1343  0.7346
HH1601     3              0.2210  0.2055  0.5735    HH4003     1              0.4774  0.3300  0.1925
HH1602     3              0.2728  0.2597  0.4675    HH4004     2              0.3372  0.4198  0.2431
HH1603     2              0.2940  0.5723  0.1338    HH401      3              0.2625  0.2978  0.4396
HH1604     3              0.2752  0.2475  0.4773    HH402      3              0.2372  0.2238  0.5390
HH1801     1              0.4511  0.3094  0.2395    HH403      3              0.0236  0.0266  0.9498
HH1901     2              0.3366  0.4470  0.2164    HH4101     3              0.2425  0.2451  0.5124
HH1902     3              0.1838  0.1768  0.6394    HH4102     3              0.0689  0.0699  0.8611
HH2101     3              0.1804  0.1883  0.6313    HH4103     1              0.5112  0.3297  0.1591
HH2102     1              0.4197  0.3927  0.1876    HH4104     1              0.4189  0.3856  0.1955
HH2103     3              0.2423  0.2359  0.5218    HH4601     3              0.1783  0.1897  0.6321
HH2401     3              0.2132  0.2344  0.5524    HH4602     2              0.2625  0.5847  0.1528
HH2402     2              0.3368  0.5158  0.1474    HH4701     2              0.3343  0.5097  0.1560
HH2501     3              0.1228  0.1401  0.7372    HH4901     3              0.1349  0.1447  0.7205
HH2502     3              0.2253  0.2470  0.5277    HH4902     3              0.1452  0.1658  0.6890
HH2503     1              0.5391  0.3234  0.1375    HH4903     1              0.5106  0.3130  0.1763
HH2504     1              0.5208  0.3346  0.1447    HH4904     2              0.3916  0.4309  0.1775
HH2701     2              0.3407  0.3412  0.3181    HH501      2              0.3251  0.3753  0.2996
HH3001     3              0.2563  0.2346  0.5092    HH502      3              0.2047  0.2015  0.5938
HH3002     1              0.5152  0.3272  0.1577    HH503      1              0.3978  0.3601  0.2421
HH3003     1              0.5384  0.3078  0.1538    HH504      1              0.5740  0.3004  0.1256
HH3101     2              0.3258  0.4150  0.2592    HH505      1              0.3750  0.3435  0.2815
HH3401     3              0.1330  0.1351  0.7319    HH506      2              0.3758  0.4553  0.1689
HH3402     2              0.3073  0.5744  0.1183    HH601      2              0.2976  0.3905  0.3120
HH3403     1              0.4152  0.3697  0.2150    HH602      3              0.1633  0.1670  0.6697
HH3601     3              0.2416  0.2391  0.5194    HH603      2              0.3796  0.3873  0.2331
HH3602     3              0.2240  0.2061  0.5698    HH801      3              0.1589  0.1695  0.6715
HH3603     1              0.4916  0.3428  0.1656    HH802      2              0.2771  0.6039  0.1190
HH3701     3              0.1898  0.1805  0.6297    HH803      2              0.2771  0.6039  0.1190
HH3702     2              0.3656  0.4784  0.1560    HH901      3              0.2316  0.2277  0.5406
HH3703     1              0.4717  0.3396  0.1886    HH902      3              0.2047  0.2205  0.5748
HH3704     1              0.4291  0.3709  0.2000    HH903      1              0.4616  0.3467  0.1918
HH3705     1              0.4291  0.3709  0.2000    HH904      1              0.6005  0.3023  0.0973
HH3706     1              0.3972  0.3268  0.2760    HH905      1              0.5167  0.3764  0.1069
HH3901     3              0.1108  0.1125  0.7767
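The "Crisp Cluster" column appears to be the cluster with the largest membership. A short sketch, using three rows copied from the table, shows that assignment and a simple margin that flags persons such as HH2701 whose memberships are nearly equal. The function names are hypothetical, and the margin measure is an illustrative addition rather than something reported in the slides.

```python
# Sketch: nearest crisp cluster and a membership margin for a few persons.
memberships = {
    "HH2701": (0.3407, 0.3412, 0.3181),
    "HH403": (0.0236, 0.0266, 0.9498),
    "HH1503": (0.4665, 0.3907, 0.1428),
}


def crisp_cluster(u):
    """1-based index of the cluster with the largest membership."""
    return max(range(len(u)), key=lambda v: u[v]) + 1


def membership_margin(u):
    """Gap between the largest and second-largest membership."""
    top, second = sorted(u, reverse=True)[:2]
    return top - second


for person, u in memberships.items():
    print(person, crisp_cluster(u), round(membership_margin(u), 4))
```

A small margin marks a person who sits between clusters, which is the information a purely crisp assignment would discard.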
Results
Fuzzy Weighted Frequency Distribution
[Figure: Gender Distribution; horizontal axis: Gender (M, F); vertical axis 0% to 80%; series: Cluster 1, Cluster 2, Cluster 3]
[Figure: Age Distribution; horizontal axis: Age (0-5, 6-10, 11-15, 16-20, 21-25, 26-30, 31-35, 36-40, 41-45, 46-50, 51-55, 56-60, 61-65, 66-70, 71-75, >75); vertical axis: Percentage (%), 0% to 30%; series: Cluster 1, Cluster 2, Cluster 3]
[Figure: Person Category Distribution; horizontal axis: Person Category (KEJS, SHS, PSS, AWNNC, AWNC, AO, Sen, YO); vertical axis: Percentage (%), 0% to 70%; series: Cluster 1, Cluster 2, Cluster 3]
[Figure: Employment Status; horizontal axis: Employment Status (Emp, Self_Emp, Un_Empl, Retrd, Home_M, Volun, Studt); vertical axis: Percentage (%), 0% to 80%; series: Cluster 1, Cluster 2, Cluster 3]
Results
Cluster Interpretation
[Figure: Annual Household Income; horizontal axis: Annual Household Income (Cad $), tick labels Refused, 25000-35000, 45000-55000, 65000-75000, 100000-125000, >150000; vertical axis: Percentage (%), 0% to 45%; series: Cluster 1, Cluster 2, Cluster 3]

Crisp presentation

Parameter          Category       Cluster 1  Cluster 2  Cluster 3
Gender             Male           64%        21%        61%
                   Female         36%        79%        39%
Person Type        KEJS           56%        0%         0%
                   SHS            0%         5%         0%
                   PSS            8%         5%         0%
                   AWNNC          4%         11%        90%
                   AWNC           0%         16%        10%
                   AO             0%         21%        0%
                   Sen            4%         26%        0%
                   YO             28%        16%        0%
Employment Status  Employed       4%         18%        90%
                   Self Employed  0%         12%        6%
                   Unemployed     0%         0%         0%
                   Retired        0%         12%        3%
                   Homemaker      4%         18%        0%
                   Volunteer      0%         24%        0%
                   Student        92%        18%        0%
Avg. Age                          12         39         39
Results
Cluster Interpretation: each cluster tends to contain more of the following
• Cluster 1
  • Students aged 5 to 15
  • Mainly KEJS and youths
• Cluster 2
  • Females
  • Seniors and other adults in age range 66-70
  • Retired persons, homemakers and volunteers
• Cluster 3
  • Males
  • 100% adult workers
  • Age in the 40s
  • Majority adult workers not needing a car to work
• Expect different results for total sample
Conclusions
• Method seems to work well to identify the clusters as intended, with no hurdles
• Fuzzy clustering better indicates strength of membership
• Best to have multiple measures of clustering "quality" when choosing the number of clusters
• Still work in progress
  • Results not complete, just for example
• But essential elements of analysis process set
Conclusions
• Future Work
• Proceeding to full sample of 8,400 households including Weekends
• Expanding to location dimension
• Calibrate Logit model for allocation of clusters
• Consider Household Structure
Thank You?