The Use of Affinity Propagation to Cluster Socio-economic ... · analysis of large data sets. •...
Transcript of The Use of Affinity Propagation to Cluster Socio-economic ... · analysis of large data sets. •...
Ryan A. Meier
Central Michigan University
2015 IMAGIN Student Paper Competition
The Use of Affinity Propagation to Cluster Socio-economic Census Data
• Socioeconomic characteristic
• Define the people of the community
• Population
• Age
• Gender
• Race Identity
• Education
• Family Size
• Household Size
• Employment Sector
• Income
• Marital status
• Nativity
• Language Spoken
WHAT COMMUNITIES BEST REPRESENT THE UNITED STATES?
• Better understand the
heterogeneity of the U.S.
• Ideal sample sites
• New products
• Government policies
• Social studies
• Few geographic locations Large geographic region
PREVIOUS STUDIES
Lynd and Lynd (1929) conducted a study to explore the lifestyles within
a customary American community.
Selected Study Site Criteria for location
• U.S. Census Data
• Personal Observation
• Weather
• Population Growth
• Industry
• Artistic Life
• Data mining involves the analysis and extraction of
patterns previously unknown due to the size of the dataset (Kopanakis and Theodoulidis, 2003).
• Developments in data mining and statistical algorithms are
creating solutions to the problem of dealing with large data sets.
DATA MINING & LARGE DATA SETS
PREVIOUS STUDIES cont.
Geodemographics:
is a subfield which uses certain social, economic, and behavioral statistics to create a prediction model for marketers to identify geographic areas that are best suited for their product (Goss, 1995).
• John Robbin (PRIZM)
- 40 lifestyle clusters
- Effective marketing plan
• Michael Weiss
- The Clustering of America
- Cultural aspects
“Tell me someone’s zip code, and I can predict
what they eat, drink, drive—even think.” -Jonathan Robbin
PREVIOUS STUDIES cont.
• Dante Chinni and James
Gimpel (2010)
- Different “types” of
America
- Social, economic, and
political differences
- U.S. Census data
- County level
- Principal components
factor analysis
- Manually classified
borderline counties
Our Patchwork Nation
OBJECTIVE
• Map U.S. census socio-demographic data using affinity propagation to group zip codes into meaningful clusters.
• Identify exemplar locations of the U.S. to be used as ideal sample sites in future research.
• Combine GIS techniques with a novel statistical analysis.
• Demonstrate an objective method for the generalization and analysis of large data sets.
• Frey Labs, University of Toronto
• Clustering algorithm that expresses exemplars
− Most representative data point in the cluster
• Considers all data points as exemplars
• Parameters:
− Dissimilarity Matrix
− Preference Value
• Previous Studies
AFFINITY PROPAGATION
A visualization of AP cluster
classification with exemplar data points
(Bodenhofer, 2013 p. 3).
• Frey Labs, University of Toronto
• Clustering algorithm that expresses exemplars
− Most representative data point in the cluster
• Considers all data points as exemplars
• Parameters:
− Dissimilarity Matrix
− Preference Value
• Previous Studies
− Netflix movies
AFFINITY PROPAGATION
A visualization of AP cluster
classification with exemplar data points
(Bodenhofer, 2013 p. 3).
• Frey Labs, University of Toronto
• Clustering algorithm that expresses exemplars
− Most representative data point in the cluster
• Considers all data points as exemplars
• Parameters:
− Dissimilarity Matrix
− Preference Value
• Previous Studies
− Netflix movies
− Faces
AFFINITY PROPAGATION
A visualization of AP cluster
classification with exemplar data points
(Bodenhofer, 2013 p. 3).
• Frey Labs, University of Toronto
• Clustering algorithm that expresses exemplars
− Most representative data point in the cluster
• Considers all data points as exemplars
• Parameters:
− Dissimilarity Matrix
− Preference Value
• Previous Studies
− Netflix movies
− Faces
− American Landscapes
AFFINITY PROPAGATION
A visualization of AP cluster
classification with exemplar data points
(Bodenhofer, 2013 p. 3).
• Cardille and Lambois (2010)
• Objectively identify signature landscapes of the U.S.
• Aid in ecosystem management
• Land use/ Land cover satellite imagery
• 17 distinct landscapes
• Interesting insight into exemplar landscapes
− Human signature in almost every exemplar
FROM THE REDWOOD FOREST TO THE GULF STREAM WATERS: HUMAN SIGNATURE NEARLY
UBIQUITOUS IN REPRESENTATIVE US LANDSCAPES
METHODS
• 2010 U.S. Census Data and 2008-2012 five year estimate American Community Survey (ACS)
• Zip Code Tabulation Areas (ZCTAstm)
• 40 different attributes:
Population density, age, gender, race identity, educational attainment, family size, household size, employment sector, income, marital status, nativity and place of birth, and language spoken at home.
Download Data
from U.S. Census
Bureau
METHODS
• Null data values replaced or removed.
• The z-score for each variable was calculated to standardize the
dataset.
Download Data
from U.S. Census
Bureau
Format Data in to
Spreadsheet and
Calculate Z-scores
METHODS
• Reduced data size by 35% while maintaining 95% of information.
• Eliminated correlated data.
• Increased RAM efficiency.
• Decreased overall running time.
Download Data
from U.S. Census
Bureau
Format Data in to
Spreadsheet and
Calculate Z-scores
Run PCA on Entire
Dataset
METHODS
• Matrix of how ‘different’ each point is from every other point.
• Size of matrix is exponential to the number of data points.
• 1,052,418,481 pairwise dissimilarities for 32,441 ZCTAs.
• Negative weighted Euclidian distance between z-scores in n-dimensional space.
Download Data
from U.S. Census
Bureau
Format Data in to
Spreadsheet and
Calculate Z-scores
Run PCA on Entire
Dataset
Create a Dissimilarity
Matrix from PCA
Results − 𝑤𝑗 𝑥𝑗 − 𝑦𝑗
𝐽
𝑗=1
METHODS
• R package APCluster (Bodenhofer et al., 2011).
• Approximate run time was 20 hours using 50GB RAM on a 3.5 GHz
current Xeon processor.
• AP runs only on one processor.
Download Data
from U.S. Census
Bureau
Format Data in to
Spreadsheet and
Calculate Z-scores
Run PCA on Entire
Dataset
Create a Dissimilarity
Matrix from PCA
Results
Run AP Using
Dissimilarity Matrix
and Preference Value
from Preliminary Run
METHODS
• Mapped using ArcMap 10.2.
Download Data
from U.S. Census
Bureau
Format Data in to
Spreadsheet and
Calculate Z-scores
Run PCA on Entire
Dataset
Create a Dissimilarity
Matrix from PCA
Results
Run AP Using
Dissimilarity Matrix
and Preference Value
from Preliminary Run
Map Results
REFERENCES
Anderson, M. J. 1988. The American Census: A Social History. New Haven, CT: Yale University Press.
pp. 17-31.
Cardille, J. A., and Lambois, M., 2010. From The Redwood Forest to the Gulf Stream Waters: Human
Signature Nearly Ubiquitous in Representative US Landscapes. In: Frontiers in Ecology
and the Environment, 8(3): pp. 130-134.
Chang, C-J., and Shyue, S-W., 2009. A Study on the Application of Data Mining to Disadvantaged
Social Classes in Taiwan’s Population Census. In: Ecpert Systems with Applications,
36(1): pp. 510-518.
Chinni, D., and Gimpel, J., 2010. Our Patchwork Nation. New York, USA: Penguin Group Inc.
Dueck, D., and Frey, B. J., Feb. 2007. Clustering by Passing Messages Between Data Points. In:
Science 315: pp. 972-76.
Dueck, D., and Frey, B. J., Oct. 2007. Non-Metric Affinity Propagation for Un-Supervised Image
Categorization. In: Proceedings, 11th IEEE International Conference, Rio de Janeiro,
Brazil, Computer Vision, pp. 1-8.
Fan, B., 2009. A Hybrid Spatial Data Clustering Method for Site Selection: The Data Driven Approach
of GIS Mining. In: Expert Systems with Applications, 36(2 part II): pp. 3923-3936.
Furse, D. H., Punj, G. N., and Stewart, D. W., 1984. A Typology of Individual Search Strategies Among
Purchasers of New Automobiles. In: Journal of Consumer Research, 10(4): pp. 417-431.
REFERENCES cont.
Goss, J., 1995. “We Know Who You Are and We Know Where You Live”: The Instrumental Rationality
of Geodemographic Systems. In: Economic Geography, 71(2): pp. 171-198.
Green, P. E., Frank, R. E., and Robinson, P. J., 1967. Cluster Analysis in Test Market Selection. In:
Management Science (pre-1986), 13(8): pp. B387 (14).
Hanson, Sandra L., 2004. Classic Book Reviews: The Past Revived. In: Journal of Marriage and
Family 62, 3: pp. 847-49.
Karimipour, F., Delavar, M. R., and Kinaie, M., 2005. Water Quality Management Using GIS Data
Mining. In: Journal of Environmental Informatics, 5(2): pp. 61-72.
Keim, D. A., Panse, C., and Sips, M., 2003. Visual Data Mining of Large Spatial Data Sets. In:
Databases in Networked Information Systems, 2822: pp. 201-215.
Keim, D. A., Panse, C., Sips, M., and North, S. C., 2004. Pixel Based Visual Data Mining of Geo-spatial
Data. In: Computers & Graphics, 28: pp. 327-344.
Kopanakis, I., and Theodoulidis, B., 2003. Visual Data Mining Modeling Techniques for the
Visualization of Mining Outcomes. In: Journal of Visual Languages & Computing, 14(6):
pp. 543-589.
Lynd, Robert S., and Lynd Helen M., 1929. Middletown. New York, USA: Harcourt, Brace & World, I
nc. pp. 3-9.
REFERENCES cont.
Mennis, J., and Guo, D., 2009. Spatial Data Mining and Geographic Knowledge Discovery—An
introduction. In: Computers, Environment and Urban Systems, 33(6): pp. 403-408.
Murray, C., Kulkarni, S., Michaud, C., Tomijima, N., Bulzacchelli, M., Iandiorio, T., and Ezzati, M., 2006.
Eight Americas: Investigating Mortality Disparities Across Counties, and Race-Counties in
the United States. In: PLoS Medicine, 3(9): pp. 1513-1524.
Slocum, T., McMaster, R., Kessler, F., and Howard, H., 2009. Data Classification. In: Thematic
Cartography and Geovisualization 3rd, Upper Saddle River, NJ: Pearson Education Inc.
pp. 57-75.
Spielman, S. E., and Thill, J-C., 2008. Social Area Analysis, Data Mining, and GIS. In: Computers,
Environment and Urban Systems, 32(2): pp. 110-122.
Weiss, M. J., 1988. The Clustering of America. New York, USA: Harper & Row, Pubishers.
Winkle, K., 1991. The U.S. Census as a Source in Political History. In: Social Science History, 15(4): pp.
565-57.