American Clusters Classification Methodology


Transcript of American Clusters Classification Methodology

Page 1: American Clusters Classification Methodology

American Clusters Geodemographic Classification

METHODOLOGY

Page 2: American Clusters Classification Methodology

Geodemographics

Birds of a feather flock together (English proverb).

Geodemographics is the analysis of people by where they live (Sleight, 1996).

Geodemographic classification categorizes neighborhoods based on their socio-economic and lifestyle characteristics.

Page 3: American Clusters Classification Methodology

Applications of Geodemographic Classifications

Commercial: Market Research, Site Selection, Trade Area Analysis, Direct Marketing, Advertising Management, Media Analysis

Not-for-profit: Elections, Public Sector, Health Care, Education, Local Authority, Policing, Academic, Poverty Prevention, Charities

Source: adapted from Harris et al. (2005), OAC, and Bolton Council Planning Research

Commercial: people who live in an area where there is...

• A high probability of buying a particular type of newspaper.
• A high application rate for consolidation loans.
• A high consumption of fashion goods.

Not-for-profit: people who live in an area where there is...

• Low grades in secondary school exams.
• High risk of developing diabetes.
• High fear of crime, but low crime levels.

Page 4: American Clusters Classification Methodology

Open Geodemographics

The term “Open Geodemographics” was first used and applied by Dr. Dan Vickers, the author of “Multi-level Integrated Classification Based on the 2001 Census”. In this paper Dr. Vickers describes the methods and techniques used to build the National Classification of Census Output Areas (UK).

Dr. Vickers has made the methodology and results of his work available online for review and free download.

To find out more about Open Geodemographics please watch this presentation.

Page 5: American Clusters Classification Methodology

Project Objectives

The objective is to create an open and free geodemographic classification of the USA using Census 2000 results, following the methodology developed by Dr. Dan Vickers.

Page 6: American Clusters Classification Methodology

Project Methodology

1. Selection of cluster objects (operational taxonomic units)
2. Variables selection
3. Variables standardization
4. Clustering method selection
5. Identification of cluster number
6. Interpretation, testing and mapping of clusters

The classification building process was largely based on the methodology elaborated by Daniel Vickers in “Multi-level Integrated Classification Based on the 2001 Census” (2006). Other methodologies were also considered, such as the methodology for building MOSAIC described in “Geodemographics, GIS and Neighbourhood Targeting” by R. Harris, P. Sleight and R. Webber, Wiley (2005).

Page 7: American Clusters Classification Methodology

Classification Inputs

208,000 Census block groups of all 50 states

281,000,000 Overall population

106,000,000 Households

Page 8: American Clusters Classification Methodology

Data and Variables

Data: US Census 2000 is the major source of data for the classification.

Variables: Some researchers, such as Harris and Webber, suggest including as many variables as possible to create “more meaningful clusters” (Harris et al., 2005).

However, other researchers, such as Everitt and Vickers, hold that the minimum number of variables should be used in order to prevent data redundancy and collinearity (Vickers, 2006).

In our classification we try to use only “necessary” variables to avoid data redundancy and collinearity.

Page 9: American Clusters Classification Methodology

Variables Selection

1. Male
2. Female
3. Urban
4. Rural
5. White
6. Black
7. Native
8. Asian
9. Hawaiian
10. Some other race
11. Two or more races
12. Aged 0-4
13. Aged 5-14
14. Aged 15-24
15. Aged 25-44
16. Aged 45-64
17. Aged 65+
18. Foreign born
19. Not a citizen
20. Spanish language households
21. Other European languages households
22. Asian language households
23. One person households (under 65)
24. One person households (over 65)
25. One parent households
26. Family households with no children
27. Non family households
28. Married
29. Divorced
30. Widowed
31. Never married
32. Occupied by renters
33. Median rent
34. Avg. house size
35. One bedroom
36. Four bedrooms
37. Gas
38. Bottled gas, kerosene
39. Wood
40. No fuel
41. Median year house built
42. Median house value
43. Occupied by more than 1.5 persons per room
44. Occupied by fewer than 0.5 persons per room
45. Lacking complete plumbing facilities
46. Lacking complete kitchen facilities
47. Households without a mortgage
48. Second mortgage or home equity
49. Owner cost with mortgage
50. Owner cost without mortgage
51. Percent of monthly cost (with mortgage) exceeds 40
52. Percent of monthly cost (with mortgage) does not exceed 10
53. Percent of monthly cost (without mortgage) exceeds 25
54. High school degree
55. Higher education attainment
56. No car households
57. 2+ car households
58. Work at home
59. Car to work
60. Carpool to work
61. Public transport to work
62. Bicycle or walk to work
63. Long commuters (60+ min to work)
64. Short commuters (less than 15 min to work)
65. To work from 12 to 5 am
66. To work from 8 to 10 am
67. To work from 4 pm to 11:59 pm
68. Standardised Disability Ratio
69. Unemployed
70. Working part-time
71. People living below poverty line
72. Retirement income
73. Public assistance income
74. Supplemental Security Income
75. Social Security income for HH
76. Interest, dividends or net rental income for HH
77. Median HH income

List of initial 77 variables

Some of them had to be removed...

Page 10: American Clusters Classification Methodology

Variables Selection

The initial list of variables was reviewed and reduced by applying three analytical tools:

• Principal Component Analysis (PCA)/Factor analysis: variables with high factor loadings were selected

• Correlation Matrix: pairs of variables strongly correlated with each other were examined and only one within each pair was left for analysis

• Standard Deviation Evaluation: variables with low SD were not included because they vary little between block groups and do not bring value to the clustering process.
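The correlation and standard deviation screens described above can be sketched in a few lines of Python. This is only an illustration, not the procedure actually run for the project: the thresholds `max_corr` and `min_sd`, the function names, and the sample values are all assumptions, and the PCA step is left out.

```python
from statistics import mean, pstdev

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def screen_variables(data, max_corr=0.8, min_sd=0.05):
    """Return the variable names that survive both screens.

    data: dict mapping variable name -> list of values (one per block group).
    max_corr and min_sd are illustrative thresholds, not the study's values.
    """
    # Screen 1: drop variables that barely vary between block groups.
    kept = [name for name, values in data.items() if pstdev(values) >= min_sd]
    # Screen 2: within each strongly correlated pair, keep only the first variable.
    selected = []
    for name in kept:
        if all(abs(pearson(data[name], data[other])) <= max_corr
               for other in selected):
            selected.append(name)
    return selected

# Made-up shares for five block groups (not Census data).
sample = {
    "male": [0.49, 0.50, 0.49, 0.50, 0.49],      # almost constant: low SD
    "renters": [0.10, 0.40, 0.80, 0.30, 0.60],
    "owners": [0.90, 0.60, 0.20, 0.70, 0.40],    # mirror image of "renters"
    "aged_65_plus": [0.10, 0.30, 0.12, 0.05, 0.25],
}
print(screen_variables(sample))  # ['renters', 'aged_65_plus']
```

In this toy example "male" is dropped for low variation and "owners" is dropped because it is perfectly (negatively) correlated with "renters".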

Page 11: American Clusters Classification Methodology

Variables Selection

1. Urban population
2. Black
3. Asian
4. Aged under 5
5. Aged 5-14
6. Aged 15-24
7. Aged 25-44
8. Aged 45-64
9. Aged 65+
10. One person households (under 65)
11. One person households (over 65)
12. One parent households
13. Households with no children
14. Foreign born population
15. Spanish language households
16. Other European languages households
17. Never married
18. Occupied by renters
19. Median rent
20. One bedroom
21. Four bedrooms
22. Bottled gas, oil, kerosene as fuel
23. Median house value
24. Occupied by more than 1.5 persons per room
25. Households without a mortgage
26. Second mortgage or home equity
27. Monthly cost (with mortgage) less than 10%
28. Monthly cost (with mortgage) more than 40%
29. Higher education
30. Two car households
31. Car to work
32. Public transport to work
33. Bicycle or walk to work
34. Long commuters (60+ min to work)
35. To work from 8 to 10 am
36. Standardized Disability Ratio
37. Unemployed
38. Working part-time
39. People living below poverty line
40. Retirement income
41. Public assistance income
42. Social security income
43. Interest, dividends or net rental income
44. Median household income
45. Government workers
46. Self-employed
47. Agriculture, forestry, fishing and hunting and mining

The list of variables was reduced from 77 to 47 variables.

Page 12: American Clusters Classification Methodology

Variable Transformation

Variable transformation and standardization allowed for the inclusion of different types of data (e.g. percent of black population, median number of rooms, median household income).

Log transformation

Transforming the data to a logarithmic scale greatly reduced the problem of very high value outliers, because differences between values at the extremes of the data set are reduced by more than differences between more typical values. (Methods for Area Classification for Output Areas, National Statistics, UK)

http://www.statistics.gov.uk/about/methodology_by_theme/area_classification/oa/methodology.asp
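A small Python sketch shows how a log transformation pulls an extreme outlier toward the rest of the data. The base-10 logarithm and the `+ 1` offset (which keeps zero values defined) are assumptions made for illustration; the slides do not state which variant was used, and the house values are made up.

```python
import math

def log_transform(values):
    """Apply log10(x + 1); the +1 offset is an assumed way to handle zeros."""
    return [math.log10(v + 1) for v in values]

# Made-up median house values with one extreme outlier (not Census data).
house_values = [80_000, 95_000, 120_000, 150_000, 2_000_000]
logged = log_transform(house_values)

raw_spread = max(house_values) / min(house_values)   # ratio of largest to smallest
log_spread = max(logged) / min(logged)
print(f"max/min before: {raw_spread:.1f}, after: {log_spread:.2f}")
```

Before the transformation the largest value is 25 times the smallest; afterwards the two differ by well under a factor of two.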

Page 13: American Clusters Classification Methodology

Variables Transformation

Z-score standardization

Z-score standardization is the most popular method of data standardization, but in our project it showed poor results because highly skewed variables were given too much weight. This could lead to misleading clusters being formed in the clustering process.

Range standardization

To reduce the effect of highly skewed variables, range standardization was chosen. This method rescales each variable to the range between 0 (minimum value) and 1 (maximum value).
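Range standardization is commonly written as x' = (x - min) / (max - min). The following Python sketch applies that formula to one variable; the function name, the handling of constant variables, and the rent figures are illustrative assumptions.

```python
def range_standardize(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant variable has no spread; map it to 0 by convention (assumed).
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

rents = [400, 550, 700, 1000]      # made-up median rents
print(range_standardize(rents))    # [0.0, 0.25, 0.5, 1.0]
```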

Page 14: American Clusters Classification Methodology

Clustering Process

K-means

To create the classification, the k-means method (performed in SPSS) was chosen, as it works well on large datasets (Vickers, 2006).

The major issue with k-means is that the number of clusters has to be specified beforehand. To determine the number of clusters, two tests were performed:

Test 1: Evaluation of the average distance to the cluster center
Test 2: Cluster size range assessment
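For readers unfamiliar with k-means, here is a minimal Python sketch of the standard iterative procedure (often called Lloyd's algorithm). It is not the SPSS routine used in the project: the function names, the explicit `initial_centers` argument, and the toy points are assumptions chosen to keep the example deterministic.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, initial_centers, max_iterations=100):
    """Return (centers, labels) after iterating assignment and update steps."""
    centers = [tuple(c) for c in initial_centers]
    k = len(centers)          # the number of clusters is fixed beforehand
    labels = None
    for _ in range(max_iterations):
        # Assignment step: attach each point to its nearest center.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        if new_labels == labels:
            break             # assignments stopped changing
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:       # an empty cluster keeps its old center
                centers[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centers, labels

# Two made-up groups of range-standardized points.
points = [(0.0, 0.1), (0.1, 0.0), (0.05, 0.05),
          (0.9, 1.0), (1.0, 0.9), (0.95, 0.95)]
centers, labels = kmeans(points, initial_centers=[points[0], points[3]])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

The result of k-means depends on the starting centers, which is one reason the number of clusters and the quality of each solution have to be checked separately.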

Page 15: American Clusters Classification Methodology

Test 1. Evaluation of average distance to cluster center

It was suggested that the most useful number of clusters for the classification would be around 6 (Vickers, 2006), so the target number of k-means clusters was sought between 4 and 8. We looked for the solution with the most significant increase in average distance from the cluster center between two consecutive solutions. This increase was found at the 6-cluster solution.
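One way to read this test is sketched below in Python: compute the average distance to the cluster center for each solution, then look for the k at which that distance rises most when one cluster is removed. This reading of "increase between two consecutive solutions" is an assumption, and the distances in `avg_by_k` are invented purely to show the mechanics; they are not the project's results.

```python
import math

def average_distance(points, centers, labels):
    """Mean Euclidean distance from each point to its assigned cluster center."""
    total = sum(math.dist(p, centers[lab]) for p, lab in zip(points, labels))
    return total / len(points)

def largest_jump(avg_distance_by_k):
    """Return (k, jump) where jump = avg_distance[k - 1] - avg_distance[k] is largest.

    Interpreting the test this way is an assumption, not the documented method.
    """
    jumps = {
        k: avg_distance_by_k[k - 1] - avg_distance_by_k[k]
        for k in sorted(avg_distance_by_k)
        if k - 1 in avg_distance_by_k
    }
    best = max(jumps, key=jumps.get)
    return best, jumps[best]

# Tiny check of average_distance: distances are 1, 1 and 0.
toy = average_distance([(0.0, 0.0), (0.0, 2.0), (4.0, 0.0)],
                       centers=[(0.0, 1.0), (4.0, 0.0)],
                       labels=[0, 0, 1])

# Hypothetical average distances for the 4- to 8-cluster solutions.
avg_by_k = {4: 0.52, 5: 0.47, 6: 0.38, 7: 0.36, 8: 0.35}
best_k, jump = largest_jump(avg_by_k)
print(best_k)
```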

Page 16: American Clusters Classification Methodology

Test 2. Cluster size range

Another important factor was the size of the clusters: the more homogeneous the clusters are in terms of number of members, the better. For each solution we measured the mean range between the optimal cluster size (all clusters equal in number of members) and the actual cluster sizes; the lower the range, the better.
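A possible way to compute such a measure is shown in the Python sketch below. It treats the "mean range" as the mean absolute difference between each cluster's size and the equal-share size, which is an assumed interpretation; the function name and cluster sizes are made up.

```python
def mean_size_deviation(cluster_sizes):
    """Mean absolute gap between each cluster's size and the equal-share size."""
    ideal = sum(cluster_sizes) / len(cluster_sizes)   # size if all clusters were equal
    return sum(abs(size - ideal) for size in cluster_sizes) / len(cluster_sizes)

balanced = [250, 240, 260, 250]    # hypothetical 4-cluster solution
lopsided = [700, 100, 100, 100]    # same total, one dominant cluster

print(mean_size_deviation(balanced))   # 5.0
print(mean_size_deviation(lopsided))   # 225.0
```

When solutions with different numbers of clusters are compared, dividing the result by the equal-share size would put them on a common scale; whether the project did so is not stated.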

Page 17: American Clusters Classification Methodology

Number of Clusters

The 7-cluster solution was chosen.

The 4-cluster solution performed well in both tests, but was outperformed by the 5- and 7-cluster solutions, which performed better in the second test. The 6-cluster solution showed poor results in the second test, while the 8-cluster solution did not perform well in the first test.

Page 18: American Clusters Classification Methodology

Group of Clusters

These 7 clusters form the highest level of the classification hierarchy. They represent the Group of Clusters.

To build the second level, each resulting cluster was split into a number of smaller clusters using the same methods applied before.

As a result, we obtained 18 distinct clusters for the American Clusters Classification.

Page 19: American Clusters Classification Methodology

Clusters Hierarchy

1st level of classification hierarchy (Group of Clusters): 7 clusters

2nd level of classification hierarchy: 18 clusters

Page 20: American Clusters Classification Methodology

Analyzing Clusters

The 18 identified clusters are now ready to be analyzed.

The mean values of each cluster were compared with the mean values of the whole dataset.
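The comparison can be illustrated with a short Python sketch that expresses each cluster mean as a ratio of the dataset mean, so that 2.0 reads as "twice the overall average". The function name, variable names, and figures are hypothetical, and the sketch assumes no variable has an overall mean of zero.

```python
from statistics import mean

def cluster_profile(rows, labels, variable_names):
    """For each cluster, return the ratio of its mean to the dataset mean per variable."""
    overall = [mean(col) for col in zip(*rows)]        # dataset mean of each variable
    profile = {}
    for cluster in sorted(set(labels)):
        members = [row for row, lab in zip(rows, labels) if lab == cluster]
        cluster_means = [mean(col) for col in zip(*members)]
        profile[cluster] = {
            name: cm / om                              # assumes om is not zero
            for name, cm, om in zip(variable_names, cluster_means, overall)
        }
    return profile

# Made-up block groups: (share self-employed, median income in $1000s).
rows = [(0.2, 40.0), (0.4, 60.0), (0.6, 100.0), (0.8, 200.0)]
labels = [0, 0, 1, 1]
profile = cluster_profile(rows, labels, ["self_employed", "income"])
print(profile[1])
```

Ratios well above or below 1 point to the variables that characterize a cluster and can feed directly into its written description.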

Page 21: American Clusters Classification Methodology

Describing Clusters

Based on the comparison, clusters were analyzed and described. Then clusters were mapped and named.

Cluster 5.1 - Upscale Couples

“Upscale Couples” is a cluster of rich people, with a large share of men and women between 45 and 65, who live in suburban areas in the vicinity of large US cities.

“A significant share of this group's members are self-employed.”

“Those who work prefer to use a personal vehicle to get to work, and the majority of them leave home between 8 and 10 am.”

“Their incomes are two times higher than the US average.”

Image sources:

http://www.flickr.com/photos/loungerie/3029049309/

http://www.flickr.com/photos/whartz/1066907283/

Page 22: American Clusters Classification Methodology

Mapping Clusters

Mapping clusters using free tools from Google

Page 23: American Clusters Classification Methodology

Final Clusters

Page 24: American Clusters Classification Methodology

Clusters Description

Depressed Blocks

Low Income Families

Established Suburbs

Settled Achievers

Suburban Middles

Rural Despair

Rural Communities

Unfortunate Countryside

Retired Citizens

Satellite City Young Families

Small Town Communities

Upscale Couples

Prosperous B Boomers

Successful Families

Country Life

Farmers' Land

Multicultural Communities

Hispanic Families

To get a cluster description, click on one of the cluster names listed above.

Page 25: American Clusters Classification Methodology

References

Callingham, M. (2005), From areal classification to geodemographics, paper presented at the Demographic User Group Conference, Royal Society, London, 10th November 2005.

Debenham, J. E. (2002), Understanding Geodemographic Classification: Creating The Building Blocks For An Extension, Working Paper 02/1, School of Geography, University of Leeds [online] http://www.geog.leeds.ac.uk/wpapers/02-1.pdf

Harris, R., Sleight, P. and Webber, R. (2005), Geodemographics, GIS and Neighbourhood Targeting, London, Wiley.

Longley, P. A. (2005), Geographical Information Systems: a renaissance of geodemographics for public service delivery, Progress in Human Geography, 29(1).

Sleight, P. (2004), Targeting Customers: How to Use Geodemographic and Lifestyle Data in Your Business, Henley-on-Thames, World Advertising Research Centre.

Vickers, D. (2006), Multi-level Integrated Classification Based on the 2001 Census, The University of Leeds.

Webber, R. and Farr, M. (2001), MOSAIC: From an area classification system to household classification, Journal of Targeting, Measurement and Analysis for Marketing, 10(1).

Page 26: American Clusters Classification Methodology

To find out more, please visit World Clusters.org