8/4/2019 Cluster Analysis-Marketing Reasearch
Business Research Methods
Cluster Analysis
Cluster Analysis
It is a multivariate interdependence analysis that classifies individuals or objects into a small number of mutually exclusive and exhaustive groups, ensuring that there is as much likeness as possible within groups and as much difference as possible among groups.
Cluster analysis differs from multiple discriminant analysis in that the groups are not predefined. The purpose of cluster analysis is to determine how many distinct groups exist and to define their composition. It does not predict relationships and is not a dependence technique.
Uses of Cluster Analysis
It is a multivariate interdependence procedure ideally suited to segmentation applications in marketing.
A cluster, by definition, is a group of similar objects.
Segmentation involves identifying groups of target customers who are similar in buying habits, demographic characteristics, or psychographics.
There could be clusters of brands that are similar to each other and different from other clusters (for example, soaps grouped on the basis of various characteristics).
Cluster analysis is also used in selecting test market cities that are sufficiently similar that no extraneous variation can cause differences between the experimental and control markets (population, retail sales, number of retail outlets, proportion of various income groups, etc.).
Methods of Cluster Analysis
There are two basic types of methods:
1. Hierarchical or linkage methods
2. Non-hierarchical or nodal methods (k-means clustering)
In the first method we do not have to specify in advance how many clusters are to be extracted. The software provides a range of solutions, from 1 to n-1 clusters, where n is the number of objects being studied.
In the second method, the number of clusters to be extracted must be specified in advance.
Both methods use a distance measure to calculate distances between objects in terms of their various characteristics or variables.
The most commonly used measure is Euclidean distance: the square root of the sum of the squared differences in values for each variable.
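As a small illustration of the Euclidean distance just described, the sketch below applies it to the V1 to V6 ratings of cases 1 and 3 from the attitudinal data table that appears later in this deck. The function name `euclidean_distance` is only an illustrative choice, not part of any package.

```python
import math

def euclidean_distance(a, b):
    """Square root of the sum of squared differences across variables."""
    if len(a) != len(b):
        raise ValueError("both objects need the same number of variables")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Ratings on V1..V6 for cases 1 and 3 of the example data set.
case_1 = [6, 4, 7, 3, 2, 3]
case_3 = [7, 2, 6, 4, 1, 3]

# Squared differences are 1, 4, 1, 1, 1, 0, which sum to 8.
distance = euclidean_distance(case_1, case_3)
print(round(distance, 3))
```

The smaller the distance, the more alike two respondents are across the six attitude statements.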
Methods of Cluster Analysis
Generally, interval-scaled variables are ideally suited for cluster analysis.
Ratio-scaled variables can also be used.
Standardisation of variables is necessary if the units of measurement of the variables differ widely.
First, an initial clustering solution is obtained using a hierarchical procedure, such as average linkage or Ward's. The number of clusters so obtained is then used as input to a non-hierarchical procedure such as the optimizing partitioning method.
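The standardisation step can be sketched as a z-score transformation. The two variables below (`income` and `visits`) are hypothetical illustration data, not taken from the deck, and the use of the sample standard deviation is an assumption of this sketch.

```python
import statistics

def standardise(values):
    """Return z-scores: (value - mean) / sample standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample SD (n - 1 denominator)
    if sd == 0:
        raise ValueError("a constant variable cannot be standardised")
    return [(v - mean) / sd for v in values]

# Hypothetical illustration data measured in very different units.
income = [20000, 35000, 50000, 65000, 80000]  # monthly income
visits = [1, 2, 3, 4, 5]                      # store visits per month

z_income = standardise(income)
z_visits = standardise(visits)

# After standardisation both variables are on the same scale,
# so neither dominates a distance calculation.
print([round(z, 3) for z in z_income])
print([round(z, 3) for z in z_visits])
```

Without this step, the income differences (in the tens of thousands) would swamp the differences in store visits when distances are computed.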
Steps in Analysis
Run the hierarchical clustering programme on the variables (after standardisation if necessary).
Generate the output called the agglomeration schedule.
It shows all possible solutions from 1 to n-1 clusters, where n is the number of respondents or objects.
Going up from the bottom of the agglomeration schedule, look at the column called Coefficients to decide on the number of clusters.
In this column, starting from the bottom, calculate the difference in the value of the coefficient between neighbouring rows.
If the maximum value of this difference occurs, say, between the third and fourth rows from the bottom, it indicates there might be 3 clusters (the lower row number).
The dendrogram and icicle plot can also be requested and will give essentially the same information in graphical form.
Steps in Analysis
Once the number of clusters has been identified, a k-means clustering approach can be run on the data.
The number of clusters obtained in the first stage is used as input, and the output is obtained.
This output gives the initial and final cluster centres for each variable.
The final cluster centres are the best solution.
They are used to interpret the average value of each variable for a cluster and thereby describe the clusters.
Example
Clustering of consumers based on attitude towards shopping
Six attitudinal variables were identified.
Consumers were asked to express their degree of agreement with the following statements on a 7-point scale (1 = Disagree; 7 = Agree):
V1 = Shopping is fun
V2 = Shopping is bad for your budget
V3 = I combine shopping with eating out
V4 = I get best buys when shopping
V5 = I do not care about shopping
V6 = You can save a lot of money by comparing prices
Data obtained from 20 respondents are shown in the next slide.
In reality the sample size should be much larger.
Attitudinal Data For Clustering
Table 1
Case No.  V1  V2  V3  V4  V5  V6
1         6   4   7   3   2   3
2         2   3   1   4   5   4
3         7   2   6   4   1   3
4         4   6   4   5   3   6
5         1   3   2   2   6   4
6         6   4   6   3   3   4
7         5   3   6   3   3   4
8         7   3   7   4   1   4
9         2   4   3   3   6   3
10        3   5   3   6   4   6
11        1   3   2   3   5   3
12        5   4   5   4   2   4
13        2   2   1   5   4   4
14        4   6   4   6   4   7
15        6   5   4   2   1   4
16        3   5   4   6   4   7
17        4   4   7   2   2   5
18        3   7   2   6   4   3
19        4   6   3   7   2   7
20        2   3   2   4   7   2
Results of Hierarchical Clustering
Agglomeration Schedule Using Ward's Procedure
       Clusters combined                     Stage cluster first appears
Stage  Cluster 1  Cluster 2  Coefficient     Cluster 1  Cluster 2  Next stage
1      14         16         1.000000        0          0          6
2      6          7          2.000000        0          0          7
3      2          13         3.500000        0          0          15
4      5          11         5.000000        0          0          11
5      3          8          6.500000        0          0          16
6      10         14         8.160000        0          1          9
7      6          12         10.166667       2          0          10
8      9          20         13.000000       0          0          11
9      4          10         15.583000       0          6          12
10     1          6          18.500000       6          7          13
11     5          9          23.000000       4          8          15
12     4          19         27.750000       9          0          17
13     1          17         33.100000       10         0          14
14     1          15         41.333000       13         0          16
15     2          5          51.833000       3          11         18
16     1          3          64.500000       14         5          19
17     4          18         79.667000       12         0          18
18     2          4          172.662000      15         17         19
19     1          2          328.600000      16         18         0
Results of Hierarchical Clustering
Cluster Membership of Cases Using Ward's Procedure
              Number of Clusters
Label case    4    3    2
1             1    1    1
2             2    2    2
3             1    1    1
4             3    3    2
5             2    2    2
6             1    1    1
7             1    1    1
8             1    1    1
9             2    2    2
10            3    3    2
11            2    2    2
12            1    1    1
13            2    2    2
14            3    3    2
15            1    1    1
16            3    3    2
17            1    1    1
18            4    3    2
19            3    3    2
20            2    2    2
Vertical Icicle Plot Using Wards Method
Dendrogram Using Wards Method
Interpretation
Theoretical or practical considerations may suggest a certain number of clusters.
If the purpose is market segmentation, management may want a particular number of clusters.
The value in the Coefficients column suddenly more than doubles between stage 17 (3 clusters) and stage 18 (2 clusters).
Likewise, at the last two stages of the dendrogram, clusters are being combined at large distances.
It appears that a three-cluster solution is appropriate.
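The "more than doubles" observation can be checked with a few lines of code. The sketch below takes the coefficients of the last four stages from the Ward's agglomeration schedule and computes the ratio of each coefficient to the preceding one. Using the ratio (rather than the absolute difference) as the criterion is an assumption of this sketch, chosen to mirror the wording above; the names `last_stages` and `jumps` are illustrative.

```python
# Last four stages of the Ward's agglomeration schedule.
# Each entry: (stage, clusters remaining after the merge, coefficient).
last_stages = [
    (16, 4, 64.500),
    (17, 3, 79.667),
    (18, 2, 172.662),
    (19, 1, 328.600),
]

def jumps(stages):
    """For each stage, the ratio of its coefficient to the previous one,
    paired with the number of clusters that existed before that merge."""
    result = []
    for (_, clusters_before, prev), (stage, _, curr) in zip(stages, stages[1:]):
        result.append((stage, clusters_before, curr / prev))
    return result

stage_jumps = jumps(last_stages)
for stage, clusters_before, ratio in stage_jumps:
    print(f"stage {stage}: leaving the {clusters_before}-cluster solution "
          f"multiplies the coefficient by {ratio:.2f}")

# The solution that exists just before the largest relative jump.
suggested = max(stage_jumps, key=lambda item: item[2])[1]
print("suggested number of clusters:", suggested)
```

The largest relative jump occurs at stage 18, when the three-cluster solution is collapsed into two, which is the pattern the slide describes.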
Interpretation
A 3-cluster solution results in clusters with 8, 6 and 6 respondents.
A 4-cluster solution has 8, 6, 5 and 1 respondents.
It is not meaningful to have a cluster with only one case, so a 3-cluster solution is preferable.
Interpreting and profiling clusters involves examining the cluster centroids.
Cluster Centroids
Means of Variables
Cluster No. V1 V2 V3 V4 V5 V6
1 5.750 3.625 6.000 3.125 1.750 3.875
2 1.667 3.000 1.833 3.500 5.500 3.333
3 3.500 5.833 3.333 6.000 3.500 6.000
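A centroid is simply the column-wise mean of the cases in a cluster. As a hedged sketch, the code below recomputes the cluster 2 row from the Table 1 ratings of the six cases that Ward's procedure places in cluster 2 (cases 2, 5, 9, 11, 13 and 20). The helper name `centroid` is illustrative.

```python
# Table 1 ratings (V1..V6) for the cases assigned to cluster 2
# in the three-cluster Ward's solution.
cluster_2 = {
    2:  [2, 3, 1, 4, 5, 4],
    5:  [1, 3, 2, 2, 6, 4],
    9:  [2, 4, 3, 3, 6, 3],
    11: [1, 3, 2, 3, 5, 3],
    13: [2, 2, 1, 5, 4, 4],
    20: [2, 3, 2, 4, 7, 2],
}

def centroid(cases):
    """Mean of each variable over all cases in the cluster."""
    rows = list(cases.values())
    return [sum(column) / len(rows) for column in zip(*rows)]

centroid_2 = centroid(cluster_2)
print([round(value, 3) for value in centroid_2])
```

The rounded values agree with the cluster 2 row of the centroid table: low on V1 and V3, high on V5.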
Interpretation
Cluster 1 is high on
V1: Shopping is fun
V3: Combine shopping with eating out
and low on
V5: Do not care about shopping
FUN-LOVING & CONCERNED SHOPPERS (1, 3, 6, 7, 8, 12, 15, 17)
Cluster 2 is high on
V5: Do not care about shopping
and low on
V1: Shopping is fun
V3: Combine shopping with eating out
APATHETIC SHOPPERS (2, 5, 9, 11, 13, 20)
Cluster 3 is high on
V2: Shopping upsets budget
V4: Try to get best buys
V6: Can save a lot of money by comparing prices
ECONOMICAL SHOPPERS (4, 10, 14, 16, 18, 19)
Interpretation
Further profiling can be done on the basis of variables not used for clustering.
Demographic, psychographic, product usage, and media usage variables can be used to target marketing efforts for each cluster.
The variables that significantly differentiate between clusters can be identified through discriminant analysis.
Results of Nonhierarchical Clustering

Initial Cluster Centers
      Cluster
      1    2    3
V1    4    2    7
V2    6    3    2
V3    3    2    6
V4    7    4    4
V5    2    7    1
V6    7    2    3

Iteration History (a)
           Change in Cluster Centers
Iteration  1      2      3
1          2.154  2.102  2.550
2          0.000  0.000  0.000
a. Convergence achieved due to no or small distance change. The maximum distance by which any center has changed is 0.000. The current iteration is 2. The minimum distance between initial centers is 7.746.
Results of Nonhierarchical Clustering

Cluster Membership
Case Number  Cluster  Distance
1            3        1.414
2            2        1.323
3            3        2.550
4            1        1.404
5            2        1.848
6            3        1.225
7            3        1.500
8            3        2.121
9            2        1.756
10           1        1.143
11           2        1.041
12           3        1.581
13           2        2.598
14           1        1.404
15           3        2.828
16           1        1.624
17           3        2.598
18           1        3.555
19           1        2.154
20           2        2.102
Results of Nonhierarchical Clustering

Final Cluster Centers
      Cluster
      1    2    3
V1    4    2    6
V2    6    3    4
V3    3    2    6
V4    6    4    3
V5    4    6    2
V6    6    3    4

Distances between Final Cluster Centers
Cluster  1      2      3
1        -      5.568  5.698
2        5.568  -      6.928
3        5.698  6.928  -
Clustering Variables
In this instance, the units used for analysis are the variables, and the distance measures are computed for all pairs of variables.
Hierarchical clustering of variables can aid in the identification of unique variables, or variables that make a unique contribution to the data.
Clustering can also be used to reduce the number of variables. Associated with each cluster is a linear combination of the variables in the cluster, called the cluster component. A large set of variables can often be replaced by the set of cluster components with little loss of information. However, a given number of cluster components does not generally explain as much variance as the same number of principal components.
SPSS Windows
To select these procedures using SPSS for Windows, click:
Analyze>Classify>Hierarchical Cluster
Analyze>Classify>K-Means Cluster
Difference Between Cluster Analysis and Discriminant Analysis
Cluster analysis is a class of techniques used to classify objects or cases into relatively homogeneous groups called clusters. Objects in each cluster tend to be similar to each other and dissimilar to objects in the other clusters. Cluster analysis is also called classification analysis, or numerical taxonomy.
Both cluster analysis and discriminant analysis are concerned with classification. However, discriminant analysis requires prior knowledge of the cluster or group membership for each object or case included, to develop the classification rule. In contrast, in cluster analysis there is no a priori information about the group or cluster membership for any of the objects. Groups or clusters are suggested by the data, not defined a priori.
An Ideal Clustering Situation
(Figure: scatter plot with axes Variable 1 and Variable 2)
A Practical Clustering Situation
(Figure: scatter plot with axes Variable 1 and Variable 2; one point is marked X)
Statistics Associated with Cluster Analysis
Agglomeration schedule. An agglomeration schedule gives information on the objects or cases being combined at each stage of a hierarchical clustering process.
Cluster centroid. The cluster centroid is the set of mean values of the variables for all the cases or objects in a particular cluster.
Cluster centers. The cluster centers are the initial starting points in nonhierarchical clustering. Clusters are built around these centers, or seeds.
Cluster membership. Cluster membership indicates the cluster to which each object or case belongs.
Statistics Associated with Cluster Analysis
Dendrogram. A dendrogram, or tree graph, is a graphical device for displaying clustering results. Vertical lines represent clusters that are joined together. The position of the line on the scale indicates the distances at which clusters were joined. The dendrogram is read from left to right. Fig 2 is a dendrogram.
Distances between cluster centers. These distances indicate how separated the individual pairs of clusters are. Clusters that are widely separated are distinct, and therefore desirable.
Statistics Associated with Cluster Analysis
Icicle diagram. An icicle diagram is a graphical display of clustering results, so called because it resembles a row of icicles hanging from the eaves of a house. The columns correspond to the objects being clustered, and the rows correspond to the number of clusters. An icicle diagram is read from bottom to top. Figure 20.7 is an icicle diagram.
Similarity/distance coefficient matrix. A similarity/distance coefficient matrix is a lower-triangle matrix containing pairwise distances between objects or cases.
Conducting Cluster Analysis
1. Formulate the Problem
2. Select a Distance Measure
3. Select a Clustering Procedure
4. Decide on the Number of Clusters
5. Interpret and Profile Clusters
6. Assess the Validity of Clustering
Conducting Cluster Analysis
Formulate the Problem
Perhaps the most important part of formulating the clustering problem is selecting the variables on which the clustering is based.
Inclusion of even one or two irrelevant variables may distort an otherwise useful clustering solution.
Basically, the set of variables selected should describe the similarity between objects in terms that are relevant to the marketing research problem.
The variables should be selected based on past research, theory, or a consideration of the hypotheses being tested. In exploratory research, the researcher should exercise judgment and intuition.
A Classification of Clustering Procedures
Clustering Procedures
  Hierarchical
    Agglomerative
      Linkage Methods: Single, Complete, Average
      Variance Methods: Ward's Method
      Centroid Methods
    Divisive
  Nonhierarchical
    Sequential Threshold
    Parallel Threshold
    Optimizing Partitioning
Conducting Cluster Analysis
Select a Clustering Procedure: Hierarchical
Hierarchical clustering is characterized by the development of a hierarchy or tree-like structure. Hierarchical methods can be agglomerative or divisive.
Agglomerative clustering starts with each object in a separate cluster. Clusters are formed by grouping objects into bigger and bigger clusters. This process is continued until all objects are members of a single cluster.
Divisive clustering starts with all the objects grouped in a single cluster. Clusters are divided or split until each object is in a separate cluster.
Agglomerative methods are commonly used in marketing research. They consist of linkage methods, error sums of squares or variance methods, and centroid methods.
Conducting Cluster Analysis
Select a Clustering Procedure: Linkage Method
The single linkage method is based on minimum distance, or the nearest neighbor rule. At every stage, the distance between two clusters is the distance between their two closest points.
The complete linkage method is similar to single linkage, except that it is based on the maximum distance or the furthest neighbor approach. In complete linkage, the distance between two clusters is calculated as the distance between their two furthest points.
The average linkage method works similarly. However, in this method, the distance between two clusters is defined as the average of the distances between all pairs of objects, where one member of the pair is from each of the clusters.
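The three linkage rules differ only in how they summarise the pairwise distances between two clusters. The sketch below illustrates this with two small clusters built from Table 1 cases (cases 2 and 13 in one, cases 5 and 11 in the other). The grouping into these two clusters and the function name `linkage_distances` are assumptions made for illustration.

```python
import math
from itertools import product

# Table 1 ratings (V1..V6) for four cases, arranged into two small clusters.
cluster_a = [[2, 3, 1, 4, 5, 4],   # case 2
             [2, 2, 1, 5, 4, 4]]   # case 13
cluster_b = [[1, 3, 2, 2, 6, 4],   # case 5
             [1, 3, 2, 3, 5, 3]]   # case 11

def linkage_distances(a, b):
    """Single, complete and average linkage distances between clusters."""
    # Every pair with one member from each cluster.
    pairwise = [math.dist(p, q) for p, q in product(a, b)]
    return {
        "single": min(pairwise),                     # two closest points
        "complete": max(pairwise),                   # two furthest points
        "average": sum(pairwise) / len(pairwise),    # mean over all pairs
    }

result = linkage_distances(cluster_a, cluster_b)
for name, value in result.items():
    print(f"{name} linkage: {value:.3f}")
```

Here the four pairwise distances are sqrt(7), 2, 4 and 3, so single linkage reports 2, complete linkage 4, and average linkage their mean.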
Linkage Methods of Clustering
(Figure: three panels, each showing Cluster 1 and Cluster 2)
Single Linkage: Minimum Distance
Complete Linkage: Maximum Distance
Average Linkage: Average Distance
Conducting Cluster Analysis
Select a Clustering Procedure: Variance Method
The variance methods attempt to generate clusters to minimize the within-cluster variance.
A commonly used variance method is Ward's procedure. For each cluster, the means for all the variables are computed. Then, for each object, the squared Euclidean distance to the cluster means is calculated. These distances are summed for all the objects. At each stage, the two clusters whose merger produces the smallest increase in the overall within-cluster sum of squared distances are combined.
In the centroid methods, the distance between two clusters is the distance between their centroids (means for all the variables). Every time objects are grouped, a new centroid is computed.
Of the hierarchical methods, average linkage and Ward's method have been shown to perform better than the other procedures.
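Ward's criterion can be illustrated on the first three merges of the agglomeration schedule shown earlier (cases 14 and 16, then 6 and 7, then 2 and 13). The sketch below computes the error sum of squares for a cluster and the increase caused by a merge; the function names `ess` and `ward_increase` are illustrative, and reading the schedule's coefficient as the running total of these increases is an interpretation made for this sketch.

```python
def ess(cluster):
    """Error sum of squares: squared Euclidean distances to the cluster mean."""
    n = len(cluster)
    means = [sum(column) / n for column in zip(*cluster)]
    return sum((x - m) ** 2 for row in cluster for x, m in zip(row, means))

def ward_increase(a, b):
    """Increase in total within-cluster ESS caused by merging a and b."""
    return ess(a + b) - ess(a) - ess(b)

# Table 1 ratings (V1..V6) for the cases involved in the first three merges.
case_14 = [4, 6, 4, 6, 4, 7]
case_16 = [3, 5, 4, 6, 4, 7]
case_6 = [6, 4, 6, 3, 3, 4]
case_7 = [5, 3, 6, 3, 3, 4]
case_2 = [2, 3, 1, 4, 5, 4]
case_13 = [2, 2, 1, 5, 4, 4]

# Running total of the within-cluster sum of squares after each merge.
stage_1 = ward_increase([case_14], [case_16])
stage_2 = stage_1 + ward_increase([case_6], [case_7])
stage_3 = stage_2 + ward_increase([case_2], [case_13])
print(stage_1, stage_2, stage_3)
```

Under that reading, the running totals line up with the first three coefficients of the schedule (1.0, 2.0 and 3.5).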
Other Agglomerative Clustering Methods
(Figure: two panels)
Ward's Procedure
Centroid Method
Conducting Cluster Analysis
Select a Clustering Procedure
It has been suggested that the hierarchical and nonhierarchical methods be used in tandem. First, an initial clustering solution is obtained using a hierarchical procedure, such as average linkage or Ward's. The number of clusters and cluster centroids so obtained are used as inputs to the optimizing partitioning method.
Choice of a clustering method and choice of a distance measure are interrelated. For example, squared Euclidean distances should be used with the Ward's and centroid methods. Several nonhierarchical procedures also use squared Euclidean distances.
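The tandem approach can be sketched as a plain k-means loop seeded with the three Ward's centroids from the centroid table. This is a minimal illustration, not the SPSS procedure: the function names `nearest` and `k_means`, the stopping rule, and the iteration cap are assumptions of the sketch. It needs Python 3.8 or later for `math.dist`.

```python
import math

# Table 1 ratings (V1..V6), keyed by case number.
data = {
    1: [6, 4, 7, 3, 2, 3],  2: [2, 3, 1, 4, 5, 4],  3: [7, 2, 6, 4, 1, 3],
    4: [4, 6, 4, 5, 3, 6],  5: [1, 3, 2, 2, 6, 4],  6: [6, 4, 6, 3, 3, 4],
    7: [5, 3, 6, 3, 3, 4],  8: [7, 3, 7, 4, 1, 4],  9: [2, 4, 3, 3, 6, 3],
    10: [3, 5, 3, 6, 4, 6], 11: [1, 3, 2, 3, 5, 3], 12: [5, 4, 5, 4, 2, 4],
    13: [2, 2, 1, 5, 4, 4], 14: [4, 6, 4, 6, 4, 7], 15: [6, 5, 4, 2, 1, 4],
    16: [3, 5, 4, 6, 4, 7], 17: [4, 4, 7, 2, 2, 5], 18: [3, 7, 2, 6, 4, 3],
    19: [4, 6, 3, 7, 2, 7], 20: [2, 3, 2, 4, 7, 2],
}

# Seeds: the three-cluster Ward's centroids as reported in the centroid table.
seeds = [
    [5.750, 3.625, 6.000, 3.125, 1.750, 3.875],
    [1.667, 3.000, 1.833, 3.500, 5.500, 3.333],
    [3.500, 5.833, 3.333, 6.000, 3.500, 6.000],
]

def nearest(point, centers):
    """Index of the center with the smallest Euclidean distance to point."""
    distances = [math.dist(point, center) for center in centers]
    return distances.index(min(distances))

def k_means(cases, centers, max_iter=20):
    """Alternate assignment and center updates until no case moves."""
    assignment = {}
    for _ in range(max_iter):
        new_assignment = {c: nearest(row, centers) for c, row in cases.items()}
        if new_assignment == assignment:
            break  # converged: no case changed cluster
        assignment = new_assignment
        for k in range(len(centers)):
            members = [cases[c] for c, label in assignment.items() if label == k]
            if members:  # keep the old center if a cluster empties
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centers

assignment, centers = k_means(data, [list(seed) for seed in seeds])
sizes = [list(assignment.values()).count(k) for k in range(len(seeds))]
print("cluster sizes:", sizes)
```

With these seeds the partition stays at the 8, 6 and 6 split found by Ward's procedure, which suggests the hierarchical solution is stable for these data.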
Conducting Cluster Analysis
Decide on the Number of Clusters
Theoretical, conceptual, or practical considerations may suggest a certain number of clusters.
In hierarchical clustering, the distances at which clusters are combined can be used as criteria. This information can be obtained from the agglomeration schedule or from the dendrogram.
In nonhierarchical clustering, the ratio of total within-group variance to between-group variance can be plotted against the number of clusters. The point at which an elbow or a sharp bend occurs indicates an appropriate number of clusters.
The relative sizes of the clusters should be meaningful.
Conducting Cluster Analysis
Interpreting and Profiling the Clusters
Interpreting and profiling clusters involves examining the cluster centroids. The centroids enable us to describe each cluster by assigning it a name or label.
It is often helpful to profile the clusters in terms of variables that were not used for clustering. These may include demographic, psychographic, product usage, media usage, or other variables.
Conducting Cluster Analysis
Assess Reliability and Validity
1. Perform cluster analysis on the same data using different distance measures. Compare the results across measures to determine the stability of the solutions.
2. Use different methods of clustering and compare the results.
3. Split the data randomly into halves. Perform clustering separately on each half. Compare cluster centroids across the two subsamples.
4. Delete variables randomly. Perform clustering based on the reduced set of variables. Compare the results with those obtained by clustering based on the entire set of variables.
5. In nonhierarchical clustering, the solution may depend on the order of cases in the data set. Make multiple runs using different orders of cases until the solution stabilizes.