
8/4/2019 Cluster Analysis-Marketing Research

Business Research Methods

Cluster Analysis


Cluster Analysis

It is a multivariate interdependence analysis that classifies individuals or objects into a small number of mutually exclusive and exhaustive groups, ensuring that there is as much likeness within groups and as much difference among groups as possible.

Cluster analysis differs from multiple discriminant analysis in that the groups are not predefined. The purpose of cluster analysis is to determine how many distinct groups exist and to define their composition. It does not predict relationships and is not a dependence technique.


Uses of Cluster Analysis

It is a multivariate interdependence procedure ideally suited to segmentation applications in marketing.

A cluster is by definition a group of similar objects.

Segmentation involves identifying groups of target customers who are similar in buying habits, demographic characteristics or psychographics.

There could be clusters of brands that are similar to each other and different from other clusters (for example, soaps grouped on the basis of various characteristics).

Cluster analysis is also used to select test market cities that are sufficiently similar that no extraneous variation can cause differences between the experimental and control markets (for example, in population, retail sales, number of retail outlets and proportion of various income groups).


Methods of Cluster Analysis

There are two basic types of method:

1. Hierarchical or linkage methods

2. Nonhierarchical or nodal methods (k-means clustering)

In the first method we do not have to specify in advance how many clusters are to be extracted. The software provides a range of solutions, from 1 to n-1 clusters, where n is the number of objects being studied.

In the second method the number of clusters to be extracted must be specified in advance.

Both methods use a distance measure to calculate distances between objects in terms of their various characteristics or variables. The most commonly used measure is Euclidean distance: the square root of the sum of the squared differences in values for each variable.
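As a minimal sketch of the Euclidean distance just described, the Python below compares respondents 1 and 2 from the attitudinal data in Table 1. The function name `euclidean_distance` is an illustrative choice, not part of any package mentioned in these slides.

```python
import math

def euclidean_distance(a, b):
    """Square root of the sum of squared differences across variables."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Ratings on V1..V6 for cases 1 and 2 of Table 1.
case_1 = [6, 4, 7, 3, 2, 3]
case_2 = [2, 3, 1, 4, 5, 4]

# Squared differences are 16, 1, 36, 1, 9, 1, which sum to 64.
print(euclidean_distance(case_1, case_2))  # 8.0
```

The same function applies to any pair of objects measured on the same set of variables.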


Methods of Cluster Analysis

Interval-scaled variables are generally best suited to cluster analysis. Ratio-scaled variables can also be used.

Standardisation of variables is necessary if the units of measurement of the variables differ widely.

First, an initial clustering solution is obtained using a hierarchical procedure, such as average linkage or Ward's. The number of clusters so obtained is then used as input to a nonhierarchical procedure, such as the optimizing partitioning method.
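To illustrate why standardisation matters, here is a minimal sketch that converts two variables measured in very different units to z-scores. The respondents, the variable names and the use of the sample standard deviation are assumptions made for the example, not values from these slides.

```python
import statistics

def standardise(values):
    """Convert raw values to z-scores: (value - mean) / standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation (n - 1), an assumption
    return [(v - mean) / sd for v in values]

# Hypothetical respondents: age in years and monthly income in rupees.
age = [25, 35, 45, 55]
income = [20000, 40000, 60000, 80000]

z_age = standardise(age)
z_income = standardise(income)

# After standardisation both variables are on the same unit-free scale,
# so income no longer dominates a distance computed across both variables.
print([round(z, 3) for z in z_age])
print([round(z, 3) for z in z_income])
```

Without this step, a difference of a few thousand rupees would swamp any difference in age when distances are computed.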


Steps in Analysis

Run the hierarchical clustering programme on the variables (after standardisation if necessary).

Generate the output called the agglomeration schedule.

It shows all possible solutions from 1 to n-1 clusters, where n is the number of respondents or objects.

Going up from the bottom of the agglomeration schedule, look at the column called Coefficient to decide on the number of clusters.

In this column, starting from the bottom, calculate the difference in the value of the coefficient between neighbouring rows.

If the maximum value of this difference occurs, say, between the second and third rows from the bottom, it indicates there might be 3 clusters, because three clusters remain just before the merge that causes the jump.

The dendrogram and icicle plot can also be requested and will give essentially the same information in graphical form.


Steps in Analysis

Once the number of clusters has been identified, a k-means clustering approach can be run on the data.

The number of clusters obtained in the first stage is used as input, and the output is generated.

This output gives the initial and final cluster centres for each variable.

The final cluster centres are the best solution.

These are used to interpret the average value of each variable for a cluster and thereby describe the clusters.


Example

Clustering of consumers based on attitude towards shopping.

Six attitudinal variables were identified.

Consumers were asked to express their degree of agreement with the following statements on a 7-point scale (1 = disagree, 7 = agree):

V1 = Shopping is fun

V2 = Shopping is bad for your budget

V3 = I combine shopping with eating out

V4 = I get best buys when shopping

V5 = I do not care about shopping

V6 = You can save a lot of money by comparing prices

Data obtained from 20 respondents are shown in Table 1.

In practice the sample size should be much larger.


Attitudinal Data For Clustering

Case No.  V1  V2  V3  V4  V5  V6
1         6   4   7   3   2   3
2         2   3   1   4   5   4
3         7   2   6   4   1   3
4         4   6   4   5   3   6
5         1   3   2   2   6   4
6         6   4   6   3   3   4
7         5   3   6   3   3   4
8         7   3   7   4   1   4
9         2   4   3   3   6   3
10        3   5   3   6   4   6
11        1   3   2   3   5   3
12        5   4   5   4   2   4
13        2   2   1   5   4   4
14        4   6   4   6   4   7
15        6   5   4   2   1   4
16        3   5   4   6   4   7
17        4   4   7   2   2   5
18        3   7   2   6   4   3
19        4   6   3   7   2   7
20        2   3   2   4   7   2

Table 1


Results of Hierarchical Clustering

Agglomeration Schedule Using Ward's Procedure

       Clusters combined                  Stage cluster first appears
Stage  Cluster 1  Cluster 2  Coefficient  Cluster 1  Cluster 2  Next stage
1      14         16         1.000000     0          0          6
2      6          7          2.000000     0          0          7
3      2          13         3.500000     0          0          15
4      5          11         5.000000     0          0          11
5      3          8          6.500000     0          0          16
6      10         14         8.160000     0          1          9
7      6          12         10.166667    2          0          10
8      9          20         13.000000    0          0          11
9      4          10         15.583000    0          6          12
10     1          6          18.500000    6          7          13
11     5          9          23.000000    4          8          15
12     4          19         27.750000    9          0          17
13     1          17         33.100000    10         0          14
14     1          15         41.333000    13         0          16
15     2          5          51.833000    3          11         18
16     1          3          64.500000    14         5          19
17     4          18         79.667000    12         0          18
18     2          4          172.662000   15         17         19
19     1          2          328.600000   16         18         0




Cluster Membership of Cases Using Ward's Procedure

            Number of Clusters
Label case  4   3   2
1           1   1   1
2           2   2   2
3           1   1   1
4           3   3   2
5           2   2   2
6           1   1   1
7           1   1   1
8           1   1   1
9           2   2   2
10          3   3   2
11          2   2   2
12          1   1   1
13          2   2   2
14          3   3   2
15          1   1   1
16          3   3   2
17          1   1   1
18          4   3   2
19          3   3   2
20          2   2   2
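These memberships follow mechanically from the agglomeration schedule: replaying the first n - k merges leaves k clusters. A minimal sketch, assuming the "Clusters combined" pairs of the Ward's schedule; `clusters_after` is a hypothetical helper, and it returns groups of case numbers rather than the cluster labels used in the table, so only the groupings are comparable.

```python
# "Clusters combined" columns of the Ward's agglomeration schedule, stages 1-19.
MERGES = [
    (14, 16), (6, 7), (2, 13), (5, 11), (3, 8), (10, 14), (6, 12),
    (9, 20), (4, 10), (1, 6), (5, 9), (4, 19), (1, 17), (1, 15),
    (2, 5), (1, 3), (4, 18), (2, 4), (1, 2),
]

def clusters_after(merges, n_cases, n_clusters):
    """Replay the first (n_cases - n_clusters) merges and return the groups."""
    groups = {case: {case} for case in range(1, n_cases + 1)}
    for a, b in merges[: n_cases - n_clusters]:
        # In this schedule the first cluster of each pair keeps its label
        # and absorbs the second.
        groups[a] |= groups.pop(b)
    return sorted(sorted(g) for g in groups.values())

for k in (4, 3, 2):
    print(k, clusters_after(MERGES, 20, k))
```

For k = 4 the replay leaves case 18 in a cluster of its own, which is the single-case cluster visible in the 4-cluster column.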


Vertical Icicle Plot Using Ward's Method (figure)

Dendrogram Using Ward's Method (figure)

Interpretation

Theoretical or practical considerations may suggest a certain number of clusters.

If the purpose is market segmentation, management may want a particular number of clusters.

The value in the Coefficient column suddenly more than doubles between stage 17 (3 clusters) and stage 18 (2 clusters).

Likewise, at the last 2 stages of the dendrogram, clusters are being combined at large distances.

It appears that a three-cluster solution is appropriate.
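The "more than doubles" observation can be checked by dividing each coefficient by the one at the preceding stage. A minimal sketch using the Coefficient column of the Ward's agglomeration schedule; the variable names are illustrative, and using the ratio rather than the raw difference is a choice made here to mirror the wording of this slide.

```python
# Coefficient column of the Ward's agglomeration schedule, stages 1-19.
COEFFICIENTS = [
    1.0, 2.0, 3.5, 5.0, 6.5, 8.16, 10.166667, 13.0, 15.583, 18.5,
    23.0, 27.75, 33.1, 41.333, 51.833, 64.5, 79.667, 172.662, 328.6,
]
N_CASES = 20

# Ratio of each stage's coefficient to the previous stage's coefficient.
ratios = {
    stage: COEFFICIENTS[stage - 1] / COEFFICIENTS[stage - 2]
    for stage in range(2, len(COEFFICIENTS) + 1)
}

# The stage whose merge raises the coefficient the most in relative terms.
costly_stage = max(ratios, key=ratios.get)

# Stopping just before that merge leaves this many clusters.
suggested_clusters = N_CASES - (costly_stage - 1)

print(costly_stage, suggested_clusters)  # 18 3
```

Stage 18 is the only stage at which the coefficient more than doubles, so stopping after stage 17 gives the three-cluster solution.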



Interpretation

A 3-cluster solution results in clusters with 8, 6 and 6 respondents.

A 4-cluster solution has 8, 6, 5 and 1 respondents.

It is not meaningful to have a cluster with only one case, so a 3-cluster solution is preferable.

Interpreting and profiling clusters involves examining the cluster centroids.


Cluster Centroids

             Means of Variables
Cluster No.  V1     V2     V3     V4     V5     V6
1            5.750  3.625  6.000  3.125  1.875  3.875
2            1.667  3.000  1.833  3.500  5.500  3.333
3            3.500  5.833  3.333  6.000  3.500  6.000


Interpretation

Cluster 1 is high on
V1: Shopping is fun
V3: Combine shopping with eating out
and low on
V5: Do not care about shopping
FUN-LOVING & CONCERNED SHOPPERS (1, 3, 6, 7, 8, 12, 15, 17)

Cluster 2 is high on
V5: Do not care about shopping
and low on
V1: Shopping is fun
V3: Combine shopping with eating out
APATHETIC SHOPPERS (2, 5, 9, 11, 13, 20)

Cluster 3 is high on
V2: Shopping upsets budget
V4: Try to get best buys
V6: Can save a lot of money by comparing prices
ECONOMICAL SHOPPERS (4, 10, 14, 16, 18, 19)


Interpretation

Further profiling can be done on the basis of variables not used for clustering.

Demographic, psychographic, product usage and media usage variables can be used to target marketing efforts for each cluster.

The variables that significantly differentiate between clusters can be identified through discriminant analysis.


Results of Nonhierarchical Clustering

Initial Cluster Centers

      Cluster
      1   2   3
V1    4   2   7
V2    6   3   2
V3    3   2   6
V4    7   4   4
V5    2   7   1
V6    7   2   3

Iteration History(a)

           Change in Cluster Centers
Iteration  1      2      3
1          2.154  2.102  2.550
2          0.000  0.000  0.000

a. Convergence achieved due to no or small distance change. The maximum distance by which any center has changed is 0.000. The current iteration is 2. The minimum distance between initial centers is 7.746.


Cluster Membership

Case Number  Cluster  Distance
1            3        1.414
2            2        1.323
3            3        2.550
4            1        1.404
5            2        1.848
6            3        1.225
7            3        1.500
8            3        2.121
9            2        1.756
10           1        1.143
11           2        1.041
12           3        1.581
13           2        2.598
14           1        1.404
15           3        2.828
16           1        1.624
17           3        2.598
18           1        3.555
19           1        2.154
20           2        2.102


Final Cluster Centers

      Cluster
      1   2   3
V1    4   2   6
V2    6   3   4
V3    3   2   6
V4    6   4   3
V5    4   6   2
V6    6   3   4

Distances between Final Cluster Centers

Cluster  1      2      3
1               5.568  5.698
2        5.568         6.928
3        5.698  6.928
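The nonhierarchical results can be reproduced with a few lines of plain Python. This is a minimal sketch of k-means (assign each case to its nearest centre, then recompute each centre as the mean of its cases), assuming the Table 1 data and the initial cluster centres listed earlier, which coincide with cases 19, 20 and 3. It is not the SPSS implementation, and the function names are illustrative.

```python
import math

# Table 1: ratings on V1..V6 for cases 1..20.
DATA = [
    [6, 4, 7, 3, 2, 3], [2, 3, 1, 4, 5, 4], [7, 2, 6, 4, 1, 3],
    [4, 6, 4, 5, 3, 6], [1, 3, 2, 2, 6, 4], [6, 4, 6, 3, 3, 4],
    [5, 3, 6, 3, 3, 4], [7, 3, 7, 4, 1, 4], [2, 4, 3, 3, 6, 3],
    [3, 5, 3, 6, 4, 6], [1, 3, 2, 3, 5, 3], [5, 4, 5, 4, 2, 4],
    [2, 2, 1, 5, 4, 4], [4, 6, 4, 6, 4, 7], [6, 5, 4, 2, 1, 4],
    [3, 5, 4, 6, 4, 7], [4, 4, 7, 2, 2, 5], [3, 7, 2, 6, 4, 3],
    [4, 6, 3, 7, 2, 7], [2, 3, 2, 4, 7, 2],
]
# Initial centres of clusters 1, 2 and 3 (V1..V6).
INITIAL = [[4, 6, 3, 7, 2, 7], [2, 3, 2, 4, 7, 2], [7, 2, 6, 4, 1, 3]]

def distance(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest(case, centres):
    """Index of the centre closest to the case."""
    return min(range(len(centres)), key=lambda j: distance(case, centres[j]))

def k_means(data, centres, max_iter=10):
    """Alternate assignment and centre updates until no centre moves.

    Sketch only: it assumes no cluster ever becomes empty.
    """
    for _ in range(max_iter):
        labels = [nearest(case, centres) for case in data]
        new_centres = []
        for j in range(len(centres)):
            members = [case for case, lab in zip(data, labels) if lab == j]
            # Mean of each variable over the cluster's members.
            new_centres.append([sum(col) / len(members) for col in zip(*members)])
        if new_centres == centres:  # no centre moved: converged
            break
        centres = new_centres
    return labels, centres

labels, centres = k_means(DATA, INITIAL)
membership = [lab + 1 for lab in labels]  # 1-based cluster numbers
print(membership)
print(round(distance(centres[0], centres[1]), 3))  # 5.568
```

With these seeds the memberships settle after the first pass, and the second pass only confirms that no centre moves.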



Clustering Variables

In this instance, the units used for analysis are the variables, and the distance measures are computed for all pairs of variables.

Hierarchical clustering of variables can aid in the identification of unique variables, or variables that make a unique contribution to the data.

Clustering can also be used to reduce the number of variables. Associated with each cluster is a linear combination of the variables in the cluster, called the cluster component. A large set of variables can often be replaced by the set of cluster components with little loss of information. However, a given number of cluster components does not generally explain as much variance as the same number of principal components.


SPSS Windows

To select these procedures using SPSS for Windows, click:

Analyze>Classify>Hierarchical Cluster

Analyze>Classify>K-Means Cluster

Difference Between Cluster Analysis & Discriminant Analysis

Cluster analysis is a class of techniques used to classify objects or cases into relatively homogeneous groups called clusters. Objects in each cluster tend to be similar to each other and dissimilar to objects in the other clusters. Cluster analysis is also called classification analysis, or numerical taxonomy.

Both cluster analysis and discriminant analysis are concerned with classification. However, discriminant analysis requires prior knowledge of the cluster or group membership for each object or case included, to develop the classification rule. In contrast, in cluster analysis there is no a priori information about the group or cluster membership for any of the objects. Groups or clusters are suggested by the data, not defined a priori.


An Ideal Clustering Situation (figure; axes: Variable 1 and Variable 2)

A Practical Clustering Situation (figure; axes: Variable 1 and Variable 2)


Statistics Associated with Cluster Analysis

Agglomeration schedule. An agglomeration schedule gives information on the objects or cases being combined at each stage of a hierarchical clustering process.

Cluster centroid. The cluster centroid is the mean values of the variables for all the cases or objects in a particular cluster.

Cluster centers. The cluster centers are the initial starting points in nonhierarchical clustering. Clusters are built around these centers, or seeds.

Cluster membership. Cluster membership indicates the cluster to which each object or case belongs.


Dendrogram. A dendrogram, or tree graph, is a graphical device for displaying clustering results. Vertical lines represent clusters that are joined together. The position of the line on the scale indicates the distances at which clusters were joined. The dendrogram is read from left to right. Fig 2 is a dendrogram.

Distances between cluster centers. These distances indicate how separated the individual pairs of clusters are. Clusters that are widely separated are distinct, and therefore desirable.


Icicle diagram. An icicle diagram is a graphical display of clustering results, so called because it resembles a row of icicles hanging from the eaves of a house. The columns correspond to the objects being clustered, and the rows correspond to the number of clusters. An icicle diagram is read from bottom to top. Figure 20.7 is an icicle diagram.

Similarity/distance coefficient matrix. A similarity/distance coefficient matrix is a lower-triangle matrix containing pairwise distances between objects or cases.


Conducting Cluster Analysis

1. Formulate the Problem

2. Select a Distance Measure

3. Select a Clustering Procedure

4. Decide on the Number of Clusters

5. Interpret and Profile Clusters

6. Assess the Validity of Clustering



Formulate the Problem

Perhaps the most important part of formulating the clustering problem is selecting the variables on which the clustering is based.

Inclusion of even one or two irrelevant variables may distort an otherwise useful clustering solution.

Basically, the set of variables selected should describe the similarity between objects in terms that are relevant to the marketing research problem.

The variables should be selected based on past research, theory, or a consideration of the hypotheses being tested. In exploratory research, the researcher should exercise judgment and intuition.


A Classification of Clustering Procedures

Clustering Procedures
    Hierarchical
        Agglomerative
            Linkage Methods
                Single
                Complete
                Average
            Variance Methods
                Ward's Method
            Centroid Methods
        Divisive
    Nonhierarchical
        Sequential Threshold
        Parallel Threshold
        Optimizing Partitioning


Conducting Cluster Analysis: Select a Clustering Procedure (Hierarchical)

Hierarchical clustering is characterized by the development of a hierarchy or tree-like structure. Hierarchical methods can be agglomerative or divisive.

Agglomerative clustering starts with each object in a separate cluster. Clusters are formed by grouping objects into bigger and bigger clusters. This process is continued until all objects are members of a single cluster.

Divisive clustering starts with all the objects grouped in a single cluster. Clusters are divided or split until each object is in a separate cluster.

Agglomerative methods are commonly used in marketing research. They consist of linkage methods, error sums of squares or variance methods, and centroid methods.


Conducting Cluster Analysis: Select a Clustering Procedure (Linkage Method)

The single linkage method is based on minimum distance, or the nearest neighbor rule. At every stage, the distance between two clusters is the distance between their two closest points.

The complete linkage method is similar to single linkage, except that it is based on the maximum distance or the furthest neighbor approach. In complete linkage, the distance between two clusters is calculated as the distance between their two furthest points.

The average linkage method works similarly. However, in this method, the distance between two clusters is defined as the average of the distances between all pairs of objects, where one member of the pair is from each of the clusters.
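The three linkage rules differ only in how they summarise the distances between members of the two clusters. A minimal sketch, assuming Euclidean distance and two small hypothetical clusters of two-dimensional points; the function names are illustrative.

```python
import math
from itertools import product

def distance(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def pairwise(cluster_a, cluster_b):
    """All distances between one member of each cluster."""
    return [distance(a, b) for a, b in product(cluster_a, cluster_b)]

def single_linkage(cluster_a, cluster_b):
    return min(pairwise(cluster_a, cluster_b))  # two closest points

def complete_linkage(cluster_a, cluster_b):
    return max(pairwise(cluster_a, cluster_b))  # two furthest points

def average_linkage(cluster_a, cluster_b):
    d = pairwise(cluster_a, cluster_b)
    return sum(d) / len(d)  # mean over all cross-cluster pairs

# Two hypothetical clusters of points in two dimensions.
cluster_1 = [(0, 0), (0, 1)]
cluster_2 = [(3, 0), (4, 0)]

print(single_linkage(cluster_1, cluster_2))
print(complete_linkage(cluster_1, cluster_2))
print(average_linkage(cluster_1, cluster_2))
```

For any pair of clusters the average linkage distance lies between the single and complete linkage distances.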


Linkage Methods of Clustering (figure: single linkage, minimum distance; complete linkage, maximum distance; average linkage, average distance; each drawn between Cluster 1 and Cluster 2)



Conducting Cluster Analysis: Select a Clustering Procedure (Variance Method)

The variance methods attempt to generate clusters to minimize the within-cluster variance.

A commonly used variance method is Ward's procedure. For each cluster, the means for all the variables are computed. Then, for each object, the squared Euclidean distance to the cluster means is calculated. These distances are summed for all the objects. At each stage, the two clusters with the smallest increase in the overall sum of squares within cluster distances are combined.

In the centroid methods, the distance between two clusters is the distance between their centroids (means for all the variables). Every time objects are grouped, a new centroid is computed.

Of the hierarchical methods, average linkage and Ward's methods have been shown to perform better than the other procedures.
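The increase in the within-cluster sum of squares caused by a merge has a closed form: for clusters of sizes n_a and n_b it equals n_a n_b / (n_a + n_b) times the squared Euclidean distance between their centroids. A minimal sketch using cases from Table 1; `centroid` and `ward_increase` are illustrative names, not functions from any package.

```python
def centroid(cluster):
    """Mean of each variable over the cases in the cluster."""
    return [sum(col) / len(cluster) for col in zip(*cluster)]

def ward_increase(cluster_a, cluster_b):
    """Increase in the within-cluster sum of squares if the two clusters merge."""
    ca, cb = centroid(cluster_a), centroid(cluster_b)
    sq_dist = sum((x - y) ** 2 for x, y in zip(ca, cb))
    na, nb = len(cluster_a), len(cluster_b)
    return na * nb / (na + nb) * sq_dist

# Cases from Table 1 (V1..V6).
case_14 = [4, 6, 4, 6, 4, 7]
case_16 = [3, 5, 4, 6, 4, 7]
case_2 = [2, 3, 1, 4, 5, 4]
case_13 = [2, 2, 1, 5, 4, 4]

print(ward_increase([case_14], [case_16]))  # 1.0
print(ward_increase([case_2], [case_13]))   # 1.5
```

These two values agree with the Ward's agglomeration schedule for the example, where the coefficient is 1.0 at stage 1 (cases 14 and 16) and rises by 1.5, from 2.0 to 3.5, at stage 3 (cases 2 and 13), which suggests the Coefficient column accumulates these increases.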


Other Agglomerative Clustering Methods (figure: Ward's procedure and centroid method)

Select a Clustering Procedure

It has been suggested that the hierarchical and nonhierarchical methods be used in tandem. First, an initial clustering solution is obtained using a hierarchical procedure, such as average linkage or Ward's. The number of clusters and cluster centroids so obtained are used as inputs to the optimizing partitioning method.

Choice of a clustering method and choice of a distance measure are interrelated. For example, squared Euclidean distances should be used with the Ward's and centroid methods. Several nonhierarchical procedures also use squared Euclidean distances.



Decide on the Number of Clusters

Theoretical, conceptual, or practical considerations may suggest a certain number of clusters.

In hierarchical clustering, the distances at which clusters are combined can be used as criteria. This information can be obtained from the agglomeration schedule or from the dendrogram.

In nonhierarchical clustering, the ratio of total within-group variance to between-group variance can be plotted against the number of clusters. The point at which an elbow or a sharp bend occurs indicates an appropriate number of clusters.

The relative sizes of the clusters should be meaningful.


Interpreting and Profiling the Clusters

Interpreting and profiling clusters involves examining the cluster centroids. The centroids enable us to describe each cluster by assigning it a name or label.

It is often helpful to profile the clusters in terms of variables that were not used for clustering. These may include demographic, psychographic, product usage, media usage, or other variables.



Assess Reliability and Validity

1. Perform cluster analysis on the same data using different distance measures. Compare the results across measures to determine the stability of the solutions.

2. Use different methods of clustering and compare the results.

3. Split the data randomly into halves. Perform clustering separately on each half. Compare cluster centroids across the two subsamples.

4. Delete variables randomly. Perform clustering based on the reduced set of variables. Compare the results with those obtained by clustering based on the entire set of variables.

5. In nonhierarchical clustering, the solution may depend on the order of cases in the data set. Make multiple runs using different orders of cases until the solution stabilizes.