Multivariate Analysis


Transcript of Multivariate Analysis

  • ARVIND BANGER

    ASSISTANT PROFESSOR

    DEPARTMENT OF MANAGEMENT

    FACULTY OF SOCIAL SCIENCES

    DEI


  • MULTIVARIATE ANALYSIS TECHNIQUE

    It is used to simultaneously analyze more than two variables on a sample of observations.

    Objective: to represent a collection of massive data in a simplified way, i.e. to transform a mass of observations into a smaller number of composite scores that reflect as much as possible of the information contained in the raw data collected for the research study.

  • All multivariate methods (classification chart)

    Are some of the variables dependent on the others?

    YES - Dependence methods:
      How many variables are dependent?
        One dependent variable, metric: multiple regression
        One dependent variable, non-metric: multiple discriminant analysis
        Several dependent variables, metric: multivariate analysis of variance (MAV/MANOVA)
        Several dependent variables, non-metric: canonical analysis

    NO - Interdependence methods:
      Are all the inputs metric?
        YES: factor analysis, cluster analysis, metric multidimensional scaling (MDS)
        NO: non-metric MDS, latent structure analysis

  • VARIABLES IN MULTIVARIATE ANALYSIS

    Explanatory Variable & Criterion Variable

    If X may be considered to be the cause of Y, then X is described as the explanatory variable and Y as the criterion variable. In some cases both the explanatory and the criterion variable may consist of a set of many variables, in which case the set (X1, X2, X3, ..., Xp) may be called the set of explanatory variables and the set (Y1, Y2, Y3, ..., Yp) the set of criterion variables, if the variation of the former may be supposed to cause the variation of the latter as a whole.

  • OBSERVABLE VARIABLES & LATENT VARIABLES

    The explanatory variables described above are supposed to be directly observable in some situations; if so, they are termed observable variables. However, there are some unobservable variables which may also influence the criterion variables; such unobservable variables are called latent variables.

    DISCRETE VARIABLE & CONTINUOUS VARIABLE

    A discrete variable is one which, when measured, can take only integer values, whereas a continuous variable is one which, when measured, can assume any real value.

  • DUMMY VARIABLE

    This term is used in a technical sense and is useful in algebraic manipulations in the context of multivariate analysis. We call Xi (i = 1, ..., m) dummy variables if exactly one of the Xi equals 1 and all the others are zero.
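For instance, membership in one of m categories can be represented by m dummy variables of which exactly one is 1. A minimal sketch (the category labels below are made up for illustration):

```python
# Dummy (one-hot) coding: for m categories, exactly one of the m indicator
# variables X1..Xm equals 1 and the rest are 0.
categories = ["low income", "middle income", "high income"]  # hypothetical labels

def dummy_code(value, categories):
    """Return [X1, ..., Xm] with a single 1 marking the observed category."""
    return [1 if value == c else 0 for c in categories]

print(dummy_code("middle income", categories))  # -> [0, 1, 0]
```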

  • IMPORTANT MULTIVARIATE TECHNIQUES

  • MULTIPLE DISCRIMINANT ANALYSIS

    Through this method we classify individuals or objects into two or more mutually exclusive and exhaustive groups on the basis of a set of independent variables.

    It is used when there is a single dependent variable which is non-metric, e.g. brand preference, which may depend on an individual's age, education, income, etc.

  • Contd.: e.g. if an individual is 20 years old, has an income of Rs 12,000 and 10 years of formal education, and b1, b2, b3 are the weights given to these independent variables, then his discriminant score would be

    Z = b1(20) + b2(12000) + b3(10)
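A minimal sketch of this score computation follows; the weights below are made-up numbers purely for illustration (in practice they are estimated from data so that the groups are maximally separated):

```python
# Discriminant score Z = b1*age + b2*income + b3*education for one individual.
# The weights are hypothetical; a real analysis estimates them from the sample.
b1, b2, b3 = 0.05, 0.0001, 0.12          # assumed discriminant weights
age, income, education = 20, 12000, 10   # the individual from the example

Z = b1 * age + b2 * income + b3 * education
print(round(Z, 2))   # the individual is classified by comparing Z with a critical cut-off score
```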

  • FACTOR ANALYSIS

    Applicable when there is systematic interdependence among a set of observed variables and the researcher wants to find out something more fundamental/latent that creates this commonality.

    E.g. observed variables: income, education, occupation, dwelling area; latent factor: social class.

  • i.e. a large set of measured variables is resolved into a few categories called factors


  • MATHEMATICAL BASIS OF FACTOR ANALYSIS

    SCORE MATRIX (objects x measures)

                      Measures (variables)
                      a     b     c    ...   k
    Objects    1      a1    b1    c1   ...   k1
               2      a2    b2    c2   ...   k2
               3      a3    b3    c3   ...   k3
               .      .     .     .    ...   .
               N      aN    bN    cN   ...   kN

  • FEW BASIC TERMS

    FACTOR: an underlying dimension that accounts for several observed variables.

    FACTOR LOADING: the values which explain how closely the variables are related to each of the factors discovered. The absolute size of a loading helps in interpreting the factor.

    COMMUNALITY (h2): shows how much of each variable is accounted for by the underlying factors taken together.

    h2 of the ith variable = (loading of the ith variable on factor A)2 + (loading of the ith variable on factor B)2 + ...

  • Contd.

    EIGEN VALUE: the sum of the squared values of the factor loadings relating to a factor. It indicates the relative importance of each factor in accounting for the particular set of variables being analyzed.

    TOTAL SUM OF SQUARES: when the eigen values of all the factors are totalled, the resulting value is termed the total sum of squares.

    FACTOR SCORES: these represent the degree to which each respondent gets high scores on the group of items that load high on each factor.
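These quantities are simple row and column operations on the factor-loading matrix. A minimal numpy sketch, using the two centroid factors worked out later in these slides as the loadings:

```python
import numpy as np

# Loadings of 8 variables on two factors (A, B), taken from the centroid-method
# illustration that follows in these slides.
loadings = np.array([
    [0.693,  0.563],
    [0.618,  0.577],
    [0.642, -0.539],
    [0.641, -0.602],
    [0.629,  0.558],
    [0.694, -0.630],
    [0.679, -0.518],
    [0.683,  0.593],
])

h2 = (loadings ** 2).sum(axis=1)           # communality: row sums of squared loadings
eigenvalues = (loadings ** 2).sum(axis=0)  # eigen value of each factor: column sums of squares

print(np.round(h2, 3))           # ~ [0.797 0.715 0.703 0.773 0.707 0.879 0.729 0.818]
print(np.round(eigenvalues, 2))  # ~ [3.49 2.63]; their total is the total sum of squares
print(np.round(eigenvalues / loadings.shape[0], 2))  # proportion of total variance ~ [0.44 0.33]
print(np.round(eigenvalues / h2.sum(), 2))           # proportion of common variance ~ [0.57 0.43]
```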

  • METHODS OF FACTOR ANALYSIS

    Centroid method

    Principal-components method

    Maximum likelihood method


  • CENTROID METHOD

    This method tends to maximize the sum of loadings.

    ILLUSTRATION: Given the following correlation matrix R, relating to 8 variables with unities in the diagonal spaces, work out the first and second centroid factors:

  • Correlation matrix R

    Variables    1      2      3      4      5      6      7      8
        1      1.000  0.709  0.204  0.081  0.626  0.113  0.155  0.774
        2      0.709  1.000  0.051  0.089  0.581  0.098  0.083  0.652
        3      0.204  0.051  1.000  0.671  0.123  0.689  0.582  0.072
        4      0.081  0.089  0.671  1.000  0.022  0.798  0.613  0.111
        5      0.626  0.581  0.123  0.022  1.000  0.047  0.201  0.724
        6      0.113  0.098  0.689  0.798  0.047  1.000  0.801  0.120
        7      0.155  0.083  0.582  0.613  0.201  0.801  1.000  0.152
        8      0.774  0.652  0.072  0.111  0.724  0.120  0.152  1.000

  • SOLUTION: As all the correlations in the matrix are positive (a positive manifold), the weights of the various variables are taken as +1, i.e. the variables are simply summed.

    a) The sum of the coefficients in each column of the correlation matrix is worked out.

    b) The sum of these column sums (T) is obtained.

    c) Each column sum obtained in (a) is divided by the square root of T obtained in (b) to get the centroid loadings. The full set of loadings so obtained constitutes the first centroid factor (say, A). A numpy sketch of these steps follows.
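A minimal numpy sketch of steps (a)-(c), using the correlation matrix R of the illustration:

```python
import numpy as np

# Correlation matrix R from the illustration (8 variables, unities on the diagonal).
R = np.array([
    [1.000, 0.709, 0.204, 0.081, 0.626, 0.113, 0.155, 0.774],
    [0.709, 1.000, 0.051, 0.089, 0.581, 0.098, 0.083, 0.652],
    [0.204, 0.051, 1.000, 0.671, 0.123, 0.689, 0.582, 0.072],
    [0.081, 0.089, 0.671, 1.000, 0.022, 0.798, 0.613, 0.111],
    [0.626, 0.581, 0.123, 0.022, 1.000, 0.047, 0.201, 0.724],
    [0.113, 0.098, 0.689, 0.798, 0.047, 1.000, 0.801, 0.120],
    [0.155, 0.083, 0.582, 0.613, 0.201, 0.801, 1.000, 0.152],
    [0.774, 0.652, 0.072, 0.111, 0.724, 0.120, 0.152, 1.000],
])

S = R.sum(axis=0)        # (a) column sums Si
T = S.sum()              # (b) sum of the column sums
A = S / np.sqrt(T)       # (c) first centroid factor loadings, A = Si / sqrt(T)

print(np.round(S, 3))                     # ~ [3.662 3.263 3.392 3.385 3.324 3.666 3.587 3.605]
print(round(T, 3), round(np.sqrt(T), 3))  # ~ 27.884  5.281
print(np.round(A, 3))                     # ~ [0.693 0.618 0.642 0.641 0.629 0.694 0.679 0.683]
```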

  • Correlation matrix R with column sums and the first centroid factor

    Variables    1      2      3      4      5      6      7      8
        1      1.000  0.709  0.204  0.081  0.626  0.113  0.155  0.774
        2      0.709  1.000  0.051  0.089  0.581  0.098  0.083  0.652
        3      0.204  0.051  1.000  0.671  0.123  0.689  0.582  0.072
        4      0.081  0.089  0.671  1.000  0.022  0.798  0.613  0.111
        5      0.626  0.581  0.123  0.022  1.000  0.047  0.201  0.724
        6      0.113  0.098  0.689  0.798  0.047  1.000  0.801  0.120
        7      0.155  0.083  0.582  0.613  0.201  0.801  1.000  0.152
        8      0.774  0.652  0.072  0.111  0.724  0.120  0.152  1.000

    Column sum (Si)      3.662  3.263  3.392  3.385  3.324  3.666  3.587  3.605

    Sum of column sums: T = 27.884, sqrt(T) = 5.281

    First centroid factor A = Si / sqrt(T):
                         0.693  0.618  0.642  0.641  0.629  0.694  0.679  0.683

  • We can also state the information as:-

    Variables Factor loadings concerning

    first centroid factor A

    1 0.693

    2 0.618

    3 0.642

    4 0.641

    5 0.629

    6 0.694

    7 0.679

    8 0.683

  • FINDING THE SECOND CENTROID FACTOR: the loadings on the first centroid factor are multiplied together for every possible pair of variables; the resulting matrix of cross products is named Q1.

  • First matrix of factor cross products (Q1)

    First centroid factor A
               0.693  0.618  0.642  0.641  0.629  0.694  0.679  0.683
    0.693      0.480  0.428  0.445  0.444  0.436  0.481  0.471  0.473
    0.618      0.428  0.382  0.397  0.396  0.389  0.429  0.420  0.422
    0.642      0.445  0.397  0.412  0.412  0.404  0.446  0.436  0.438
    0.641      0.444  0.396  0.412  0.411  0.403  0.445  0.435  0.438
    0.629      0.436  0.389  0.404  0.403  0.396  0.437  0.427  0.430
    0.694      0.481  0.429  0.446  0.445  0.437  0.482  0.471  0.474
    0.679      0.471  0.420  0.436  0.435  0.428  0.471  0.461  0.464
    0.683      0.473  0.422  0.438  0.438  0.430  0.474  0.464  0.466

  • Now Q1 is subtracted, element by element, from the original matrix R, resulting in the matrix of residual coefficients R1.

  • First matrix of residual coefficients (R1)

    Variables    1       2       3       4       5       6       7       8
        1       0.52    0.281  -0.24   -0.36    0.19   -0.37   -0.32    0.301
        2       0.281   0.618  -0.35   -0.31    0.192  -0.33   -0.34    0.23
        3      -0.24   -0.35    0.588   0.259  -0.28    0.243   0.146  -0.37
        4      -0.36   -0.31    0.259   0.589  -0.38    0.353   0.178  -0.33
        5       0.19    0.192  -0.28   -0.38    0.604  -0.39   -0.22    0.294
        6      -0.37   -0.33    0.243   0.353  -0.39    0.518   0.33   -0.35
        7      -0.32   -0.34    0.146   0.178  -0.23    0.33    0.539  -0.31
        8       0.301   0.23   -0.37   -0.33    0.294  -0.35   -0.31    0.534

  • Now, reflecting variables 3, 4, 6 and 7 (i.e. reversing their signs), we obtain the reflected matrix of residual coefficients R1 as given below (reflected variables are marked with an asterisk). The same method as before is then repeated on this matrix to obtain centroid factor B, the loadings of the reflected variables being given negative signs. A numpy sketch of this whole step follows.
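A minimal numpy sketch of the second-factor step (cross products Q1, residuals R1, reflection and the factor B loadings), reusing R and A from the earlier sketch:

```python
import numpy as np

# Continues the earlier sketch: R is the 8x8 correlation matrix, A the first centroid factor.
Q1 = np.outer(A, A)                 # matrix of factor cross products
R1 = R - Q1                         # first matrix of residual coefficients

signs = np.ones(8)
signs[[2, 3, 5, 6]] = -1            # reflect variables 3, 4, 6, 7 (0-based indices 2, 3, 5, 6)
R1_reflected = np.outer(signs, signs) * R1   # flips the sign of each reflected row and column

S2 = R1_reflected.sum(axis=0)       # column sums of the reflected residual matrix
B = signs * S2 / np.sqrt(S2.sum())  # second centroid factor; reflected variables get negative signs

print(np.round(B, 2))   # ~ [0.56 0.58 -0.54 -0.60 0.56 -0.63 -0.52 0.59], the slide loadings up to rounding
```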

  • First matrix of residual coefficients after reflection (R1)

    Variables    1       2       3*      4*      5       6*      7*      8
        1       0.52    0.281   0.241   0.363   0.19    0.368   0.316   0.301
        2       0.281   0.618   0.346   0.307   0.192   0.331   0.337   0.23
        3*      0.241   0.346   0.588   0.259   0.281   0.243   0.146   0.366
        4*      0.363   0.307   0.259   0.589   0.381   0.353   0.178   0.327
        5       0.19    0.192   0.281   0.381   0.604   0.39    0.217   0.294
        6*      0.368   0.331   0.243   0.353   0.39    0.518   0.33    0.354
        7*      0.316   0.337   0.146   0.178   0.226   0.33    0.539   0.312
        8       0.301   0.23    0.366   0.327   0.294   0.354   0.312   0.534

    Column sum (Si)     2.580   2.642   2.470   2.757   2.558   2.887   2.375   2.718

    Sum of column sums: T = 20.987, sqrt(T) = 4.581

  • We can now write the matrix of factor loadings as under:

    Variables    Centroid factor A    Centroid factor B
        1             0.693                0.563
        2             0.618                0.577
        3             0.642               -0.539
        4             0.641               -0.602
        5             0.629                0.558
        6             0.694               -0.630
        7             0.679               -0.518
        8             0.683                0.593

  • The factor loadings and the communality (h2) of each variable are then as follows:

    Variables    Centroid factor A    Centroid factor B    Communality (h2) = A2 + B2
        1             0.693                0.563                 0.797
        2             0.618                0.577                 0.715
        3             0.642               -0.539                 0.703
        4             0.641               -0.602                 0.773
        5             0.629                0.558                 0.707
        6             0.694               -0.630                 0.879
        7             0.679               -0.518                 0.729
        8             0.683                0.593                 0.818

  • Proportion of variance

                                      Centroid factor A    Centroid factor B    Total
    Eigen value                            3.490                2.631            6.121
    Proportion of total variance           0.44 [44%]           0.33 [33%]       0.77 [77%]
    Proportion of common variance          0.57 [57%]           0.43 [43%]       1.00 [100%]

  • PRINCIPAL-COMPONENTS METHOD

    This method seeks to maximize the sum of squared loadings of each factor; hence the factors obtained by this method explain more variance than the loadings obtained from any other method of factoring.

    Principal components are constructed as linear combinations of the given set of variables:

    p1 = a11 X1 + a12 X2 + ... + a1n Xn
    p2 = a21 X1 + a22 X2 + ... + a2n Xn
    ... and so on up to pn.

    The aij's are called loadings and are worked out in such a way that the principal components are uncorrelated (orthogonal) and the first principal component has the maximum possible variance.

  • ILLUSTRATION:

    Take the correlation matrix R for the 8 variables given earlier and compute:

    (i) the first two principal component factors;

    (ii) the communality for each variable on the basis of the said two component factors;

    (iii) the proportion of total variance as well as the proportion of common variance explained by each of the two component factors.

  • SOLUTION:

    As the correlation matrix is a positive manifold, we work out the first principal component factor as follows.

    The vector of column sums is referred to as Ua1 and, when it is normalized, we call it Va1.

    To normalize, square the column sums in Ua1 and then divide each element of Ua1 by the square root of the sum of those squares.

  • Correlation matrix R with column sums Ua1 and the normalized vector Va1

    Variables    1      2      3      4      5      6      7      8
        1      1.000  0.709  0.204  0.081  0.626  0.113  0.155  0.774
        2      0.709  1.000  0.051  0.089  0.581  0.098  0.083  0.652
        3      0.204  0.051  1.000  0.671  0.123  0.689  0.582  0.072
        4      0.081  0.089  0.671  1.000  0.022  0.798  0.613  0.111
        5      0.626  0.581  0.123  0.022  1.000  0.047  0.201  0.724
        6      0.113  0.098  0.689  0.798  0.047  1.000  0.801  0.120
        7      0.155  0.083  0.582  0.613  0.201  0.801  1.000  0.152
        8      0.774  0.652  0.072  0.111  0.724  0.120  0.152  1.000

    Column sum (Ua1)               3.662  3.263  3.392  3.385  3.324  3.666  3.587  3.605

    Va1 = Ua1 / normalizing factor
                                   0.371  0.331  0.344  0.343  0.337  0.372  0.363  0.365

  • Normalizing factor = sqrt{(3.662)2 + (3.263)2 + (3.392)2 + (3.385)2 + (3.324)2 + (3.666)2 + (3.587)2 + (3.605)2} = 9.868

    We now obtain Ua2 by multiplying R into Va1, row by row (i.e. Ua2 = R Va1), resulting in:

    Ua2 : [1.296, 1.143, 1.201, 1.201, 1.165, 1.308, 1.280, 1.275]

    Normalizing it, we obtain:

    Va2 : [0.371, 0.327, 0.344, 0.344, 0.334, 0.374, 0.366, 0.365]

    Va1 and Va2 are almost equal, i.e. convergence has occurred.

    Finally, we compute the loadings as follows:

  • Loadings on the first principal component: each element of the characteristic vector Va1 is multiplied by 1.868, the square root of the normalizing factor of Ua2 (that normalizing factor, about 3.49, is the first eigen value):

    Variable    Va1      x 1.868 =    Principal component I
        1      0.371                     0.69
        2      0.331                     0.62
        3      0.344                     0.64
        4      0.343                     0.64
        5      0.337                     0.63
        6      0.372                     0.70
        7      0.363                     0.68
        8      0.365                     0.68
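The iterative procedure described above is essentially power iteration on R. A minimal numpy sketch, reusing the matrix R defined in the centroid-method sketch; the loading of each variable on the first principal component is the converged vector scaled by the square root of its eigen value:

```python
import numpy as np

# Power iteration for the first principal component of R
# (R is the 8x8 correlation matrix defined in the earlier sketch).
v = R.sum(axis=0)              # Ua1: start from the column sums
v = v / np.linalg.norm(v)      # Va1: normalize to unit length

for _ in range(100):           # repeat U = R v and renormalize until convergence
    u = R @ v
    v_new = u / np.linalg.norm(u)
    if np.allclose(v_new, v, atol=1e-6):
        break
    v = v_new

eigenvalue = v @ R @ v                 # ~ 3.5 for this matrix
loadings_I = v * np.sqrt(eigenvalue)   # first principal component loadings
print(np.round(loadings_I, 2))         # close to the Va1 x 1.868 values tabulated above
```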

  • We now find principal component II (by the method followed earlier to obtain centroid factor B), which gives:

    Variables    Principal component II
        1           0.57
        2           0.59
        3          -0.52
        4          -0.59
        5           0.57
        6          -0.61
        7          -0.49
        8          -0.61

  • Principal components and communality

    Variables    Principal component I    Principal component II    Communality (h2) = I2 + II2
        1               0.69                     0.57                       0.801
        2               0.62                     0.59                       0.733
        3               0.64                    -0.52                       0.680
        4               0.64                    -0.59                       0.758
        5               0.63                     0.57                       0.722
        6               0.70                    -0.61                       0.862
        7               0.68                    -0.49                       0.703
        8               0.68                    -0.61                       0.835

    Eigen value                      3.4914          2.6007          6.0921
    Proportion of total variance     0.436 [43.6%]   0.325 [32.5%]   0.761 [76.1%]
    Proportion of common variance    0.573 [57%]     0.427 [43%]     1.00 [100%]

  • MAXIMUM LIKELIHOOD (ML) METHOD

    If Rs is the correlation matrix actually obtained from the sample data, and Rp is the correlation matrix that would be obtained if the entire population were tested, then the ML method seeks to extrapolate what is known from Rs in the best possible way in order to estimate Rp.

  • CLUSTER ANALYSIS

    Unlike techniques for analyzing relationships between variables, it attempts to reorganize a differentiated group of people, events or objects into homogeneous subgroups.

    STEPS (a numpy sketch of the clustering step is given after this list):

    Selection of the sample to be clustered (buyers, products, etc.).

    Definition of the variables on which to measure the objects, events, etc. (e.g. market segment characteristics, product competition definitions, etc.).

    Computation of similarities among the entities (through correlation, Euclidean distances and other techniques).

    Selection of mutually exclusive clusters (maximization of within-cluster similarity and between-cluster differences).

    Cluster comparison and validation.
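A minimal sketch of the clustering step using a plain k-means loop in numpy; the buyer data (age, income, family size) are made-up values for illustration, and a real study would use the variables chosen in the definition step:

```python
import numpy as np

# Hypothetical buyers measured on age, income (Rs '000) and family size.
X = np.array([
    [34, 55, 5], [38, 60, 6], [31, 48, 4], [36, 52, 5],   # minivan-like buyers
    [26, 90, 1], [29, 95, 2], [24, 85, 1], [28, 99, 2],   # sports-car-like buyers
], dtype=float)
X = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize so no variable dominates the distances

k = 2
centers = X[:k].copy()                     # arbitrary initial cluster centres

for _ in range(100):
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)   # Euclidean distances
    labels = d.argmin(axis=1)                           # assign each buyer to the nearest centre
    new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    if np.allclose(new_centers, centers):               # stop when the centres no longer move
        break
    centers = new_centers

print(labels)   # two homogeneous subgroups, e.g. [1 1 1 1 0 0 0 0] (cluster numbering is arbitrary)
```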

  • [Figure: cluster analysis used to segment the car-buying population into distinct markets; cluster A (minivan buyers) and cluster B (sports car buyers) plotted against income, age and family size.]

  • MULTIDIMENSIONAL SCALING (MDS)

    This creates a spatial description of a respondent's perception about a product or service and helps the business researcher to understand difficult-to-measure constructs such as product quality.

    Method: we may take three types of attribute space, each representing a multidimensional space.

    1. Objective space: objects positioned in terms of measurable attributes such as the object's weight, flavor and nutritional value.

    2. Subjective space: perceptions about the object's flavor, weight and nutritional value.

    3. Preference space: describes respondents' preferences in terms of the object's attributes (ideal point). All objects close to this ideal point are interpreted as preferred.

    Ideal points from many people can be positioned in this preference space to reveal the pattern and size of preference clusters. Thus cluster analysis and MDS can be combined to map market segments and then examine products designed for those segments.
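A minimal sketch of classical (metric) MDS in numpy: it takes a matrix of dissimilarities between objects and positions them in a low-dimensional space whose inter-point distances approximate those dissimilarities. The dissimilarity values below are made up for illustration:

```python
import numpy as np

# Hypothetical dissimilarities between 4 products (symmetric, zero diagonal).
D = np.array([
    [0.0, 2.0, 6.0, 7.0],
    [2.0, 0.0, 5.0, 6.0],
    [6.0, 5.0, 0.0, 1.5],
    [7.0, 6.0, 1.5, 0.0],
])

n = D.shape[0]
J = np.eye(n) - np.ones((n, n)) / n    # centering matrix
B = -0.5 * J @ (D ** 2) @ J            # double-centred squared dissimilarities

eigvals, eigvecs = np.linalg.eigh(B)   # eigh returns eigenvalues in ascending order
order = np.argsort(eigvals)[::-1][:2]  # keep the two largest
coords = eigvecs[:, order] * np.sqrt(np.maximum(eigvals[order], 0))

print(np.round(coords, 2))   # a 2-D map in which similar products land close together
```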

  • CONJOINT ANALYSIS

    Used in market research and product development.

    Takes non-metric, independent variables as input.

    E.g. considering the purchase decision for a computer: if we have 3 prices, 3 brands, 3 speeds, 2 levels of educational value, 2 categories of games and 2 categories of work assistance, then the model will have 3 x 3 x 3 x 2 x 2 x 2 = 216 decision levels.

    The objective of conjoint analysis is to secure utility scores that represent the importance of each aspect of the product in the buyer's overall preference rating.

    Utility scores are computed from the buyers' ratings of a set of cards; each card in the deck describes one possible configuration of combined product attributes.
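A minimal sketch of the combinatorics and of utility (part-worth) estimation by dummy-coded least squares; the attribute levels and the ratings below are hypothetical, and a real study would use a fractional design and respondents' actual card ratings:

```python
import itertools
import numpy as np

# Hypothetical attributes and levels for the computer example (3*3*3*2*2*2 = 216 profiles).
levels = {
    "price": ["low", "medium", "high"],
    "brand": ["A", "B", "C"],
    "speed": ["slow", "medium", "fast"],
    "educational_value": ["basic", "rich"],
    "games": ["few", "many"],
    "work_assistance": ["basic", "advanced"],
}
profiles = list(itertools.product(*levels.values()))
print(len(profiles))   # 216 possible product descriptions ("cards")

# Dummy-code each profile (the first level of each attribute is the baseline)...
def dummy_row(profile):
    row = [1.0]   # intercept
    for value, lvls in zip(profile, levels.values()):
        row += [1.0 if value == l else 0.0 for l in lvls[1:]]
    return row

X = np.array([dummy_row(p) for p in profiles])

# ...and, given a vector of preference ratings y (one per card), estimate the
# utility scores by ordinary least squares. Here y is random noise purely so
# that the sketch runs end to end.
y = np.random.default_rng(0).normal(size=len(profiles))
utilities, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(utilities, 2))   # intercept followed by a utility for each non-baseline level
```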

  • Steps followed in conjoint analysis

    Select the attributes most pertinent to the purchase decision (called factors).

    Find the possible values of each attribute (called factor levels).

    After the factors and their levels have been selected, software such as SPSS determines the number of product descriptions necessary to estimate the utilities; it also builds a file structure for all possible combinations, generates the subset required for testing, produces the card descriptions and analyses the results.

  • THANK YOU
