Multivariate Ana
7/29/2019 Multivariate Ana
MULTIVARIATE ANALYSIS
Statistical techniques that simultaneously analyze more than two variables.
Multivariate techniques fall into two categories:
1. Dependency techniques: deal with one or more dependent variables
   One dependent variable, metric data: Multiple Regression Analysis
   Several dependent variables, metric data: Discriminant Analysis
2. Interdependency techniques: more than two variables
   Variables are not segregated as dependent and independent variables
   Interrelationships between the variables are analyzed
   Metric data: Factor Analysis, Cluster Analysis
-
MULTIVARIATE ANALYSIS
Multiple Regression
Used to analyse quantitative data
To study the cause-and-effect relationship between a single dependent variable and two or more independent variables
Used mainly for prediction/forecasting
-
Multiple Regression Analysis
The general multiple regression with k independent variables is given by:
$$Y' = a + b_1 X_1 + b_2 X_2 + \dots + b_k X_k$$
X1 to Xk are the independent variables.
a is the Y-intercept.
bj is called a partial regression coefficient. It is the net change in Y for each unit change in Xj, holding all other variables constant (where j = 1 to k).
Greek letters are used for a (α) and b (β) when denoting population parameters.
-
ASSUMPTIONS IN MULTIPLE REGRESSION
The independent variables and the dependent variable have a linear relationship.
The dependent variable must be continuous and at least interval-scaled.
The variation in (Y - Y'), the residual, must be the same for all values of Y. When this is the case, we say the difference exhibits homoscedasticity.
The residuals should follow the normal distribution with mean 0.
Successive values of the dependent variable must be uncorrelated, that is, not autocorrelated.
-
Correlation Matrix
A correlation matrix is used to show all possible simple correlation coefficients among the variables.
The matrix is useful for locating correlated independent variables.
It shows how strongly each independent variable is correlated with the dependent variable.

Correlation Coefficients   Cars    Advertising   Sales force
Cars                       1.000
Advertising                0.808   1.000
Sales force                0.872   0.537         1.000
-
Multiple Regression Analysis
The least squares criterion is used to develop the regression equation.
Because determining b1, b2, etc. by hand is very tedious, a software package such as Excel or MINITAB may be used.
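As an illustration of the least squares criterion, here is a minimal Python sketch. It assumes NumPy rather than Excel or MINITAB, and the data are hypothetical: y is constructed exactly from known coefficients so that the fit can be checked by eye.

```python
import numpy as np

# Hypothetical data (not from the slides): y is built exactly as
# y = 2 + 3*x1 - 1*x2, so least squares should recover a=2, b1=3, b2=-1.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0])
y = 2.0 + 3.0 * x1 - 1.0 * x2

# Design matrix: a column of ones for the intercept a, then one column per X.
X = np.column_stack([np.ones_like(x1), x1, x2])

# np.linalg.lstsq chooses coef to minimise the sum of squared residuals
# ||y - X @ coef||^2, which is the least squares criterion.
coef, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
a, b1, b2 = coef
print(a, b1, b2)
```

The column of ones is what produces the intercept a; without it the fitted plane would be forced through the origin.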
-
ANOVA TABLE
Source       df       SS                       MS
Regression   k        SSR = Σ(Y' - Ȳ)²         SSR/k
Error        n-k-1    SSE = Σ(Y - Y')²         SSE/(n-k-1)
Total        n-1      SS Total = Σ(Y - Ȳ)²

SS Total is the total variation.
SSR is the explained variation: variation accounted for by the set of independent variables.
SSE is the unexplained or random variation: variation not accounted for by the independent variables.
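The split of total variation into explained and unexplained parts can be sketched in Python. This is only an illustration: the function name anova_table and the data are hypothetical, and it assumes k counts the independent variables, so the degrees of freedom are k, n-k-1 and n-1 (the pattern seen in the MINITAB output for Example 1).

```python
import numpy as np

def anova_table(y, y_hat, k):
    """Regression ANOVA quantities for n observations and k predictors."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    n = y.size
    y_bar = y.mean()
    ssr = np.sum((y_hat - y_bar) ** 2)   # explained variation
    sse = np.sum((y - y_hat) ** 2)       # unexplained (random) variation
    sst = np.sum((y - y_bar) ** 2)       # total variation
    return {
        "df_regression": k, "df_error": n - k - 1, "df_total": n - 1,
        "SSR": ssr, "SSE": sse, "SS_total": sst,
        "MSR": ssr / k, "MSE": sse / (n - k - 1),
    }

# Hypothetical data, for illustration only.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
y = np.array([3.1, 6.9, 7.2, 11.0, 10.8, 15.1])

# Fit by least squares with an intercept, then decompose the variation.
X = np.column_stack([np.ones_like(x1), x1, x2])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
table = anova_table(y, X @ coef, k=2)
print(table)
```

For a least squares fit that includes an intercept, SSR + SSE equals SS Total, and the regression and error degrees of freedom add up to the total.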
-
EXAMPLE 1
A market researcher for Super Markets is studying the yearly amount families of four or more spend on food. Three independent variables are thought to be related to yearly food expenditures (Food). Those variables are: total family income (Income) in $00 (hundreds of dollars), size of family (Size), and whether the family has children in college (College).
-
The variable College is called a dummy or indicator variable. It can take only one of two possible outcomes, i.e. a child is a college student or not.
Examples of dummy variables: gender, whether a part is acceptable or not, whether a voter will or will not vote for the incumbent governor, etc.
We usually code one value of the dummy variable as 1 and the other as 0.
Expenditure = a + b1*(Income) + b2*(Size) + b3*(College)
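A minimal Python sketch of the 0/1 coding follows. The survey answers are hypothetical and only show the mechanics.

```python
# Hypothetical survey answers, for illustration only.
has_college_student = ["no", "yes", "no", "yes"]

# Code one outcome as 1 and the other as 0.
college = [1 if answer == "yes" else 0 for answer in has_college_student]
print(college)  # [0, 1, 0, 1]
```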
-
Example 1 continued
Family   Food   Income   Size   Student
1        3900   376      4      0
2        5300   515      5      1
3        4300   516      4      0
4        4900   468      5      0
5        6400   538      6      1
6        7300   626      7      1
7        4900   543      5      0
8        5300   437      4      0
9        6100   608      5      1
10       6400   513      6      1
11       7400   493      6      1
12       5800   563      5      0
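The twelve families above can be fitted by least squares in a few lines of Python. This sketch assumes NumPy in place of MINITAB, so formatting and rounding will differ, but the coefficients should come out close to those in the MINITAB output reported for this example.

```python
import numpy as np

# Columns: Food ($ per year), Income ($00), Size, Student (1 = child in college).
data = np.array([
    [3900, 376, 4, 0],
    [5300, 515, 5, 1],
    [4300, 516, 4, 0],
    [4900, 468, 5, 0],
    [6400, 538, 6, 1],
    [7300, 626, 7, 1],
    [4900, 543, 5, 0],
    [5300, 437, 4, 0],
    [6100, 608, 5, 1],
    [6400, 513, 6, 1],
    [7400, 493, 6, 1],
    [5800, 563, 5, 0],
], dtype=float)

food = data[:, 0]
# Design matrix: intercept column followed by Income, Size, Student.
X = np.column_stack([np.ones(len(data)), data[:, 1:]])

# Least squares estimates of a, b1 (Income), b2 (Size), b3 (Student).
coef, *_ = np.linalg.lstsq(X, food, rcond=None)
a, b_income, b_size, b_student = coef
print(f"Food = {a:.0f} + {b_income:.3f} Income + {b_size:.1f} Size + {b_student:.1f} Student")
```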
-
Example 1 continued
From the analysis provided by MINITAB, the estimated multiple regression equation is:
Y' = 954 + 1.09X1 + 748X2 + 565X3
What food expenditure would you estimate for a family of 4, with no college student, and an income of $50,000 (which is input as 500)?
Food Expenditure = 954 + 1.09*income + 748*size + 565*college
-
Example 1 continued
Food Expend. = $954 + $1.09*income + $748*size + $565*college
Income is entered in hundreds of dollars, so each additional $100 of income per year will increase the amount spent on food by $1.09 per year.
An additional family member will increase the amount spent per year on food by $748.
A family with a college student will spend $565 more per year on food than those without a college student.
So a family of 4, with no college students, and an income of $50,000 will spend an estimated $4,491.
Food Expend. = $954 + $1.09*500 + $748*4 + $565*0 = $4,491
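The same arithmetic can be written as a small Python function. The name predict_food is hypothetical; the coefficients are the rounded ones from the equation above.

```python
def predict_food(income_hundreds, size, college):
    """Estimated yearly food expenditure ($) from the rounded fitted equation."""
    return 954 + 1.09 * income_hundreds + 748 * size + 565 * college

# Family of 4, no college student, income $50,000 (entered as 500).
estimate = predict_food(income_hundreds=500, size=4, college=0)
print(round(estimate))  # 4491
```

Changing college from 0 to 1 raises the estimate by exactly the dummy coefficient, $565.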
-
Example 1 continued
The regression equation is
Food = 954 + 1.09 Income + 748 Size + 565 Student
Predictor Coef SE Coef T P
Constant 954 1581 0.60 0.563
Income 1.092 3.153 0.35 0.738
Size 748.4 303.0 2.47 0.039
Student 564.5 495.1 1.14 0.287
S = 572.7 R-Sq = 80.4% R-Sq(adj) = 73.1%
Analysis of Variance
Source DF SS MS F P
Regression 3 10762903 3587634 10.94 0.003
Residual Error 8 2623764 327970
Total 11 13386667
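The summary figures in this output can be recomputed from the ANOVA sums of squares and the coefficient table. The Python sketch below takes the printed numbers as inputs; the variable names are illustrative, and small differences from the printed values are rounding.

```python
import math

n, k = 12, 3                          # 12 families, 3 independent variables
ssr, sse, sst = 10762903, 2623764, 13386667

msr = ssr / k                         # mean square, regression
mse = sse / (n - k - 1)               # mean square, residual error
r_sq = ssr / sst                      # coefficient of determination
r_sq_adj = 1 - mse / (sst / (n - 1))  # adjusted for degrees of freedom
f_stat = msr / mse                    # global F statistic
s = math.sqrt(mse)                    # standard error of estimate

# t ratio for a predictor = coefficient / standard error of the coefficient.
t_size = 748.4 / 303.0
t_student = 564.5 / 495.1

print(f"R-Sq = {r_sq:.1%}, R-Sq(adj) = {r_sq_adj:.1%}, F = {f_stat:.2f}, S = {s:.1f}")
print(f"t(Size) = {t_size:.2f}, t(Student) = {t_student:.2f}")
```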
-
Correlation matrix
The coefficient of determination is 80.4 percent. This means that more than 80 percent of the variation in the amount spent on food is accounted for by the variables income, family size, and student.
The strongest correlation between the dependent variable and an independent variable is between family size and amount spent on food.

          Food    Income   Size    College
Food      1.000
Income    0.587   1.000
Size      0.876   0.609    1.000
College   0.773   0.491    0.743   1.000
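A correlation matrix like this one can be reproduced from the Example 1 data. The Python sketch below assumes NumPy and uses np.corrcoef with each column treated as a variable.

```python
import numpy as np

# Columns: Food, Income ($00), Size, College (1 = child in college).
data = np.array([
    [3900, 376, 4, 0],
    [5300, 515, 5, 1],
    [4300, 516, 4, 0],
    [4900, 468, 5, 0],
    [6400, 538, 6, 1],
    [7300, 626, 7, 1],
    [4900, 543, 5, 0],
    [5300, 437, 4, 0],
    [6100, 608, 5, 1],
    [6400, 513, 6, 1],
    [7400, 493, 6, 1],
    [5800, 563, 5, 0],
], dtype=float)

# rowvar=False: each column is a variable, each row an observation.
corr = np.corrcoef(data, rowvar=False)
print(np.round(corr, 3))
```

The first column of the result holds the correlations with Food; the remaining off-diagonal entries show how strongly the independent variables are correlated with each other.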
-
Example 1 continued
Conduct an individual test to determine which coefficients are not zero. These are the hypotheses for the independent variable family size:
$$H_0: \beta_2 = 0 \qquad H_1: \beta_2 \neq 0$$
From the MINITAB output, the only significant variable is FAMILY (family size) using the p-values. The other variables can be omitted from the model.
Thus, using the 5% level of significance, reject H0 if the p-value