Lecture 7 guidelines_and_assignment
Introduction to Applied Statistics and Applied Statistical Methods Practical guidelines
Prof. Dr. Chang Zhu page 1
Table of Contents
LECTURE 7
  CHI-SQUARE TEST (CROSS-TAB)
    SPSS OUTPUT
    REPORTING THE RESULT
  DISCRIMINANT ANALYSIS
    SPSS OUTPUT
    REPORTING THE RESULT
  LOGISTIC REGRESSION
    SPSS OUTPUT
    REPORTING THE RESULT
ASSIGNMENT 7
LECTURE 7
CHI-SQUARE TEST (CROSS-TAB)
A group of students were classified in terms of personality (introvert or extrovert) and in terms of colour preference (red, yellow, green, or blue). Personality and colour preference are categorical variables. We want to answer this question:
Is there an association between personality and colour preference?
In SPSS, Analyze > Descriptive Statistics > Crosstab
Move the variable person to the Row(s) area and colour to the Column(s) area.
Click on the Statistics button and select the Chi-square option. Click Continue to proceed to the next step.
Click on the Cells button; under Counts, choose Observed and Expected. For Percentages, select Row, Column, and Total.
When finished, click Continue to proceed and OK to run the analysis.
SPSS OUTPUT
The personality type * favourite colour Crosstabulation table shows us the percentages at each level of the two variables.
The Chi-square Tests table gives us the significance of the test.
Chi-Square Tests
                               Value     df   Asymp. Sig. (2-sided)
Pearson Chi-Square             71.200a   3    .000
Likelihood Ratio               70.066    3    .000
Linear-by-Linear Association   69.124    1    .000
N of Valid Cases               400
REPORTING THE RESULT
We can write a conclusion like this:
There is a relationship between students' personality and colour preference: χ² (3, N = 400) = 71.20, p < .001.
personality type * favourite colour Crosstabulation
                                            red      yellow   green    blue     Total
introvert   Count                           20       6        30       44       100
            Expected Count                  50.0     10.0     20.0     20.0     100.0
            % within personality type       20.0%    6.0%     30.0%    44.0%    100.0%
            % within favourite colour       10.0%    15.0%    37.5%    55.0%    25.0%
            % of Total                      5.0%     1.5%     7.5%     11.0%    25.0%
extrovert   Count                           180      34       50       36       300
            Expected Count                  150.0    30.0     60.0     60.0     300.0
            % within personality type       60.0%    11.3%    16.7%    12.0%    100.0%
            % within favourite colour       90.0%    85.0%    62.5%    45.0%    75.0%
            % of Total                      45.0%    8.5%     12.5%    9.0%     75.0%
Total       Count                           200      40       80       80       400
            Expected Count                  200.0    40.0     80.0     80.0     400.0
            % within personality type       50.0%    10.0%    20.0%    20.0%    100.0%
            % within favourite colour       100.0%   100.0%   100.0%   100.0%   100.0%
            % of Total                      50.0%    10.0%    20.0%    20.0%    100.0%
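The chi-square statistic above can be reproduced by hand from the observed counts. The following is a Python sketch (not part of the SPSS workflow described here) that computes the expected counts and the Pearson chi-square for this crosstab:

```python
# Pearson chi-square for the personality * colour crosstab,
# computed from the observed counts reported above.
observed = {
    "introvert": [20, 6, 30, 44],   # red, yellow, green, blue
    "extrovert": [180, 34, 50, 36],
}

row_totals = {g: sum(v) for g, v in observed.items()}
col_totals = [sum(col) for col in zip(*observed.values())]
n = sum(row_totals.values())

# Expected count for each cell: (row total * column total) / N
chi_square = 0.0
for group, counts in observed.items():
    for obs, col_total in zip(counts, col_totals):
        expected = row_totals[group] * col_total / n
        chi_square += (obs - expected) ** 2 / expected

df = (len(observed) - 1) * (len(col_totals) - 1)
print(f"chi-square = {chi_square:.2f}, df = {df}")  # chi-square = 71.20, df = 3
```

This matches the Pearson Chi-Square value of 71.200 with df = 3 in the SPSS output.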
DISCRIMINANT ANALYSIS
A study is set up to determine whether the following variables help to discriminate between those who smoke and those who don't:
age
absence (days of absence last year)
selfcon (self-concept score)
anxiety (anxiety score)
anti_smoking (attitude towards anti-smoking policies)
In SPSS, Analyze > Classify > Discriminant
Move the categorical variable smoke into the Grouping Variable, and age, selfcon, anxiety, absence, and anti_smoking into the Independent area.
Click on Define Range to indicate the values that we have assigned for each group. In our case, 1 is for non-smokers and 2 is for smokers.
Click Continue to proceed to the next step.
Click on Statistics to access the Statistics dialog box.
Under Descriptive statistics, select Means, Univariate ANOVAs (which test whether the means differ between groups), and Box's M (which tests the homogeneity of the covariance matrices between groups; in this respect it is very similar to Levene's test).
Under the Function Coefficients, choose Unstandardized, then click Continue to proceed.
Click on Classify to access the Classification dialog box. Because the group sizes are unequal (the numbers of smokers and non-smokers differ), select Compute from group sizes for the Prior Probabilities.
Under Use Covariance Matrix, select Within-groups. The important options to select are Summary Tables (which shows what percentage of cases are correctly classified by the model) and Leave-one-out classification (cross-validation of the model).
It’s useful to select Separate-groups under the Plots as this will plot the variate score for each participant grouped according to whether they are classified as smokers or non-smokers. For more than 2 groups, the preferred option is Combined-groups.
Click Continue to proceed and select the Save option. We will choose Predicted group membership and Discriminant scores. This process will create 2 new variables in the data set.
Click Continue to proceed and OK to run the analysis.
SPSS OUTPUT
The first important table is obtained from the Univariate ANOVAs option; it indicates that the mean scores on the 5 predictors differ significantly between smokers and non-smokers. This is promising, i.e. we can use them as discriminants.
Tests of Equality of Group Means
                               Wilks' Lambda   F         df1   df2   Sig.
self concept score             .526            392.672   1     436   .000
anxiety score                  .666            218.439   1     436   .000
days absent last year          .931            32.109    1     436   .000
total anti-smoking test score  .887            55.295    1     436   .000
age                            .980            8.781     1     436   .003
Box's M tests the null hypothesis that the covariance matrices do not differ between groups, and we want this test to be non-significant. The output shows p < .001, which is highly significant. However, because our sample size is large (n = 438), this is not a serious problem.
Test Results
Box's M       176.474
F   Approx.   11.615
    df1       15
    df2       600825.345
    Sig.      .000
Tests null hypothesis of equal population covariance matrices.
Next, the important value to look at in the Eigenvalues table is the canonical correlation, which is the multiple correlation between the predictors and the discriminant function (Burns & Burns, 2008). Its square is therefore interpreted like R² in multiple regression. With R = .802, it can be concluded that the model explains 64.32% of the variation in the grouping variable (smokers or non-smokers).
Eigenvalues
Function   Eigenvalue   % of Variance   Cumulative %   Canonical Correlation
1          1.806a       100.0           100.0          .802
a. First 1 canonical discriminant functions were used in the analysis.
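As a sanity check (a Python sketch, not SPSS output), both the canonical correlation and Wilks' Lambda can be recovered from the eigenvalue: for a single discriminant function, R² = λ / (1 + λ) and Λ = 1 / (1 + λ).

```python
import math

eigenvalue = 1.806  # from the Eigenvalues table above

# Squared canonical correlation: proportion of variance explained
r_squared = eigenvalue / (1 + eigenvalue)
canonical_r = math.sqrt(r_squared)

# Wilks' Lambda: the unexplained proportion of variance
wilks_lambda = 1 / (1 + eigenvalue)

print(f"R = {canonical_r:.3f}")               # .802, matching the table
print(f"Wilks' Lambda = {wilks_lambda:.3f}")  # .356, matching the Wilks' Lambda table
```

Note that Λ + R² = 1: the explained and unexplained proportions add up to the total variance.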
Wilks' Lambda is a statistic that tests the significance of the discriminant function. Its value is simply the proportion of unexplained variance, so we can treat it as a goodness-of-fit statistic; this test should be significant to confirm the model.
Wilks' Lambda
Test of Function(s)   Wilks' Lambda   Chi-square   df   Sig.
1                     .356            447.227      5    .000
The standardized canonical discriminant function coefficients table gives us a reference with regard to the importance of each predictor. According to the result, self-concept and anxiety scores are the strongest predictors.
Standardized Canonical Discriminant Function Coefficients
                                Function 1
self concept score              .763
anxiety score                   -.614
days absent last year           -.073
total anti-smoking test score   .378
age                             .212
The coefficients in the Canonical Discriminant Function Coefficients table are used to create the discriminant function to predict group membership.
Canonical Discriminant Function Coefficients
                                Function 1
age                             .024
self concept score              .080
anxiety score                   -.100
days absent last year           -.012
total anti-smoking test score   .134
(Constant)                      -4.543
Unstandardized coefficients
We can write as follows:
D = (.024 × age) + (.080 × self-concept) − (.100 × anxiety) − (.012 × days absent) + (.134 × anti-smoking score) − 4.543
The Functions at Group Centroids table gives us the group means calculated using the discriminant function; these means are called centroids.
Functions at Group Centroids
smoke or not   Function 1
non-smoker     1.125
smoker         -1.598
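To show how the discriminant function and the centroids are used together, here is a Python sketch: compute D for one person and assign them to the group with the nearest centroid. The participant's values below are hypothetical, chosen purely for illustration; they are not from the data set.

```python
# Unstandardized coefficients from the table above
coefficients = {
    "age": 0.024,
    "selfcon": 0.080,
    "anxiety": -0.100,
    "absence": -0.012,
    "anti_smoking": 0.134,
}
constant = -4.543

# Group centroids from the Functions at Group Centroids table
centroids = {"non-smoker": 1.125, "smoker": -1.598}

# Hypothetical participant (illustrative values only)
person = {"age": 30, "selfcon": 40, "anxiety": 15, "absence": 5, "anti_smoking": 20}

# Discriminant score: weighted sum of the predictors plus the constant
d_score = constant + sum(coefficients[k] * person[k] for k in coefficients)

# Classify by the nearest group centroid
predicted = min(centroids, key=lambda g: abs(centroids[g] - d_score))
print(f"D = {d_score:.3f}, predicted group: {predicted}")
# D = 0.497, closer to 1.125 than to -1.598, so: non-smoker
```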
If we look at the two bar charts created based on discriminant scores calculated for each participant, we can see that the means are simply the centroids.
Finally, we find what percentage of the cases are correctly classified by the discriminant function: 238 of 257 non-smokers and 164 of 181 smokers are correctly classified (91.8% overall).
Classification Resultsa,c
                                          Predicted Group Membership
                  smoke or not            non-smoker   smoker     Total
Original          Count   non-smoker      238          19         257
                          smoker          17           164        181
                  %       non-smoker      92.6         7.4        100.0
                          smoker          9.4          90.6       100.0
Cross-validatedb  Count   non-smoker      238          19         257
                          smoker          17           164        181
                  %       non-smoker      92.6         7.4        100.0
                          smoker          9.4          90.6       100.0
a. 91.8% of original grouped cases correctly classified.
b. Cross validation is done only for those cases in the analysis. In cross validation, each case is classified by the functions derived from all cases other than that case.
c. 91.8% of cross-validated grouped cases correctly classified.
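The 91.8% figure can be verified directly from the counts in the table (a quick Python check, not SPSS output):

```python
# Counts from the Classification Results table above
correct_non_smokers, total_non_smokers = 238, 257
correct_smokers, total_smokers = 164, 181

correct = correct_non_smokers + correct_smokers   # 402 correctly classified
total = total_non_smokers + total_smokers         # 438 cases in all
accuracy = 100 * correct / total

print(f"{accuracy:.1f}% correctly classified")  # 91.8%
```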
REPORTING THE RESULT
When reporting the result, we should include the following:
Name of the predictors and sample size
Results of the Univariate ANOVAs and the Box's M test
The significance of the discriminant function
The variance explained (canonical correlation coefficient)
Significant predictors and their contribution to the model (discriminant function)
Result from the cross-validation process
You can write:
A discriminant analysis was conducted to predict whether an employee was a smoker or not. Predictor variables were age, number of days absent from work in the previous year, self-concept score, anxiety score, and attitude to anti-smoking workplace policy. A total of 438 cases were analyzed. Univariate ANOVAs revealed that smokers and non-smokers differed significantly on each of the five predictor variables. Box's M indicated that the assumption of equality of covariance matrices was violated; however, given the large sample, this problem is not regarded as serious. The discriminant function revealed a significant association between groups and all predictors (Λ = .356, χ² = 447.23, df = 5, p < .001), accounting for 64.32% of between-group variability, although closer analysis of the structure matrix revealed only two strong predictors, namely self-concept score (.706) and anxiety score (–.527), with age and absence being poor predictors. The cross-validated classification showed that overall 91.8% were correctly classified.
LOGISTIC REGRESSION
Logistic regression can also be used to predict group membership. The differences are:
Logistic regression does not require the predictor variables to be normally distributed.
Logistic regression can handle categorical predictors, whereas discriminant analysis can only work with scale variables.
Logistic regression gives us the odds ratio, which explains how the likelihood of an event changes with a one-unit increase in a predictor. In this respect, logistic regression helps to explain the mechanism of the change in membership, rather than providing a cut-off value for discriminating between the two groups.
When we have a large sample, equal group sizes, and normally distributed data, discriminant analysis is recommended, as it is more powerful than logistic regression.
We will try to answer the same research question: “Which predictor variables help discriminate between groups (smokers and non-smokers)?”
But this time we will also add gender as a predictor and use Stepwise (Forward: Likelihood Ratio) as the regression method.
In SPSS, Analyze > Regression > Binary Logistic
Move the variable smoke into the Dependent area and age, gender, selfcon, anxiety, absence, and anti_smoking into the Covariates area.
Click on the Categorical option to define the categorical variable in the analysis. In the Reference Category, we can choose either Last or First depending on how we code the variable.
Click Continue to proceed to the next step, then click Save to access the Save dialog box. Select Probabilities and Group membership under Predicted Values, and Standardized under Residuals. When finished, click Continue.
Click Options to access the next dialog box. The default options are suggested, but we probably want to know whether there are any outliers greater than 2 standard deviations and how many iterations it takes SPSS to arrive at the solution.
Click Continue to proceed and OK to run the analysis.
SPSS OUTPUT
The Classification Table shows the percentage of accurate classification using the baseline model, which predicts group membership by assigning all participants to the larger category.
Block 0. When no predictors are entered
Classification Tablea,b
                                       Predicted
                                       smoke or not            Percentage
Observed                               non-smoker   smoker     Correct
Step 0   smoke or not   non-smoker     257          0          100.0
                        smoker         181          0          .0
         Overall Percentage                                    58.7
a. Constant is included in the model.
b. The cut value is .500
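The 58.7% baseline is just the larger group's share of the sample (a quick Python check, not SPSS output):

```python
# Group sizes from the Block 0 classification table
non_smokers, smokers = 257, 181
total = non_smokers + smokers

# The constant-only model assigns everyone to the larger group,
# so its accuracy is simply that group's share of the sample.
baseline_accuracy = 100 * max(non_smokers, smokers) / total
print(f"{baseline_accuracy:.1f}%")  # 58.7%
```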
The Variables in the Equation table shows that without any predictor, using the constant only, b0 = -.351, and interestingly this model is significant. However, as the percentage of accurate classification is not very high, we will consider adding more predictors to the model.
Variables in the Equation
                   B      S.E.   Wald     df   Sig.   Exp(B)
Step 0   Constant  -.351  .097   13.053   1    .000   .704
The Variables not in the Equation table shows that all the variables are likely to contribute to the model, except gender, with p > .05 (non-significant).
Variables not in the Equation
                                 Score     df   Sig.
Step 0   Variables  selfcon      207.549   1    .000
                    anxiety      146.196   1    .000
                    absence      30.043    1    .000
                    anti_smoking 49.297    1    .000
                    gender(1)    .001      1    .981
         Overall Statistics      277.460   5    .000
Block 1. Enter predictors into the model step by step, with the most important one being selected first
We have 5 variables, and if we look at the Omnibus Tests of Model Coefficients table (which compares how much better each new model is than the baseline model), we see that only 4 variables are selected, because only these significantly improve the predictive power.
Omnibus Tests of Model Coefficients
                  Chi-square   df   Sig.
Step 1   Step     298.660      1    .000
         Block    298.660      1    .000
         Model    298.660      1    .000
Step 2   Step     88.665       1    .000
         Block    387.326      2    .000
         Model    387.326      2    .000
Step 3   Step     15.120       1    .000
         Block    402.446      3    .000
         Model    402.446      3    .000
Step 4   Step     6.643        1    .010
         Block    409.089      4    .000
         Model    409.089      4    .000
Gender does not significantly contribute to the model and hence is never entered, which can be seen in the table labeled Variables not in the Equation.
Variables not in the Equation
                                  Score    df   Sig.
Step 1   Variables  anxiety       72.171   1    .000
                    absence       4.449    1    .035
                    anti_smoking  23.145   1    .000
                    gender(1)     .000     1    .997
         Overall Statistics       90.107   4    .000
Step 2   Variables  absence       1.938    1    .164
                    anti_smoking  14.265   1    .000
                    gender(1)     .007     1    .935
         Overall Statistics       20.051   3    .000
Step 3   Variables  absence       6.460    1    .011
                    gender(1)     .005     1    .945
         Overall Statistics       7.246    2    .027
Step 4   Variables  gender(1)     .858     1    .354
         Overall Statistics       .858     1    .354
The Model Summary table tells us 2 things:
One is the -2 Log likelihood (-2LL) after each step. We expect a decrease in the -2LL, as this shows that the model's predictive power has improved. This should be accompanied by a significant chi-square test, which can be found in the Omnibus Tests of Model Coefficients table.
Second, the R-square values are calculated by 2 different approaches and hence can differ.
Model Summary
Step   -2 Log likelihood   Cox & Snell R Square   Nagelkerke R Square
1      295.282a            .494                   .666
2      206.617b            .587                   .791
3      191.497b            .601                   .810
4      184.854c            .607                   .818
Hosmer and Lemeshow Test
Step   Chi-square   df   Sig.
1      26.262       8    .001
2      54.184       8    .000
3      27.230       8    .001
4      42.041       8    .000
The Variables in the Equation table shows the coefficients (B), which can be used in the model to predict group membership, and the odds ratios (Exp(B)).
Interpreting the odds ratio:
If the odds ratio > 1: when the predictor increases, the odds of the event occurring increase. In our case: the odds of someone falling into the smoker group are 1.266 times higher for each additional point on the anxiety score.
If the odds ratio < 1: when the predictor increases, the odds of the event occurring decrease. So, for every extra point on the anti-smoking test, the odds of someone falling into the smoker group are reduced by a factor of .739.
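The odds ratio in the Exp(B) column is simply e raised to the B coefficient. A Python sketch (not SPSS output) using the Step 4 coefficients:

```python
import math

# B coefficients from Step 4 of the Variables in the Equation table
b = {"selfcon": -0.260, "anxiety": 0.236, "absence": 0.075, "anti_smoking": -0.303}

# Odds ratio = Exp(B); > 1 means the odds of smoking rise as the
# predictor increases, < 1 means they fall.
odds_ratios = {name: math.exp(coef) for name, coef in b.items()}
for name, ratio in odds_ratios.items():
    print(f"{name}: {ratio:.3f}")
# anxiety -> 1.266 (odds increase), anti_smoking -> 0.739 (odds decrease)
```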
Variables in the Equation
                         B       S.E.    Wald      df   Sig.   Exp(B)       95% C.I. for EXP(B)
                                                                            Lower     Upper
Step 1a   selfcon        -.240   .023    110.915   1    .000   .787         .752      .823
          Constant       8.221   .789    108.483   1    .000   3717.359
Step 2b   selfcon        -.253   .029    77.035    1    .000   .777         .734      .822
          anxiety        .253    .036    49.929    1    .000   1.288        1.201     1.382
          Constant       2.749   1.024   7.214     1    .007   15.629
Step 3c   selfcon        -.249   .030    67.724    1    .000   .780         .735      .827
          anxiety        .231    .035    44.068    1    .000   1.260        1.177     1.349
          anti_smoking   -.244   .067    13.247    1    .000   .783         .687      .893
          Constant       8.338   1.946   18.367    1    .000   4179.821
Step 4d   selfcon        -.260   .033    63.281    1    .000   .771         .724      .822
          anxiety        .236    .036    44.213    1    .000   1.266        1.181     1.357
          absence        .075    .030    6.214     1    .013   1.078        1.016     1.144
          anti_smoking   -.303   .075    16.286    1    .000   .739         .638      .856
          Constant       9.257   2.050   20.398    1    .000   10480.856
a. Variable(s) entered on step 1: selfcon.
b. Variable(s) entered on step 2: anxiety.
c. Variable(s) entered on step 3: anti_smoking.
d. Variable(s) entered on step 4: absence.
Finally, the Classification Table tells us the percentage of accurate prediction of group membership. We find that 238 non-smokers and 164 smokers are correctly classified, accounting for 91.8%.
Classification Tablea
                                        Predicted
                                        smoke or not            Percentage
Observed                                non-smoker   smoker     Correct
Step 1   smoke or not   non-smoker      225          32         87.5
                        smoker          26           155        85.6
         Overall Percentage                                     86.8
Step 2   smoke or not   non-smoker      234          23         91.1
                        smoker          26           155        85.6
         Overall Percentage                                     88.8
Step 3   smoke or not   non-smoker      238          19         92.6
                        smoker          21           160        88.4
         Overall Percentage                                     90.9
Step 4   smoke or not   non-smoker      238          19         92.6
                        smoker          17           164        90.6
         Overall Percentage                                     91.8
a. The cut value is .500
As SPSS codes non-smokers as 0 and smokers as 1, the cut value is .500: when the coefficients are used to calculate the probability that someone belongs to the smoker group, a value less than .500 suggests that the person belongs to the non-smoker group.
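Putting the Step 4 coefficients and the cut value together, here is a Python sketch of how a predicted probability is turned into a group. The participant's values are hypothetical, chosen purely for illustration.

```python
import math

# Step 4 coefficients from the Variables in the Equation table
b = {"selfcon": -0.260, "anxiety": 0.236, "absence": 0.075, "anti_smoking": -0.303}
constant = 9.257

# Hypothetical participant (illustrative values only)
person = {"selfcon": 40, "anxiety": 15, "absence": 5, "anti_smoking": 20}

# Logistic model: P(smoker) = 1 / (1 + e^-(constant + sum of B * x))
logit = constant + sum(b[k] * person[k] for k in b)
p_smoker = 1 / (1 + math.exp(-logit))

# SPSS codes smoker = 1, so the cut value of .500 splits the groups
group = "smoker" if p_smoker >= 0.5 else "non-smoker"
print(f"P(smoker) = {p_smoker:.3f} -> {group}")
```

For this hypothetical person the predicted probability falls below .500, so the model assigns them to the non-smoker group.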
REPORTING THE RESULT
We can write a report like this:
A logistic regression analysis was conducted with age, gender, number of days absent from work in the previous year, self-concept score, anxiety score, and attitude to anti-smoking workplace policy as predictors. A total of 438 cases were analyzed. The full model significantly predicted whether an employee was a smoker or non-smoker (χ² = 42.04, df = 8, p < .001), accounting for between 60.7% and 81.8% of the variance in group membership, with 92.6% of non-smokers and 90.6% of smokers successfully predicted. Table 1 presents the beta values, their standard errors and significance values, and the odds ratios with their confidence intervals.
Table 1
Results of the Logistic Regression
                          B         S.E.    Odds Ratio   95% C.I. for Odds Ratio
                                                         Lower     Upper
constant                  9.257**   2.050   10480.856
self-concept              -.260**   .033    .771         .724      .822
anxiety                   .236**    .036    1.266        1.181     1.357
absence                   .075*     .030    1.078        1.016     1.144
anti-smoking test score   -.303**   .075    .739         .638      .856
Notes. R² = .607 (Cox & Snell), .818 (Nagelkerke). Model χ² (8) = 42.0, p < .001. *p < .05. **p < .01.
ASSIGNMENT 7
In this assignment, the dependent (grouping) variable is workcon (working conditions).
We want to know whether the following variables help to predict whether a person comes from a good or an unpleasant working environment:
profdev: professional development available
conflict: level of conflict between employees and bosses
regulat: imposition of rules
jobvar: job swapping
team: team spirit
standrds: work performance standards
Conduct both the discriminant and binary logistic regression, report the results, and compare the difference in the accuracy of the model in predicting one’s working environment.
The data file is named working_environment.sav.
The assignment is adapted from Burns and Burns (2008).