Lecture 7 guidelines_and_assignment
Introduction to Applied Statistics and Applied Statistical Methods Practical guidelines
Prof. Dr. Chang Zhu page 1
Table of Contents
LECTURE 7
  CHI-SQUARE TEST (CROSS-TAB)
    SPSS OUTPUT
    REPORTING THE RESULT
  DISCRIMINANT ANALYSIS
    SPSS OUTPUT
    REPORTING THE RESULT
  LOGISTIC REGRESSION
    SPSS OUTPUT
    REPORTING THE RESULT
ASSIGNMENT 7
LECTURE 7
CHI-SQUARE TEST (CROSS-TAB)
A group of students were classified in terms of personality (introvert or extrovert) and in terms of colour preference (red, yellow, green, or blue). Personality and colour preference are categorical variables. We want to answer this question:
Is there an association between personality and colour preference?
In SPSS, Analyze > Descriptive Statistics > Crosstab
Move the variable person to the Row(s) area and colour to the Column(s) area.
Click on the Statistics button and select the Chi-square option. Click Continue to proceed to the next step.
Click on the Cells button; under Counts, choose Observed and Expected. For Percentages, select Row, Column, and Total.
When finished, click Continue to proceed and OK to run the analysis.
SPSS OUTPUT
The personality type * favourite colour Crosstabulation table shows us the percentages at each level of the two variables.
The Chi-square Tests table gives us the significance of the test.
Chi-Square Tests
                               Value     df   Asymp. Sig. (2-sided)
Pearson Chi-Square             71.200a   3    .000
Likelihood Ratio               70.066    3    .000
Linear-by-Linear Association   69.124    1    .000
N of Valid Cases               400
REPORTING THE RESULT
We can write a conclusion like this:
There is a relationship between students' personality and colour preference: χ² (3, N = 400) = 71.20, p < .001.
personality type * favourite colour Crosstabulation
                                            red      yellow   green    blue     Total
introvert   Count                           20       6        30       44       100
            Expected Count                  50.0     10.0     20.0     20.0     100.0
            % within personality type       20.0%    6.0%     30.0%    44.0%    100.0%
            % within favourite colour       10.0%    15.0%    37.5%    55.0%    25.0%
            % of Total                      5.0%     1.5%     7.5%     11.0%    25.0%
extrovert   Count                           180      34       50       36       300
            Expected Count                  150.0    30.0     60.0     60.0     300.0
            % within personality type       60.0%    11.3%    16.7%    12.0%    100.0%
            % within favourite colour       90.0%    85.0%    62.5%    45.0%    75.0%
            % of Total                      45.0%    8.5%     12.5%    9.0%     75.0%
Total       Count                           200      40       80       80       400
            Expected Count                  200.0    40.0     80.0     80.0     400.0
            % within personality type       50.0%    10.0%    20.0%    20.0%    100.0%
            % within favourite colour       100.0%   100.0%   100.0%   100.0%   100.0%
            % of Total                      50.0%    10.0%    20.0%    20.0%    100.0%
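The chi-square statistic above can be reproduced by hand from the observed counts. The following is a Python sketch (not part of the SPSS workflow described here) that computes the expected counts and the Pearson chi-square for this crosstab:

```python
# Pearson chi-square for the personality * colour crosstab,
# computed from the observed counts reported above.
observed = {
    "introvert": [20, 6, 30, 44],   # red, yellow, green, blue
    "extrovert": [180, 34, 50, 36],
}

row_totals = {g: sum(v) for g, v in observed.items()}
col_totals = [sum(col) for col in zip(*observed.values())]
n = sum(row_totals.values())

# Expected count for each cell: (row total * column total) / N
chi_square = 0.0
for group, counts in observed.items():
    for obs, col_total in zip(counts, col_totals):
        expected = row_totals[group] * col_total / n
        chi_square += (obs - expected) ** 2 / expected

df = (len(observed) - 1) * (len(col_totals) - 1)
print(f"chi-square = {chi_square:.2f}, df = {df}")  # chi-square = 71.20, df = 3
```

This matches the Pearson Chi-Square value of 71.200 with df = 3 in the SPSS output.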
DISCRIMINANT ANALYSIS
A study is set up to determine whether the following variables help to discriminate between those who smoke and those who don't:
age
absence (days of absence last year)
selfcon (self-concept score)
anxiety (anxiety score)
anti_smoking (attitude towards anti-smoking policies)
In SPSS, Analyze > Classify > Discriminant
Move the categorical variable smoke into the Grouping Variable, and age, selfcon, anxiety, absence, and anti_smoking into the Independent area.
Click on Define Range to indicate the values that we have assigned for each group. In our case, 1 is for non-smokers and 2 is for smokers.
Click Continue to proceed to the next step.
Click on Statistics to access the Statistics dialog box.
Under Descriptive statistics, select Means, Univariate ANOVAs (which test whether the means differ between groups), and Box's M (which tests the homogeneity of the covariance matrices between groups; in this respect it is very similar to Levene's test).
Under the Function Coefficients, choose Unstandardized, then click Continue to proceed.
Click on Classify to access the Classification dialog box. Because the group sizes are unequal (the numbers of smokers and non-smokers differ), select Compute from group sizes for the Prior Probabilities.
Under Use Covariance Matrix, select Within-groups. The important options to select are Summary Tables (which shows what percentage of cases are correctly classified by the model) and Leave-one-out classification (cross-validation of the model).
It’s useful to select Separate-groups under the Plots as this will plot the variate score for each participant grouped according to whether they are classified as smokers or non-smokers. For more than 2 groups, the preferred option is Combined-groups.
Click Continue to proceed and select the Save option. We will choose Predicted group membership and Discriminant scores. This process will create 2 new variables in the data set.
Click Continue to proceed and OK to run the analysis.
SPSS OUTPUT
The first important table is obtained from the Univariate ANOVAs option; it indicates that the mean scores on the 5 predictors differ significantly between smokers and non-smokers. This is promising, i.e. we can use them as discriminants.
Tests of Equality of Group Means
                               Wilks' Lambda   F         df1   df2   Sig.
self concept score             .526            392.672   1     436   .000
anxiety score                  .666            218.439   1     436   .000
days absent last year          .931            32.109    1     436   .000
total anti-smoking test score  .887            55.295    1     436   .000
age                            .980            8.781     1     436   .003
Box's M tests the null hypothesis that the covariance matrices do not differ between groups, and we want this test to be non-significant. The output shows p < .001, which is highly significant. However, because our sample size is large (n = 438), this is not a serious problem.
Test Results
Box's M       176.474
F   Approx.   11.615
    df1       15
    df2       600825.345
    Sig.      .000
Tests null hypothesis of equal population covariance matrices.
Next, the important value to look at in the Eigenvalues table is the canonical correlation, which is the multiple correlation between the predictors and the discriminant function (Burns & Burns, 2008). Its square is therefore interpreted like R² in multiple regression. With R = .802, it can be concluded that the model explains 64.32% of the variation in the grouping variable (smokers or non-smokers).
Eigenvalues
Function   Eigenvalue   % of Variance   Cumulative %   Canonical Correlation
1          1.806a       100.0           100.0          .802
a. First 1 canonical discriminant functions were used in the analysis.
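As a sanity check (a Python sketch, not SPSS output), both the canonical correlation and Wilks' Lambda can be recovered from the eigenvalue: for a single discriminant function, R² = λ / (1 + λ) and Λ = 1 / (1 + λ).

```python
import math

eigenvalue = 1.806  # from the Eigenvalues table above

# Squared canonical correlation: proportion of variance explained
r_squared = eigenvalue / (1 + eigenvalue)
canonical_r = math.sqrt(r_squared)

# Wilks' Lambda: the unexplained proportion of variance
wilks_lambda = 1 / (1 + eigenvalue)

print(f"R = {canonical_r:.3f}")               # .802, matching the table
print(f"Wilks' Lambda = {wilks_lambda:.3f}")  # .356, matching the Wilks' Lambda table
```

Note that Λ + R² = 1: the explained and unexplained proportions add up to the total variance.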
Wilks' Lambda is a statistic that tests the significance of the discriminant function. Its value is simply the proportion of unexplained variance, so we can treat it as a goodness-of-fit statistic; this test should be significant to confirm the model.
Wilks' Lambda
Test of Function(s)   Wilks' Lambda   Chi-square   df   Sig.
1                     .356            447.227      5    .000
The standardized canonical discriminant function coefficients table gives us a reference with regard to the importance of each predictor. According to the result, self-concept and anxiety scores are the strongest predictors.
Standardized Canonical Discriminant Function Coefficients
                                Function 1
self concept score              .763
anxiety score                   -.614
days absent last year           -.073
total anti-smoking test score   .378
age                             .212
The coefficients in the Canonical Discriminant Function Coefficients table are used to create the discriminant function to predict group membership.
Canonical Discriminant Function Coefficients
                                Function 1
age                             .024
self concept score              .080
anxiety score                   -.100
days absent last year           -.012
total anti-smoking test score   .134
(Constant)                      -4.543
Unstandardized coefficients
We can write as follows:
D = (.024 × age) + (.080 × self-concept) − (.100 × anxiety) − (.012 × days absent) + (.134 × anti-smoking score) − 4.543
The Functions at Group Centroids table gives us the group means calculated using the discriminant function; these means are called centroids.
Functions at Group Centroids
smoke or not   Function 1
non-smoker     1.125
smoker         -1.598
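To show how the discriminant function and the centroids are used together, here is a Python sketch: compute D for one person and assign them to the group with the nearest centroid. The participant's values below are hypothetical, chosen purely for illustration; they are not from the data set.

```python
# Unstandardized coefficients from the table above
coefficients = {
    "age": 0.024,
    "selfcon": 0.080,
    "anxiety": -0.100,
    "absence": -0.012,
    "anti_smoking": 0.134,
}
constant = -4.543

# Group centroids from the Functions at Group Centroids table
centroids = {"non-smoker": 1.125, "smoker": -1.598}

# Hypothetical participant (illustrative values only)
person = {"age": 30, "selfcon": 40, "anxiety": 15, "absence": 5, "anti_smoking": 20}

# Discriminant score: weighted sum of the predictors plus the constant
d_score = constant + sum(coefficients[k] * person[k] for k in coefficients)

# Classify by the nearest group centroid
predicted = min(centroids, key=lambda g: abs(centroids[g] - d_score))
print(f"D = {d_score:.3f}, predicted group: {predicted}")
# D = 0.497, closer to 1.125 than to -1.598, so: non-smoker
```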
If we look at the two bar charts created based on discriminant scores calculated for each participant, we can see that the means are simply the centroids.
Finally, we find what percentage of the cases are correctly classified by the discriminant function: 238 of 257 non-smokers and 164 of 181 smokers are correctly classified (91.8% overall).
Classification Resultsa,c
                                          Predicted Group Membership
                  smoke or not            non-smoker   smoker     Total
Original          Count   non-smoker      238          19         257
                          smoker          17           164        181
                  %       non-smoker      92.6         7.4        100.0
                          smoker          9.4          90.6       100.0
Cross-validatedb  Count   non-smoker      238          19         257
                          smoker          17           164        181
                  %       non-smoker      92.6         7.4        100.0
                          smoker          9.4          90.6       100.0
a. 91.8% of original grouped cases correctly classified.
b. Cross validation is done only for those cases in the analysis. In cross validation, each case is classified by the functions derived from all cases other than that case.
c. 91.8% of cross-validated grouped cases correctly classified.
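The 91.8% figure can be verified directly from the counts in the table (a quick Python check, not SPSS output):

```python
# Counts from the Classification Results table above
correct_non_smokers, total_non_smokers = 238, 257
correct_smokers, total_smokers = 164, 181

correct = correct_non_smokers + correct_smokers   # 402 correctly classified
total = total_non_smokers + total_smokers         # 438 cases in all
accuracy = 100 * correct / total

print(f"{accuracy:.1f}% correctly classified")  # 91.8%
```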
REPORTING THE RESULT
When reporting the result, we should include the following:
Name of the predictors and sample size
Results of the Univariate ANOVAs and the Box's M test
The significance of the discriminant function
The variance explained (canonical correlation coefficient)
Significant predictors and their contribution to the model (discriminant function)
Result from the cross-validation process
You can write:
A discriminant analysis was conducted to predict whether an employee was a smoker or not. Predictor variables were age, number of days absent from work in the previous year, self-concept score, anxiety score, and attitude to anti-smoking workplace policy. A total of 438 cases were analyzed. Univariate ANOVAs revealed that smokers and non-smokers differed significantly on each of the five predictor variables. Box's M indicated that the assumption of equality of covariance matrices was violated; however, given the large sample, this problem is not regarded as serious. The discriminant function revealed a significant association between groups and all predictors (Λ = .356, χ² = 447.23, df = 5, p < .001), accounting for 64.32% of between-group variability, although closer analysis of the structure matrix revealed only two strong predictors, namely self-concept score (.706) and anxiety score (–.527), with age and absence being poor predictors. The cross-validated classification showed that overall 91.8% were correctly classified.
LOGISTIC REGRESSION
Logistic regression can also be used to predict group membership. The differences are:
Logistic regression does not require the predictor variables to be normally distributed.
Logistic regression can handle categorical predictors, whereas discriminant analysis can only work with scale variables.
Logistic regression gives us the odds ratio, which explains how the likelihood of an event changes with a one-unit increase in a predictor. In this respect, logistic regression helps to explain the mechanism of the change in membership, rather than providing a cut-off value for discriminating between the two groups.
When we have a large sample, equal group sizes, and normally distributed data, discriminant analysis is recommended, as it is more powerful than logistic regression.
We will try to answer the same research question: “Which predictor variables help discriminate between groups (smokers and non-smokers)?”
But this time we will also add gender as a predictor and use Stepwise (Forward: Likelihood Ratio) as the regression method.
In SPSS, Analyze > Regression > Binary Logistic
Move the variable smoke into the Dependent area and age, gender, selfcon, anxiety, absence, and anti_smoking into the Covariates area.
Click on the Categorical option to define the categorical variable in the analysis. In the Reference Category, we can choose either Last or First depending on how we code the variable.
Click Continue to proceed to the next step, then click Save to access the Save dialog box. Select Probabilities and Group membership under Predicted Values, and Standardized under Residuals. When finished, click Continue.
Click Options to access the next dialog box. The default options are suggested, but we probably want to know whether there are any outliers greater than 2 standard deviations and how many iterations it takes SPSS to arrive at the solution.
Click Continue to proceed and OK to run the analysis.
SPSS OUTPUT
The Classification Table shows the percentage of accurate classification using the baseline model, which predicts group membership by assigning all participants to the larger category.
Block 0. When no predictors are entered
Classification Tablea,b
                                       Predicted
                                       smoke or not            Percentage
Observed                               non-smoker   smoker     Correct
Step 0   smoke or not   non-smoker     257          0          100.0
                        smoker         181          0          .0
         Overall Percentage                                    58.7
a. Constant is included in the model.
b. The cut value is .500
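The 58.7% baseline is just the larger group's share of the sample (a quick Python check, not SPSS output):

```python
# Group sizes from the Block 0 classification table
non_smokers, smokers = 257, 181
total = non_smokers + smokers

# The constant-only model assigns everyone to the larger group,
# so its accuracy is simply that group's share of the sample.
baseline_accuracy = 100 * max(non_smokers, smokers) / total
print(f"{baseline_accuracy:.1f}%")  # 58.7%
```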
The Variables in the Equation table shows that without any predictor, using the constant only, b0 = -.351, and interestingly this model is significant. However, as the percentage of accurate classification is not very high, we will consider adding more predictors to the model.
Variables in the Equation
                   B      S.E.   Wald     df   Sig.   Exp(B)
Step 0   Constant  -.351  .097   13.053   1    .000   .704
The Variables not in the Equation table shows that all the variables are likely to contribute to the model, except gender, with p > .05 (non-significant).
Variables not in the Equation
                                 Score     df   Sig.
Step 0   Variables  selfcon      207.549   1    .000
                    anxiety      146.196   1    .000
                    absence      30.043    1    .000
                    anti_smoking 49.297    1    .000
                    gender(1)    .001      1    .981
         Overall Statistics      277.460   5    .000
Block 1. Enter predictors into the model step by step, with the most important one being selected first
We have 5 variables, and if we look at the Omnibus Tests of Model Coefficients table (which compares how much better each new model is than the baseline model), we see that only 4 variables are selected, because only these significantly improve the predictive power.
Omnibus Tests of Model Coefficients
                  Chi-square   df   Sig.
Step 1   Step     298.660      1    .000
         Block    298.660      1    .000
         Model    298.660      1    .000
Step 2   Step     88.665       1    .000
         Block    387.326      2    .000
         Model    387.326      2    .000
Step 3   Step     15.120       1    .000
         Block    402.446      3    .000
         Model    402.446      3    .000
Step 4   Step     6.643        1    .010
         Block    409.089      4    .000
         Model    409.089      4    .000
Gender does not significantly contribute to the model and hence is never entered, which can be seen in the table labeled Variables not in the Equation.
Variables not in the Equation
                                  Score    df   Sig.
Step 1   Variables  anxiety       72.171   1    .000
                    absence       4.449    1    .035
                    anti_smoking  23.145   1    .000
                    gender(1)     .000     1    .997
         Overall Statistics       90.107   4    .000
Step 2   Variables  absence       1.938    1    .164
                    anti_smoking  14.265   1    .000
                    gender(1)     .007     1    .935
         Overall Statistics       20.051   3    .000
Step 3   Variables  absence       6.460    1    .011
                    gender(1)     .005     1    .945
         Overall Statistics       7.246    2    .027
Step 4   Variables  gender(1)     .858     1    .354
         Overall Statistics       .858     1    .354
The Model Summary table tells us 2 things:
One is the -2 Log likelihood (-2LL) after each step. We expect a decrease in the -2LL, as this shows that the model's predictive power has improved. This should be accompanied by a significant chi-square test, which can be found in the Omnibus Tests of Model Coefficients table.
Second, the R-square values are calculated by 2 different approaches and hence can differ.
Model Summary
Step   -2 Log likelihood   Cox & Snell R Square   Nagelkerke R Square
1      295.282a            .494                   .666
2      206.617b            .587                   .791
3      191.497b            .601                   .810
4      184.854c            .607                   .818
Hosmer and Lemeshow Test
Step   Chi-square   df   Sig.
1      26.262       8    .001
2      54.184       8    .000
3      27.230       8    .001
4      42.041       8    .000
The Variables in the Equation table shows the coefficients (B), which can be used in the model to predict group membership, and the odds ratios (Exp(B)).
Interpreting the odds ratio:
If the odds ratio > 1: when the predictor increases, the odds of the event occurring increase. In our case: the odds of someone falling into the smoker group are 1.266 times higher for each additional point on the anxiety score.
If the odds ratio < 1: when the predictor increases, the odds of the event occurring decrease. So, for every extra point on the anti-smoking test, the odds of someone falling into the smoker group are reduced by a factor of .739.
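The odds ratio in the Exp(B) column is simply e raised to the B coefficient. A Python sketch (not SPSS output) using the Step 4 coefficients:

```python
import math

# B coefficients from Step 4 of the Variables in the Equation table
b = {"selfcon": -0.260, "anxiety": 0.236, "absence": 0.075, "anti_smoking": -0.303}

# Odds ratio = Exp(B); > 1 means the odds of smoking rise as the
# predictor increases, < 1 means they fall.
odds_ratios = {name: math.exp(coef) for name, coef in b.items()}
for name, ratio in odds_ratios.items():
    print(f"{name}: {ratio:.3f}")
# anxiety -> 1.266 (odds increase), anti_smoking -> 0.739 (odds decrease)
```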
Variables in the Equation
                         B       S.E.    Wald      df   Sig.   Exp(B)       95% C.I. for EXP(B)
                                                                            Lower     Upper
Step 1a   selfcon        -.240   .023    110.915   1    .000   .787         .752      .823
          Constant       8.221   .789    108.483   1    .000   3717.359
Step 2b   selfcon        -.253   .029    77.035    1    .000   .777         .734      .822
          anxiety        .253    .036    49.929    1    .000   1.288        1.201     1.382
          Constant       2.749   1.024   7.214     1    .007   15.629
Step 3c   selfcon        -.249   .030    67.724    1    .000   .780         .735      .827
          anxiety        .231    .035    44.068    1    .000   1.260        1.177     1.349
          anti_smoking   -.244   .067    13.247    1    .000   .783         .687      .893
          Constant       8.338   1.946   18.367    1    .000   4179.821
Step 4d   selfcon        -.260   .033    63.281    1    .000   .771         .724      .822
          anxiety        .236    .036    44.213    1    .000   1.266        1.181     1.357
          absence        .075    .030    6.214     1    .013   1.078        1.016     1.144
          anti_smoking   -.303   .075    16.286    1    .000   .739         .638      .856
          Constant       9.257   2.050   20.398    1    .000   10480.856
a. Variable(s) entered on step 1: selfcon.
b. Variable(s) entered on step 2: anxiety.
c. Variable(s) entered on step 3: anti_smoking.
d. Variable(s) entered on step 4: absence.
Finally, the Classification Table tells us the percentage of accurate prediction of group membership. We find that 238 non-smokers and 164 smokers are correctly classified, accounting for 91.8%.
Classification Tablea
                                        Predicted
                                        smoke or not            Percentage
Observed                                non-smoker   smoker     Correct
Step 1   smoke or not   non-smoker      225          32         87.5
                        smoker          26           155        85.6
         Overall Percentage                                     86.8
Step 2   smoke or not   non-smoker      234          23         91.1
                        smoker          26           155        85.6
         Overall Percentage                                     88.8
Step 3   smoke or not   non-smoker      238          19         92.6
                        smoker          21           160        88.4
         Overall Percentage                                     90.9
Step 4   smoke or not   non-smoker      238          19         92.6
                        smoker          17           164        90.6
         Overall Percentage                                     91.8
a. The cut value is .500
As SPSS codes non-smokers as 0 and smokers as 1, the cut value is .500: when the coefficients are used to calculate the probability that someone belongs to the smoker group, a value less than .500 suggests that the person belongs to the non-smoker group.
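Putting the Step 4 coefficients and the cut value together, here is a Python sketch of how a predicted probability is turned into a group. The participant's values are hypothetical, chosen purely for illustration.

```python
import math

# Step 4 coefficients from the Variables in the Equation table
b = {"selfcon": -0.260, "anxiety": 0.236, "absence": 0.075, "anti_smoking": -0.303}
constant = 9.257

# Hypothetical participant (illustrative values only)
person = {"selfcon": 40, "anxiety": 15, "absence": 5, "anti_smoking": 20}

# Logistic model: P(smoker) = 1 / (1 + e^-(constant + sum of B * x))
logit = constant + sum(b[k] * person[k] for k in b)
p_smoker = 1 / (1 + math.exp(-logit))

# SPSS codes smoker = 1, so the cut value of .500 splits the groups
group = "smoker" if p_smoker >= 0.5 else "non-smoker"
print(f"P(smoker) = {p_smoker:.3f} -> {group}")
```

For this hypothetical person the predicted probability falls below .500, so the model assigns them to the non-smoker group.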
REPORTING THE RESULT
We can write a report like this:
A logistic regression analysis was conducted with age, gender, number of days absent from work in the previous year, self-concept score, anxiety score, and attitude to anti-smoking workplace policy as predictors. A total of 438 cases were analyzed. The full model significantly predicted whether an employee was a smoker or non-smoker (χ² = 42.04, df = 8, p < .001), accounting for between 60.7% and 81.8% of the variance in group membership, with 92.6% of non-smokers and 90.6% of smokers successfully predicted. Table 1 presents the beta values, their standard errors and significance values, and the odds ratios with their confidence intervals.
Table 1
Results of the Logistic Regression
                          B         S.E.    Odds Ratio   95% C.I. for Odds Ratio
                                                         Lower     Upper
constant                  9.257**   2.050   10480.856
self-concept              -.260**   .033    .771         .724      .822
anxiety                   .236**    .036    1.266        1.181     1.357
absence                   .075*     .030    1.078        1.016     1.144
anti-smoking test score   -.303**   .075    .739         .638      .856
Notes. R² = .607 (Cox & Snell), .818 (Nagelkerke). Model χ² (8) = 42.0, p < .001. *p < .05. **p < .01.
ASSIGNMENT 7
In this assignment, the dependent (grouping) variable is workcon (working conditions).
We want to know whether the following variables help to predict whether a person comes from a good or an unpleasant working environment:
profdev: professional development available
conflict: level of conflict between employees and bosses
regulat: imposition of rules
jobvar: job swapping
team: team spirit
standrds: work performance standards
Conduct both the discriminant and binary logistic regression, report the results, and compare the difference in the accuracy of the model in predicting one’s working environment.
The data file is named working_environment.sav.
The assignment is adapted from Burns and Burns (2008).