Quantitative Methods Topic 9 Bivariate Relationships


2

Outline

Crosstab (exploring the relationship between two categorical variables)

Correlation (exploring the relationship between two variables, typically continuous)

3

Data file

YR12SURV2.SAV

YR12SURVEYCODING2.DOC

(Questionnaire) Holland2fory12data.doc

4

RANKMS and RANKINV

RANKMS is a constructed variable ranking informants on the amount of time spent on Maths and Science.

A high rank (e.g. 3) means the informant spent a lot of time on Maths and Science.

A low rank (e.g. 1) means the informant spent very little or no time on Maths and Science.

RANKINV is a similar rank-type variable focused on investigative interests, e.g. whether the informant is interested in laboratory work.

A high rank (e.g. 4) means the informant was very interested.

A low rank (e.g. 1) means the informant was least interested.

These are ordinal variables.

5

Relationships between two categorical variables

Examples of research questions

Is there a relationship between gender and student investigative interest?

Are males more likely to be interested in investigative activities than females?

Is the proportion of males at each level of investigative interest the same as the proportion of females?

6

Variables

To answer the research questions in the example above, we need to crosstabulate two variables:

GENDER

RANKINV

7

Hypothesis of independence

There is no association between the two variables gender and RANKINV

There is no difference in the proportion of females and males in each of the categories (levels) of investigative interest

8

How to do crosstabulations in SPSS

From the Data Editor menu bar select ANALYSE, then Descriptive Statistics, then Crosstabs

Move GENDER into the Column(s) window and RANKINV into the Row(s) window

Open the Statistics window, tick Chi-square, then click Continue to close

Open the Cells window; under Counts tick Observed, under Percentages tick Column, then click Continue

Click OK in the Crosstabs window to run

9

Screen

10

Output

There are two tables in the output that are important for us:

Table 1: Crosstab

Table 2: Chi-square

11

Table 1: Crosstab of RankInv by Gender

rankinv * gender Crosstabulation

                                    gender
                                 1         2       Total
rankinv  1   Count              113        61       174
             % within gender   30.7%     19.6%     25.6%
         2   Count              119        95       214
             % within gender   32.3%     30.5%     31.5%
         3   Count               82        76       158
             % within gender   22.3%     24.4%     23.3%
         4   Count               54        79       133
             % within gender   14.7%     25.4%     19.6%
Total        Count              368       311       679
             % within gender  100.0%    100.0%    100.0%

12

Interpreting Association in the Table

We can compare the column percentages along the rows and calculate the percentage point difference to see (in this case) whether females differ from males at each ‘level’ of interest.

In the RANKINV by GENDER crosstabulation, for example, 30.7% of females were in category 1 (very low investigative interest) compared with 19.6% of males, giving a percentage point difference of 11.1.

Similarly, there is a difference of 10.7 percentage points between the percentage of males with very high investigative interest (25.4%) and the percentage of females (14.7%).
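The comparison of column percentages can be checked with a few lines of Python. This is only a sketch outside SPSS: the counts are copied from the crosstab above, the column order (gender 1 = females, gender 2 = males) follows the reading given on this slide, and the function name `column_percentages` is made up for illustration.

```python
# Observed counts from the rankinv * gender crosstab (rows: rankinv 1 to 4).
# Assumption taken from the slide text: first column = females, second = males.
counts = {
    1: (113, 61),
    2: (119, 95),
    3: (82, 76),
    4: (54, 79),
}

female_total = sum(f for f, _ in counts.values())  # column total for females
male_total = sum(m for _, m in counts.values())    # column total for males


def column_percentages(counts, female_total, male_total):
    """Return {level: (female %, male %, male minus female in percentage points)}."""
    result = {}
    for level, (f, m) in counts.items():
        f_pct = round(100 * f / female_total, 1)
        m_pct = round(100 * m / male_total, 1)
        result[level] = (f_pct, m_pct, round(m_pct - f_pct, 1))
    return result


table = column_percentages(counts, female_total, male_total)
for level, (f_pct, m_pct, diff) in table.items():
    print(f"rankinv {level}: females {f_pct}%  males {m_pct}%  difference {diff:+} points")
```

A negative difference means the female percentage is higher at that level; a positive one means the male percentage is higher.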

13

Table 2: Chi-square Statistics generated by Crosstabs

Chi-Square Tests

                                Value     df   Asymp. Sig. (2-sided)   Exact Sig. (2-sided)
Pearson Chi-Square             18.504a     3   .000
Likelihood Ratio               18.642      3   .000
Linear-by-Linear Association   17.828      1   .000
McNemar Test                                                           .b
N of Valid Cases               679

a. 0 cells (.0%) have expected count less than 5. The minimum expected count is 60.92.
b. Computed only for a PxP table, where P must be greater than 1.

The Pearson Chi-Square row gives the chi-square value, the degrees of freedom and the significance level.

Footnote a helps us to check whether any expected counts are less than 5.

14

Tests of Statistical Significance for Tables

Chi-square is used to test the null hypothesis that there is no discrepancy between the observed and expected frequencies, i.e. that there is no association between the row and column variables.

Chi-square based statistics can be used independently of level of measurement.

If chi-square is significant (say Asymp. Sig. < 0.05) then we reject the null hypothesis and conclude that the data show some association compared with a (hypothetical) table in which the observed frequencies were determined solely by the separate distributions of the two crosstabulated variables (the ‘marginal distributions’).

If chi-square is not significant (say Asymp. Sig. > 0.05) then we do not reject the null hypothesis and conclude that the data provide no evidence of association compared with a (hypothetical) table in which the observed frequencies were determined solely by the separate distributions of the two crosstabulated variables (the ‘marginal distributions’).
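To make the idea of "expected frequencies determined by the marginal distributions" concrete, here is a minimal Python sketch that computes the Pearson chi-square by hand for the rankinv by gender table. The function name `pearson_chi_square` is illustrative and is not SPSS output; the critical value 7.815 is the standard table value for 3 degrees of freedom at the 0.05 level.

```python
def pearson_chi_square(observed):
    """Return (chi-square, degrees of freedom, minimum expected count)."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    chi2 = 0.0
    min_expected = float("inf")
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            # Expected count under independence: row total * column total / N
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (obs - expected) ** 2 / expected
            min_expected = min(min_expected, expected)
    df = (len(observed) - 1) * (len(observed[0]) - 1)
    return chi2, df, min_expected


# Counts from the crosstab: rows are rankinv 1 to 4, columns are gender 1 and 2.
observed = [[113, 61], [119, 95], [82, 76], [54, 79]]
chi2, df, min_expected = pearson_chi_square(observed)

CRITICAL_05_DF3 = 7.815  # chi-square critical value, df = 3, alpha = 0.05
print(round(chi2, 3), df, round(min_expected, 2))
print("reject H0 at 0.05:", chi2 > CRITICAL_05_DF3)
```

The printed values should agree with the Pearson Chi-Square row and footnote a of the SPSS table, which is a useful check that the hand calculation and the software are doing the same thing.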

15

Assumptions

Random samples

Independent observations: the different groups are independent of each other

The lowest expected frequency in any cell should be 5 or more

16

Chi-square statistics: limitations

Chi-square measures are sensitive to sample size: with a large sample, even a weak association can be statistically significant.
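One way to see this sensitivity is to keep the percentages in a table fixed and multiply every count by the same factor. The sketch below (plain Python, illustrative function name) doubles the rankinv by gender counts; the pattern of association is unchanged, yet the chi-square value doubles.

```python
def pearson_chi_square(observed):
    """Pearson chi-square statistic for a table of observed counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (obs - expected) ** 2 / expected
    return chi2


observed = [[113, 61], [119, 95], [82, 76], [54, 79]]
# Hypothetical table: same column percentages, twice as many respondents.
doubled = [[2 * cell for cell in row] for row in observed]

base = pearson_chi_square(observed)
twice = pearson_chi_square(doubled)
print(round(base, 2), round(twice, 2))
```

This is why a significant chi-square should be read together with the percentage point differences, not on its own.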

17

Interpretation of output from chi-square

The note under the table shows that you have not violated one of the assumptions of chi-square, the one concerning ‘minimum expected cell frequency’.

Pearson chi-square value: 18.5 with 3 degrees of freedom.

Chi-square is highly significant: the probability of this level of association occurring by chance is less than 0.001.

Degrees of freedom = (r-1)(c-1), where r and c are the numbers of categories in each of the two variables. Here (4-1)(2-1) = 3.

Conclusion: males are more likely than females to be interested in investigative activities.

18

Class activity 1: Produce a similar table using GENDER by RANKMS

19

Summary of analyses of association

RANKINV and RANKMS are ordinal variables with 4 and 3 categories respectively, constructed from the total score on investigative interests (RANKINV) and the proportion of curriculum time spent in Maths/Science (RANKMS).

Gender heads the columns; interest and curriculum participation in Maths/Science form the rows. Thus, by convention, gender is the explanatory or independent variable, and interest or curriculum participation is the response or dependent variable.

20

Correlations

Strengths of relationships between two variables.

21

Correlation

Examples of research questions

Is there a relationship between student achievement in mathematics and English language?

Is there a relationship between parents’ incomes and children’s VCE results?

Is there a correlation between SES and achievement?

How strong are these relationships?

22

Assumptions (1)

Scores are obtained using a random sample from populations

Independence of observations

The distribution of the variable(s) involved is normal

Homoscedasticity: the variance of the dependent variable is the same for all values of X (residual variance, or conditional variance)

Linearity: the relationship between the two variables should be linear.

Related pairs: both pieces of information must be from the same subjects

23

Data Set

Vietnam Data Set vnsample2.sav

24

Scatter plot

25

Producing a Scatterplot

GRAPHS-SCATTER-SIMPLE-DEFINE

Select MEASUREMENT score (pmes500) to make this the Y variable

Select NUMBER score (pnum500) to make this the X variable.

Click OK

The scatterplot should appear in the OUTPUT window.

26

Scatter plot

27

Interpretation of Scatter plot

Step 1: Checking for outliers

Step 2: Inspecting the distribution of data points

Are the data points scattered all over?

Are the data points neatly arranged in a narrow cigar shape?

Could we draw a straight line through the main cluster of points, or would a curved line better represent the points?

Is the shape of the cluster even from one end to the other? (If it starts off narrow and then gets wider, the data may violate the assumption of homoscedasticity: at different values of X, the variability of Y is different.)

Step 3: Determining the direction of the relationship between the two variables: positive or negative correlation

28

When values on two variables tend to go in the same direction, we call this a direct relationship.

The correlation between children’s ages and heights is a direct relationship.

That is, older children tend to be taller than younger children.

This is a direct relationship because children with higher ages tend to have higher heights.

Direct Relationship

29

When values on two variables tend to go in opposite directions, we call this an inverse relationship.

The correlation between students’ number of absences and level of achievement is an inverse relationship.

That is, students who are absent more often tend to have lower achievement.

This is an inverse relationship because children with higher numbers of absences tend to have lower achievement scores.

Inverse Relationship

30

Scatter plot

31

How to run correlation

Highlight ANALYSE, CORRELATE, BIVARIATE

Copy the two variables into the VARIABLES box

Check that the PEARSON box (for two continuous variables; see the notes for other variable types) and the two-tailed box are ticked

Click OK

32

OUTPUT AND INTERPRETATION

Correlations

                                                                      pnum500    pmes500
pnum500 NUMBER 500 SCORE IN PUPIL MATHEMATICS   Pearson Correlation   1          .821**
                                                Sig. (2-tailed)                  .000
                                                N                     733        733
pmes500 MEASUREMENT 500 SCORE IN PUPIL MATH.    Pearson Correlation   .821**     1
                                                Sig. (2-tailed)       .000
                                                N                     733        733

**. Correlation is significant at the 0.01 level (2-tailed).

Reading the table: the Pearson Correlation row is the correlation coefficient (r), the Sig. (2-tailed) row is the p value, and N is the number of cases.

Step 1: Checking information about sample size

Step 2: Determining the directions and strengths of the relationships

Step 3: Calculating the coefficient of determination (r²)

Step 4: Assessing the significance

33

Correlation Coefficient

The relationship between two variables may be expressed with a number between -1.00 and 1.00. This number is called a correlation coefficient.

The closer the correlation coefficient is to 0.00, the weaker the relationship between the two variables. The closer the coefficient is to 1.00 or -1.00, the stronger the relationship.

According to Cohen (1988):

r = .10 to .29 or r = -.10 to -.29: Small

r = .30 to .49 or r = -.30 to -.49: Medium

r = .50 to 1.00 or r = -.50 to -1.00: Large
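As a rough illustration of how r is obtained and then labelled with Cohen's guidelines, here is a small Python sketch. The five score pairs are made up for the example (they are not taken from vnsample2.sav), and the function names `pearson_r` and `cohen_label` are illustrative.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient for two equal-length lists of scores."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def cohen_label(r):
    """Label the size of r using the Cohen (1988) bands quoted on the slide."""
    size = abs(r)  # the sign gives direction, not strength
    if size >= 0.50:
        return "large"
    if size >= 0.30:
        return "medium"
    if size >= 0.10:
        return "small"
    return "below the 'small' band"


# Hypothetical scores for five pupils (illustration only).
number = [420, 480, 500, 530, 610]
measurement = [400, 470, 520, 510, 600]

r = pearson_r(number, measurement)
print(round(r, 3), cohen_label(r))
print(cohen_label(0.821))  # the r reported in the SPSS output above
```

Note that the label depends only on the absolute value of r, so r = -.35 and r = .35 are both "medium".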

34

Some caveats about correlation and scatter plots - 1

Make a scatter plot of Measurement score against Number score again.

This time, double click on the plot to get into Chart Editor.

Change both X and Y axis scales to have a minimum of 200 and a maximum of 750.

Does the strength of the relationship look weaker in this graph as compared to the one where the min is 0 and max is 1000?

35

Some caveats about correlation and scatter plots - 2

Be aware that judging the strength of a relationship from the visual impression of a scatter plot can be misleading, as the scale of the plot can make a difference.

36

Some caveats about correlation and scatter plots - 3

Create a new variable pmes10 using TRANSFORM, COMPUTE such that pmes10 = pmes500/100 + 5

That is, we have transformed the measurement score (scaled to a mean of 500 and a standard deviation of 100) to have a mean of 10 and a standard deviation of 1.

Compute the correlation between pmes10 and pnum500.

How does this correlation compare with the correlation between pmes500 and pnum500?
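The answer can be previewed outside SPSS. The sketch below uses five made-up score pairs (not the real data file) and an illustrative `pearson_r` function to show that rescaling one variable with pmes500/100 + 5 leaves the correlation unchanged.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient for two equal-length lists of scores."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical scores (illustration only, not from vnsample2.sav).
pmes500 = [400, 470, 520, 510, 600]
pnum500 = [420, 480, 500, 530, 610]

# Same transformation as on the slide: divide by 100, then add 5.
pmes10 = [score / 100 + 5 for score in pmes500]

r_original = pearson_r(pmes500, pnum500)
r_rescaled = pearson_r(pmes10, pnum500)
print(round(r_original, 6), round(r_rescaled, 6))
```

Dividing by a positive constant and adding a constant changes the mean and standard deviation but not the relative positions of the pupils, which is all the correlation coefficient depends on.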

37

Some caveats about correlation and scatter plots - 4

Compute the correlation between pmes500 and pnum500, but only for scores between 300 and 600.

You can do this by selecting a sample: DATA, SELECT CASES, If condition is satisfied:

pnum500 > 300 and pnum500 < 600 and pmes500 > 300 and pmes500 < 600

How does the correlation compare with that from the full sample?
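The effect of restricting the range can also be explored with simulated data. The sketch below does not use vnsample2.sav: it generates 5000 hypothetical score pairs with a mean of 500, a standard deviation of 100 and an assumed population correlation of 0.8, then applies the same selection condition as above.

```python
import random
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient for two equal-length lists of scores."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


random.seed(9)  # fixed seed so the simulation is repeatable
RHO = 0.8       # assumed population correlation (illustration only)

pairs = []
for _ in range(5000):
    z1 = random.gauss(0, 1)
    z2 = RHO * z1 + sqrt(1 - RHO ** 2) * random.gauss(0, 1)
    pairs.append((500 + 100 * z1, 500 + 100 * z2))

# Same condition as the SPSS Select Cases step: both scores between 300 and 600.
restricted = [(x, y) for x, y in pairs if 300 < x < 600 and 300 < y < 600]

r_full = pearson_r([x for x, _ in pairs], [y for _, y in pairs])
r_restricted = pearson_r([x for x, _ in restricted], [y for _, y in restricted])
print(len(pairs), round(r_full, 3))
print(len(restricted), round(r_restricted, 3))
```

In simulations like this the restricted correlation comes out lower than the full-sample one, which is the pattern to look for when repeating the exercise on the real data.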

38

Some caveats about correlation and scatter plots - 5

This link shows TIMSS and PISA 2003 maths country mean scores for 22 countries: TIMSSandPISA 2003.doc

Plot a scatter graph between TIMSS and PISA scores

Compute the correlation

Repeat without Tunisia and Indonesia.

39

The coefficient of determination is the squared correlation coefficient, and represents the proportion of common variation in the two variables

40

The proportion of variance explained is equal to the square of the correlation coefficient

If the correlation between alternate forms of a standardized test is 0.80, then (0.80)² = 0.64 of the variance in scores on one form of the test is explained by, or associated with, the variance of scores on the other form

That is, 64% of the variance one sees in scores on one form is associated with the variance of scores on the other form. Consequently, only 36% (100% – 64%) of the variance of scores on one form is unassociated with variance of scores on the other form.

Proportion of variance explained
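The same arithmetic can be applied to the correlations reported in the SPSS output in these slides. As a worked sketch (values rounded to three decimals):

```latex
\begin{aligned}
r = .80  &\;\Rightarrow\; r^2 = (.80)^2 = .64,            &\quad 1 - r^2 &= .36 \\
r = .821 &\;\Rightarrow\; r^2 = (.821)^2 \approx .674,    &\quad 1 - r^2 &\approx .326 \\
r = .752 &\;\Rightarrow\; r^2 = (.752)^2 \approx .566,    &\quad 1 - r^2 &\approx .434
\end{aligned}
```

So the number and measurement scores (r = .821) share roughly 67% of their variance. For the reading and mathematics scores (r = .752), squaring the rounded r gives about .566, which agrees with the R Square of .565 in the Model Summary up to rounding.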

41

Presenting the results for Correlation

Purpose of the test

Variables involved

r values, number of cases, p value

R-square

Interpretation

42

Class activity 1

What are the correlations between three dimensions of mathematics: number, measurement and space?

Is it true that students who perform well in the number strand also do well in the measurement and space strands?

Datafile: VNsample2.sav

Variables: student scores in measurement, number and space

43

Regression line

44

Producing a Scatterplot with regression line

GRAPHS-SCATTER-SIMPLE-DEFINE

Select MEASUREMENT 500 SCORE IN PUPIL MATH to make this the Y variable

Select NUMBER 500 SCORE IN PUPIL MATH to make this the X variable.

Click OK

The scatterplot should appear in the OUTPUT window.

Double click anywhere in the scatter plot to open SPSS CHART EDITOR, click on ELEMENTS > FIT LINE AT TOTAL, and make sure that LINEAR is selected.

45

Regression line

46

Regression line

[Scatter plot with fitted linear trend line: y = 0.7444x + 124.76, R² = 0.5654. Both axes run from 0 to 1000.]

47

Linear Regression

Examples of research questions

To what extent can student achievement in Vietnamese language predict student achievement in mathematics?

How well can family income (or wealth) predict student performance?

How well can university entrance scores predict student success at university?

48

Regression equation

y = a + bx + e

y and x are the dependent and independent variables respectively.

a is the intercept (the point at which the line cuts the vertical axis).

b is the slope of the line, or the regression coefficient.

e is the error term.
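To show where a and b come from, here is a minimal least-squares sketch in Python. The five data points are invented purely for illustration, and `fit_line` is a hypothetical helper name; SPSS performs the equivalent calculation when a linear regression with one predictor is run.

```python
def fit_line(x, y):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx            # slope (regression coefficient)
    a = mean_y - b * mean_x  # intercept: the line passes through the two means
    return a, b


# Invented data (illustration only).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

a, b = fit_line(x, y)
# The error term e for each case is the observed y minus the predicted y.
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
print(round(a, 3), round(b, 3))
print([round(e, 3) for e in residuals])
```

With an intercept in the model the residuals sum to zero, which is one quick check that a fitted line is the least-squares line.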

49

How to RUN REGRESSION

Highlight ANALYSE, REGRESSION, LINEAR

Copy the continuous dependent variable into the DEPENDENT box

Copy the independent variable into the INDEPENDENT box

For METHOD, make sure that ENTER is selected.

Click STATISTICS and tick ESTIMATES, MODEL FIT and DESCRIPTIVES, then CONTINUE

Click on OPTIONS, then INCLUDE CONSTANT IN EQUATION, EXCLUDE CASES PAIRWISE, then CONTINUE

Click OK

50

An example for regression

Research question: To what extent can student scores in reading predict student scores in mathematics?

Datafile: VNsample2.sav

Variables: Reading 500 scores, Mathematics 500 scores

51

OUTPUT AND INTERPRETATION

Descriptive Statistics

                                                     Mean       Std. Deviation   N
prd500 PUPIL READING 500 SCOR. [MEAN=500/SD=100]     494.8215   104.11670        733
pma500 PUPIL MATHEMATICS 500 SCORE                   493.1033   103.07274        733

Correlations

                                                                           prd500   pma500
Pearson Correlation   prd500 PUPIL READING 500 SCOR. [MEAN=500/SD=100]     1.000    .752
                      pma500 PUPIL MATHEMATICS 500 SCORE                   .752     1.000
Sig. (1-tailed)       prd500 PUPIL READING 500 SCOR. [MEAN=500/SD=100]     .        .000
                      pma500 PUPIL MATHEMATICS 500 SCORE                   .000     .
N                     prd500 PUPIL READING 500 SCOR. [MEAN=500/SD=100]     733      733
                      pma500 PUPIL MATHEMATICS 500 SCORE                   733      733

Step 1: Checking the descriptive statistics

52

OUTPUT AND INTERPRETATION (2)

Model Summary

Model   R       R Square   Adjusted R Square   Std. Error of the Estimate
1       .752a   .565       .565                67.99595

a. Predictors: (Constant), prd500 PUPIL READING 500 SCOR. [MEAN=500/SD=100]

Step 2: Evaluating the model

53

Output and interpretation

Coefficients(a)

                                                          Unstandardized Coefficients   Standardized Coefficients
Model                                                     B          Std. Error         Beta                        t        Sig.
1   (Constant)                                            124.761    12.205                                         10.222   .000
    prd500 PUPIL READING 500 SCOR. [MEAN=500/SD=100]      .744       .024               .752                        30.839   .000

a. Dependent Variable: pma500 PUPIL MATHEMATICS 500 SCORE

Step 3: Evaluating the effect of the independent variable

The t and Sig. columns give the t-tests with significance levels for the constant (a) and the regression coefficient (b).
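The coefficients in this output can be plugged back into the regression equation. The following Python sketch copies the rounded values from the SPSS tables in these slides; the function name `predict_maths` and the example reading score of 500 are illustrative choices, and results differ slightly from SPSS because the inputs are rounded.

```python
# Values copied (rounded) from the Coefficients and Descriptive Statistics tables.
INTERCEPT = 124.761     # a: the (Constant) row, column B
SLOPE = 0.744           # b: the prd500 row, column B
SE_SLOPE = 0.024        # standard error of b
SD_READING = 104.11670  # Std. Deviation of prd500
SD_MATHS = 103.07274    # Std. Deviation of pma500


def predict_maths(reading_score):
    """Predicted mathematics score from y = a + b*x."""
    return INTERCEPT + SLOPE * reading_score


# Predicted maths score for a pupil with a reading score of 500 (example value).
print(round(predict_maths(500), 3))

# Standardized coefficient: b rescaled by the ratio of standard deviations.
beta = SLOPE * SD_READING / SD_MATHS
print(round(beta, 3))

# t statistic for the slope: coefficient divided by its standard error.
t_slope = SLOPE / SE_SLOPE
print(round(t_slope, 1))
```

With one predictor the standardized coefficient equals the Pearson correlation between the two variables, which is why Beta and R are both .752 in the output.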

54

Presenting the results for regression

Purpose of the test

Variables involved

Number of cases

Intercept

Unstandardised (b) and standardised (beta) coefficients, SE, p value

R-square

Interpretation of the relationship

55

Class activity 2

To what extent can school resources predict student achievement in maths and Vietnamese language?

Datafile: VNsample2.sav

Variables: school resources index, math achievement
