Intermediate Applied Statistics STAT 460 Lecture 20, 11/19/2004 Instructor: Aleksandra (Seša) Slavković sesa@stat.psu.edu TA: Wang Yu wangyu@stat.psu.edu

Page 1: Intermediate Applied Statistics STAT 460 Lecture 20, 11/19/2004 Instructor: Aleksandra (Seša) Slavković sesa@stat.psu.edu TA: Wang Yu wangyu@stat.psu.edu.

Intermediate Applied Statistics STAT 460

Lecture 20, 11/19/2004

Instructor: Aleksandra (Seša) Slavković, sesa@stat.psu.edu

TA: Wang Yu, wangyu@stat.psu.edu

Revised schedule

Nov 8: lab on two-way ANOVA
Nov 10: lecture on two-way ANOVA and blocking
Post HW9
Nov 12: lecture on repeated measures and review
Nov 15: lab on repeated measures
Nov 17: lecture on categorical data/logistic regression
HW9 due
Post HW10
Nov 19: lecture on categorical data/logistic regression
Nov 22: lab on logistic regression & project II introduction
No class (Thanksgiving)
No class (Thanksgiving)
Nov 29: lab
Dec 1: lecture
HW10 due
Post HW11
Dec 3: lecture and Quiz
Dec 6: lab
Dec 8: lecture
HW11 due
Dec 10: lecture & project II due
Dec 13: Project II due

Last lecture

Categorical Data

This lecture

Categorical Data/Response (ch. 18,19,20) Odds

Review: Categorical Variable

Notation:
Population proportion = π (sometimes we use p)
Population size = N
Sample proportion = $\hat{\pi}$ = X/n = (# with trait) / (total #)
Sample size = n

The Rule for Sample Proportions: If numerous samples of size n are taken, the frequency curve of the sample proportions ($\hat{\pi}$'s) from the various samples will be approximately normal with mean π and standard deviation $\sqrt{\pi(1-\pi)/n}$:

$\hat{\pi} \sim N(\pi,\ \pi(1-\pi)/n)$ (the second argument is the variance)
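The rule for sample proportions can be checked with a small simulation. The sketch below is only an illustration: the values π = 0.4 and n = 100, the number of repetitions, and the function names are assumptions chosen for the example, not values from the lecture.

```python
import math
import random


def sample_proportion_sd(pi: float, n: int) -> float:
    """Theoretical standard deviation of the sample proportion: sqrt(pi(1-pi)/n)."""
    return math.sqrt(pi * (1 - pi) / n)


def simulate_sample_proportions(pi: float, n: int, reps: int, seed: int = 0) -> list:
    """Draw `reps` samples of size n and return the sample proportion of each."""
    rng = random.Random(seed)
    return [sum(rng.random() < pi for _ in range(n)) / n for _ in range(reps)]


# Hypothetical population proportion and sample size (assumed for illustration).
pi, n = 0.4, 100
props = simulate_sample_proportions(pi, n, reps=2000)

# Empirical mean and standard deviation of the simulated sample proportions.
sim_mean = sum(props) / len(props)
sim_sd = math.sqrt(sum((p - sim_mean) ** 2 for p in props) / (len(props) - 1))
theory_sd = sample_proportion_sd(pi, n)

print("theoretical sd:", round(theory_sd, 4))
print("simulated mean:", round(sim_mean, 4), "simulated sd:", round(sim_sd, 4))
```

The simulated mean and standard deviation should land close to π and to the theoretical value, which is what the rule predicts.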

One-sample approximate z test and z-interval for π.
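The formulas for this slide did not survive in the transcript, so the sketch below uses the standard textbook versions as an assumption: the test statistic uses the null value π0 in the standard error, and the interval uses the sample proportion. The counts (60 successes out of 100) and the function name are hypothetical.

```python
import math
from statistics import NormalDist


def one_sample_z(x: int, n: int, pi0: float, conf: float = 0.95):
    """Approximate z test of H0: pi = pi0 and a z-interval for pi."""
    p_hat = x / n
    # Test statistic: standard error is computed under the null hypothesis.
    z = (p_hat - pi0) / math.sqrt(pi0 * (1 - pi0) / n)
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    # Interval: standard error is computed from the sample proportion.
    z_star = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    half_width = z_star * math.sqrt(p_hat * (1 - p_hat) / n)
    return z, p_value, (p_hat - half_width, p_hat + half_width)


# Hypothetical data: 60 of 100 individuals have the trait; test pi0 = 0.5.
z, p_value, interval = one_sample_z(60, 100, 0.5)
print(round(z, 2), round(p_value, 4), tuple(round(v, 3) for v in interval))
```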

Difference between proportions

These tests can be extended to test the difference in parameters π between two groups.
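One common form of the two-group test pools the two samples to estimate the standard error under the null hypothesis of equal proportions. The sketch below assumes that pooled form (the slide's own formula is not in the transcript) and uses the admission counts from the example later in this lecture: 35 of 80 men and 20 of 60 women admitted. The function name is hypothetical.

```python
import math
from statistics import NormalDist


def two_proportion_z(x1: int, n1: int, x2: int, n2: int):
    """Pooled two-sample z test of H0: pi1 = pi2."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Admitted / applicants: men 35 of 80, women 20 of 60.
z, p_value = two_proportion_z(35, 80, 20, 60)
print(round(z, 3), round(p_value, 3))
```

For a 2x2 table the square of this z statistic equals the chi-squared statistic for independence, so the two approaches lead to the same conclusion.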

Warning:

z-tests for proportions are based on an approximation. They don’t work for small samples. It is often said that n is large enough if $n\hat{\pi}$ and $n(1-\hat{\pi})$ are both greater than 5.

Because of improved computing power, an exact test based on the binomial distribution rather than the normal is now available in most software.
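A rough sketch of both ideas, under stated assumptions: the rule-of-thumb check follows the condition above, and the exact test uses one common two-sided convention (add up the probabilities of all outcomes no more likely than the observed one). Statistical software may use a different two-sided convention. The data (3 successes in 10 trials, π0 = 0.5) and the function names are hypothetical.

```python
from math import comb


def large_enough(x: int, n: int, threshold: int = 5) -> bool:
    """Rule of thumb: n*p_hat (= x) and n*(1 - p_hat) (= n - x) both exceed the threshold."""
    return x > threshold and (n - x) > threshold


def binom_pmf(k: int, n: int, p: float) -> float:
    """Binomial probability of exactly k successes in n trials."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


def exact_binomial_p_value(x: int, n: int, pi0: float) -> float:
    """Two-sided exact p-value: total probability of outcomes no more likely than x."""
    observed = binom_pmf(x, n, pi0)
    return sum(
        binom_pmf(k, n, pi0)
        for k in range(n + 1)
        if binom_pmf(k, n, pi0) <= observed * (1 + 1e-9)  # tolerance for float ties
    )


# Hypothetical small sample: 3 successes in 10 trials, H0: pi = 0.5.
x, n, pi0 = 3, 10, 0.5
print(large_enough(x, n))               # False: the z approximation is doubtful here
print(exact_binomial_p_value(x, n, pi0))  # 0.34375
```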

Analysis Grid (ref. Handout)

                       Quantitative Explanatory   Discrete Explanatory              Both
Quantitative Outcome   Regression                 ANOVA                             Regression (ANCOVA)
Discrete Outcome       Logistic Regression        Chi-Square Test of Independence   Logistic Regression

Contingency Table

A statistical tool for summarizing and displaying results for categorical variables

A two-way table is for two categorical variables

2x2 Table, for two categorical variables, each with two categories

Place the counts of each combination of the two variables in the appropriate cells of the table.

Explanatory variable as labels for the rows, response variable as labels for the columns.

Example

A university offers only two degree programs: English and Computer Science. Admission is competitive and there is a suspicion of discrimination against women in the admission process. Here is a two-way table of all applicants by sex and admission status:

         Male   Female   Total
Admit      35       20      55
Deny       45       40      85
Total      80       60     140

These data show an association between the sex of the applicants and their success in obtaining admission.

Marginal & Conditional Distributions

Marginal Distributions:

Row variable (by convention the explanatory variable): add up the values across each row, which sums out the column variable.

In our example table the rows are admission status, so the distribution is: 55, 85 (total 140)
Observed proportions: ‘admit’ = 55/140 = 0.39, ‘deny’ = 85/140 = 0.61
NOTE: they add up to 1

Column variable (by convention the response variable): add up the values down each column, which sums out the row variable.

In our example the distribution is?
Observed proportions are:
Do they add up to 1?

Marginal & Conditional Distributions

Conditional Distribution: conditional percentages; what percent of a particular row or column total the count in a cell is.

Conditional distribution of gender for those admitted:
% of admitted who are male = 35/55 = 0.64 = 64%
% of admitted who are female = ?

What is:
% of male applicants admitted = ?
% of female applicants admitted = ?
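The marginal and conditional distributions for the admissions table can be computed directly from the four cell counts. This is a minimal sketch; the dictionary layout and variable names are assumptions made for the illustration.

```python
# Cell counts from the admissions table: rows are admission status, columns are sex.
counts = {
    "Admit": {"Male": 35, "Female": 20},
    "Deny": {"Male": 45, "Female": 40},
}
sexes = ("Male", "Female")

total = sum(sum(row.values()) for row in counts.values())

# Marginal distribution of admission status (row totals / table total).
row_marginal = {status: sum(row.values()) / total for status, row in counts.items()}

# Marginal distribution of sex (column totals / table total).
col_totals = {sex: sum(counts[status][sex] for status in counts) for sex in sexes}
col_marginal = {sex: col_totals[sex] / total for sex in sexes}

# Conditional distribution of sex among those admitted (divide by the row total).
admit_total = sum(counts["Admit"].values())
sex_given_admit = {sex: counts["Admit"][sex] / admit_total for sex in sexes}

# Conditional admission rate within each sex (divide by the column total).
admit_given_sex = {sex: counts["Admit"][sex] / col_totals[sex] for sex in sexes}

for name, dist in [("row marginal", row_marginal), ("column marginal", col_marginal),
                   ("sex | admitted", sex_given_admit), ("admitted | sex", admit_given_sex)]:
    print(name, {k: round(v, 3) for k, v in dist.items()})
```

Note that the two conditional distributions answer different questions: one divides by the number admitted, the other by the number of applicants of each sex.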

Statistical Significance

An observed relationship is statistically significant if the chances of observing the relationship in the sample when there is no actual relationship in the population are small (usually less than 5%)

In other words, a relationship is statistically significant if that relationship is stronger than 95% of the relationships we would expect to see just by chance.

If we say that there was no statistically significant relationship found, that does not mean that there is no relationship at all!

Warnings:
If a sample size is small, strong relationships may not achieve significance.
If a sample size is large, even minor relationships could achieve significance, but these might not then have practical importance.

Chi-Squared Test (χ² Test)

A Chi-Squared Test for independence uses the chi-squared statistic (χ²) for a contingency table.

Follows a χ² distribution: skewed to the right, Min = 0, Max = infinity

As the strength of the observed relationship in the sample increases, the statistic increases.

It combines information about the strength of the relationship and the sample size into one number.

Can be calculated for any size contingency table. For a 2 x 2 table: if χ² > 3.84 then we have a statistically significant relationship (at the 5% level).

We either show (χ² > 3.84) or fail to show (χ² < 3.84) a significant relationship; that is, we either reject (χ² > 3.84) or fail to reject (χ² < 3.84) the claim of independence between the two variables, which is our null hypothesis.

H0: variables are independent
HA: variables are NOT independent
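To see how the statistic mixes strength of relationship with sample size, the sketch below computes χ² for the admissions counts from this lecture and for a hypothetical table with every count multiplied by 10. The proportions, and so the strength of the relationship, are identical in both tables. The function name and the factor of 10 are assumptions for the illustration.

```python
def chi_square(table):
    """Chi-squared statistic for a two-way table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for row, rt in zip(table, row_totals):
        for observed, ct in zip(row, col_totals):
            expected = rt * ct / n
            stat += (observed - expected) ** 2 / expected
    return stat


small = [[35, 20], [45, 40]]                           # admissions table (n = 140)
large = [[10 * count for count in row] for row in small]  # same proportions, n = 1400

print(round(chi_square(small), 2))  # below 3.84: not significant
print(round(chi_square(large), 2))  # ten times larger: significant
```

Multiplying every count by a constant multiplies χ² by the same constant, which is why a weak relationship can become statistically significant in a very large sample.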

χ² distribution

The chi-squared distribution with k-1 degrees of freedom acts as though it were the sum of the squares of k-1 independent Normal(0,1) random variables. (Not that you need to know.)

See table on pages 1100-1101 in textbook.

[Figure: χ² density curves for df = 3, 4, 5, and 10; horizontal axis from 0 to 7.5, vertical axis from 0 to 0.3]
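The sum-of-squares description also explains where the 3.84 cutoff for 2x2 tables comes from. A minimal sketch, assuming 1 degree of freedom: the upper-tail probability of a squared standard normal has a closed form via the complementary error function, and a simulation of squared normal draws should agree with it. Function names, the seed, and the number of draws are hypothetical.

```python
import math
import random


def chi2_df1_upper_tail(x: float) -> float:
    """P(Z^2 > x) for Z ~ Normal(0,1); valid for 1 degree of freedom only."""
    return math.erfc(math.sqrt(x / 2))


def simulate_chi2(df: int, reps: int, seed: int = 0) -> list:
    """Simulate chi-squared draws as sums of df squared standard normal draws."""
    rng = random.Random(seed)
    return [sum(rng.gauss(0, 1) ** 2 for _ in range(df)) for _ in range(reps)]


exact_tail = chi2_df1_upper_tail(3.84)
draws = simulate_chi2(df=1, reps=20000)
simulated_tail = sum(d > 3.84 for d in draws) / len(draws)

print(round(exact_tail, 4))   # close to 0.05
print(simulated_tail)         # should also be near 0.05
```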

You Must Know:

How to calculate the χ² statistic:
Compute the expected numbers
Compare the expected and observed numbers
Compute the χ² statistic

How to compare it to 3.84 for 2x2 tables

How to make a proper conclusion about the statistical relationship, and in general about the question of interest, for any two-way and k-way tables:
Calculate $\chi^2_{obs}$
Determine $\chi^2_{crit}$ based on α and df
Decision: if $\chi^2_{obs} < \chi^2_{crit}$, don't reject H0

For our example:

Computing the χ² statistic:

Expected number = the number of counts (individuals) that we expect to fall in a particular cell = (row total)(column total)/(table total)
Expected number of admitted male students = (55 x 80)/140 = 31.43
Expected number of admitted female students = ?

Observed number = the number of counts in the cell
Observed number of admitted male students = 35
Observed number of admitted female students = ?

Compare the observed and expected numbers: (observed - expected)²/expected
For male students: (35 - 31.43)²/31.43 = 0.41
For female students: = ?

Compute the statistic = sum of the above quantities over all the cells
In our case χ² = 1.56

Compare it to 3.84

Is it statistically significant? Are admission decisions independent of the gender?
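The hand calculation above can be reproduced in a few lines. This is a sketch rather than the course's SAS procedure; the function names are hypothetical, and the p-value line relies on a closed form that holds only for 1 degree of freedom (a 2x2 table).

```python
import math


def expected_counts(table):
    """Expected count per cell: (row total)(column total)/(table total)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[rt * ct / n for ct in col_totals] for rt in row_totals]


def chi_square_statistic(table):
    """Sum of (observed - expected)^2 / expected over all cells."""
    expected = expected_counts(table)
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )


# Rows: Admit, Deny. Columns: Male, Female.
observed = [[35, 20], [45, 40]]
expected = expected_counts(observed)
stat = chi_square_statistic(observed)
p_value = math.erfc(math.sqrt(stat / 2))  # upper tail for 1 df only

print([[round(e, 2) for e in row] for row in expected])
print(round(stat, 2))   # 1.56
print(stat > 3.84)      # False: fail to reject independence
```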

Relative Risk, Increased Risk, Odds Ratio

Quantifications of the chances of a particular outcome, and how these chances change

What are the chances that a randomly selected individual would fall into a particular category for a categorical variable?

There are two basic ways to express these chances:
Proportions = expressing one category as a proportion of the total
Proportion of admitted students who are female = 20/55 = 0.36
Odds = comparing one category to another
Odds of being admitted = 55 to 85 = 55/85 to 1

Expressing Proportions & Odds

There are 4 equivalent ways to express proportions: Percent = Proportion = Probability = Risk

36% (percent) of all admitted students are female
The proportion of admitted students who are female is 0.36
The probability that an admitted student is female is 0.36
The risk of being female, given admission, is 0.36

Odds = expressed by reducing the numbers with and without the characteristic we are interested in to the smallest possible whole numbers: The odds of being admitted = 55 to 85 = 11 to 17 = 11/17 to 1

Going back and forth between proportions and odds:
If the proportion has value p then the odds are p/(1 - p) to 1
If the odds of having a characteristic are a to b, then the proportion with the characteristic is a/(a+b)
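The two conversion rules can be sketched with exact fractions, which also reduces odds to lowest terms automatically. The function names are hypothetical; the counts are the 55 admitted and 85 denied applicants from the example.

```python
from fractions import Fraction


def odds_from_proportion(p):
    """Odds (to 1) for a proportion p: p / (1 - p)."""
    return p / (1 - p)


def proportion_from_odds(a, b):
    """Proportion with the characteristic when the odds are a to b: a / (a + b)."""
    return Fraction(a, a + b)


# Odds of being admitted: 55 admitted versus 85 denied.
odds_admit = Fraction(55, 85)
p_admit = Fraction(55, 140)

print(odds_admit)                       # 11/17, i.e. 11 to 17
print(odds_from_proportion(p_admit))    # 11/17
print(proportion_from_odds(11, 17))     # 11/28, the same as 55/140
```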

Generalized forms for the expressions:

Percentage with the characteristic = (number with the characteristic/total) x 100%

Proportion with the characteristic = (number with the characteristic/total)

Probability of having the characteristic = (number with the characteristic/total)

Risk of having the characteristic = (number with the characteristic/total)

Odds of having the characteristic = (number with the characteristic/number without the characteristic) to 1 = p/(1 - p), where p is the proportion with the characteristic

Types of Risk: Relative risk & Increased Risk

Relative risk = the ratio of the risks for each category of the explanatory variable
Relative risk of being a female based on whether you are rejected or accepted:

Risk of being female if you are rejected = 40/85 = 0.47
Risk of being female if you are accepted = 20/55 = 0.36
Relative risk = 0.47/0.36 = 1.31 to 1

What does this mean?

What does a relative risk of 1 mean?

Increased Risk = usually, the percent increase in risk
Increased risk = (change in risk/original risk) x 100%

Change in risk = 0.47 - 0.36 = 0.11
Original risk = Baseline risk = 0.36
Increased risk = (0.11/0.36) x 100% = 31%

The proportion of females is 31% higher among rejected applicants than among accepted applicants

Increased risk = (relative risk - 1.0) x 100%
Increased risk = (1.31 - 1.0) x 100% = 31%
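The same quantities can be computed without intermediate rounding. In this sketch (function names are hypothetical) the unrounded relative risk comes out near 1.29 and the increased risk near 29%; the slide's 1.31 and 31% result from rounding the two risks to 0.47 and 0.36 before dividing.

```python
def relative_risk(risk: float, baseline_risk: float) -> float:
    """Ratio of a risk to the baseline risk."""
    return risk / baseline_risk


def increased_risk_percent(risk: float, baseline_risk: float) -> float:
    """Percent increase in risk relative to the baseline risk."""
    return (risk - baseline_risk) / baseline_risk * 100


# Proportion female among rejected and among accepted applicants.
risk_rejected = 40 / 85
risk_accepted = 20 / 55   # baseline

rr = relative_risk(risk_rejected, risk_accepted)
increase = increased_risk_percent(risk_rejected, risk_accepted)

print(round(rr, 2))        # 1.29 without intermediate rounding
print(round(increase, 1))  # 29.4
# The two formulas for increased risk agree:
print(abs(increase - (rr - 1.0) * 100) < 1e-9)  # True
```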

Odds Ratio

First calculate the odds of having a characteristic versus not having it:
Odds of being female (versus male) among admitted applicants = 20/35 = 0.571429
Odds of being female (versus male) among rejected applicants = 40/45 = 0.888889

Then take the ratio of these odds:
Odds ratio = 0.888889/0.571429 = 1.5556
Not too close to 1.31, but sometimes it can be close to relative risk

Odds ratio = (upper left * lower right)/(upper right * lower left)
Sometimes you need to reverse denominator and numerator so that the ratio is greater than 1 (easier to interpret)
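Both routes to the odds ratio, the ratio of two odds and the cross-product shortcut, can be checked against each other. A small sketch with hypothetical function names, using the admissions table with rows Admit/Deny and columns Male/Female:

```python
from fractions import Fraction


def odds_ratio_cross_product(table):
    """(upper left * lower right) / (upper right * lower left) for a 2x2 table."""
    (a, b), (c, d) = table
    return Fraction(a * d, b * c)


def odds_ratio_from_odds(table):
    """Ratio of the female-to-male odds among rejected to those among admitted."""
    (admit_male, admit_female), (deny_male, deny_female) = table
    odds_admitted = Fraction(admit_female, admit_male)   # 20/35
    odds_rejected = Fraction(deny_female, deny_male)     # 40/45
    return odds_rejected / odds_admitted


table = [[35, 20], [45, 40]]
print(odds_ratio_cross_product(table))          # 14/9
print(float(odds_ratio_from_odds(table)))       # about 1.5556
```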

Misleading items about Risk/Odds

The baseline risk is missing

The time period of the risk is not identified

The reported risk is not necessarily your risk (relative risk vs. your risk)

Retrospective vs. Prospective study
Prospective: take a random sample and record success and failure in the future
Retrospective: take a random sample and record success and failure that happened in the past
In a retrospective study you can meaningfully interpret the odds ratio, but not individual odds

Simpson’s Paradox

Lurking variable = a variable that changes the nature of the association, or even reverses the direction of the relationship, between two other variables.

The nature of an association changes due to a lurking variable

In our example we didn’t consider the type of program (major) as a variable. What happens if we do, and if we construct two separate tables, one for each major?

Example of Simpson’s Paradox

Computer Science admits 50% of both males and females
English takes ¼ of both males and females
Now there doesn’t seem to be an association between sex and admission decision in either program
Hence, type of program was a lurking variable

Computer Science
         Male   Female
Admit      30       10
Deny       30       10
Total      60       20

English
         Male   Female
Admit       5       10
Deny       15       30
Total      20       40
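The effect of the lurking variable can be verified by computing admission rates within each program and then for the two programs combined. A minimal sketch; the data layout and names are assumptions, the counts are from the two tables above.

```python
# (admitted, applicants) by program and sex, from the two tables above.
programs = {
    "Computer Science": {"Male": (30, 60), "Female": (10, 20)},
    "English": {"Male": (5, 20), "Female": (10, 40)},
}


def admission_rate(admitted: int, applicants: int) -> float:
    return admitted / applicants


# Admission rate within each program.
per_program = {
    program: {sex: admission_rate(*cell) for sex, cell in by_sex.items()}
    for program, by_sex in programs.items()
}

# Admission rate after pooling the two programs.
combined = {}
for sex in ("Male", "Female"):
    admitted = sum(programs[p][sex][0] for p in programs)
    applicants = sum(programs[p][sex][1] for p in programs)
    combined[sex] = admission_rate(admitted, applicants)

print(per_program)  # equal rates for men and women within each program
print(combined)     # unequal rates once the programs are pooled
```

The pooled gap arises because most men applied to the program with the higher admission rate and most women applied to the one with the lower rate.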

Commands in SAS

To create contingency tables, calculate the chi-square statistic, etc.: Statistics/Table Analysis

To run the logistic regression: Statistics/Regression/Logistic

Next

Lab Monday: Categorical Data, Logistic Regression. We will work through the lab together and learn about logistic regression

Project II