EIPB 698E Lecture 10

Raul Cruz-Cano, Fall 2013

Comments for future evaluations

• Include only output used for conclusions

• Mention p-values explicitly (the Equal Variance p-value is also necessary) and state conclusions clearly

• Everybody can attend next week’s review

Proc Reg

• The REG procedure is one of many regression procedures in the SAS System.

PROC REG < options > ;
   MODEL dependents=<regressors> < / options > ;
   BY variables ;
   OUTPUT < OUT=SAS-data-set > keyword=names ;

/* Read the raw blood data. */
data blood;
   INFILE 'F:\blood.txt';
   INPUT subjectID $ gender $ bloodtype $ age_group $ RBC WBC cholesterol;
run;

/* Recode the character variables as 0/1 indicators:
   each indicator is 1 for the named level and 0 otherwise. */
data blood1;
   set blood;
   if gender='Female' then sex=1; else sex=0;
   if bloodtype='A' then typeA=1; else typeA=0;
   if bloodtype='B' then typeB=1; else typeB=0;
   if bloodtype='AB' then typeAB=1; else typeAB=0;
   if age_group='Old' then Age_old=1; else Age_old=0;
run;

/* Regress cholesterol on the indicators plus RBC and WBC. */
proc reg data=blood1;
   model cholesterol = sex typeA typeB typeAB Age_old RBC WBC;
run;

Proc reg output

Analysis of Variance

                              Sum of         Mean
Source             DF        Squares       Square   F Value   Pr > F
Model               7          41237   5891.02895      2.54   0.0140
Error             655        1521839   2323.41811
Corrected Total   662        1563076

DF - These are the degrees of freedom associated with the sources of variance. (1) The total variance has N-1 degrees of freedom (663-1=662). (2) The model degrees of freedom equal the number of estimated coefficients, counting the intercept, minus 1. Here there are 8 coefficients (the intercept plus 7 predictors), so the model has 8-1=7 degrees of freedom. (3) The error (residual) degrees of freedom are the total DF minus the model DF: 662-7=655.

Sum of Squares - These are the sums of squares associated with the three sources of variance: total, model and residual.

SSTotal: the total variability around the mean, Sum(Y - Ybar)^2.

SSResidual: the sum of squared errors in prediction, Sum(Y - Ypredicted)^2.

SSModel: the improvement in prediction from using the predicted value of Y rather than just the mean of Y. This is the sum of squared differences between the predicted value of Y and the mean of Y, Sum(Ypredicted - Ybar)^2.

Note that SSTotal = SSModel + SSResidual. SSModel / SSTotal is equal to the value of R-Square, the proportion of the variance explained by the independent variables.

Mean Square - These are the Mean Squares, the Sums of Squares divided by their respective DF. They are needed for the F ratio: the Mean Square Model divided by the Mean Square Residual (labeled Error in the table), which tests the significance of the predictors in the model.

F Value and Pr > F - The F-value is the Mean Square Model divided by the Mean Square Residual. The F-value and its p-value are used to answer the question "Do the independent variables predict the dependent variable?". The p-value is compared to your alpha level (typically 0.05) and, if smaller, you can conclude "Yes, the independent variables reliably predict the dependent variable". Here p = 0.0140 < 0.05, so the predictors as a group are statistically significant. Note that this is an overall significance test assessing whether the group of independent variables, when used together, reliably predicts the dependent variable; it does not address the ability of any particular independent variable to predict the dependent variable.
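The ANOVA columns can be re-derived by hand from the sums of squares. The short Python check below (Python rather than SAS, used only as a calculator; the variable names are illustrative) recomputes the degrees of freedom, mean squares, F value and R-Square from the numbers printed in the table. Small differences in the last decimals are expected because the table shows rounded sums of squares.

```python
# Values copied from the Analysis of Variance table.
n_obs = 663
n_coef = 8  # intercept + 7 predictors
ss_model, ss_error, ss_total = 41237, 1521839, 1563076

# Degrees of freedom.
df_total = n_obs - 1
df_model = n_coef - 1
df_error = df_total - df_model

# Mean squares, F ratio and R-Square.
ms_model = ss_model / df_model
ms_error = ss_error / df_error
f_value = ms_model / ms_error
r_square = ss_model / ss_total

print("DF:", df_model, df_error, df_total)
print("Mean squares:", round(ms_model, 2), round(ms_error, 2))
print("F:", round(f_value, 2), "R-Square:", round(r_square, 4))
```

The p-value itself (Pr > F) needs the F distribution and is left to SAS.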

Root MSE          48.20185   R-Square   0.0264
Dependent Mean   201.69683   Adj R-Sq   0.0160
Coeff Var         23.89817

Root MSE - Root MSE is the standard deviation of the error term, and is the square root of the Mean Square Residual (or Error).

Dependent Mean - This is the mean of the dependent variable.

Coeff Var - This is the coefficient of variation, which is a unit-less measure of variation in the data. It is the root MSE divided by the mean of the dependent variable, multiplied by 100: 100*(48.20/201.70) = 23.90.

R-Square - This is how much of the variability in the dependent variable is explained by the model (0.0264, or about 2.6%, here).
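These fit statistics follow directly from the ANOVA table. The sketch below (Python as a calculator; names are illustrative) recomputes Root MSE and Coeff Var as described above. The adjusted R-Square line uses the standard formula 1 - (1 - R2)(n - 1)/(n - p - 1), which the slides do not state, so treat that part as an added assumption.

```python
import math

# Values copied from the PROC REG output.
ms_error = 2323.41811        # Mean Square Error
dep_mean = 201.69683         # Dependent Mean
r_square = 41237 / 1563076   # SSModel / SSTotal
n_obs, n_pred = 663, 7       # observations, predictors (excluding intercept)

root_mse = math.sqrt(ms_error)
coeff_var = 100 * root_mse / dep_mean

# Standard adjusted R-Square formula (not given on the slides).
adj_r_square = 1 - (1 - r_square) * (n_obs - 1) / (n_obs - n_pred - 1)

print(round(root_mse, 5), round(coeff_var, 5), round(adj_r_square, 4))
```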

Parameter Estimates

                 Parameter    Standard
Variable    DF    Estimate       Error   t Value   Pr > |t|
Intercept    1   187.91927    17.45409     10.77     <.0001
sex          1     1.48640     3.79640      0.39     0.6955
typeA        1     0.74839     4.01841      0.19     0.8523
typeB        1    10.14482     6.97339      1.45     0.1462
typeAB       1   -19.90314    10.45833     -1.90     0.0575
Age_old      1   -11.61798     3.85823     -3.01     0.0027
RBC          1     0.00264     0.00191      1.38     0.1676
WBC          1     0.20512     1.88816      0.11     0.9135

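t Value and Pr > |t| - These columns provide the t-value and two-tailed p-value used in testing the null hypothesis that the coefficient/parameter is 0. At the 0.05 level, only Age_old (p = 0.0027) differs significantly from 0 among the predictors; typeAB (p = 0.0575) falls just short.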
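Each t-value is simply the parameter estimate divided by its standard error. The Python sketch below (used only as a calculator; the dictionary name is illustrative) reproduces the t Value column from the Estimate and Standard Error columns of the table.

```python
# (estimate, standard error) copied from the Parameter Estimates table.
estimates = {
    "Intercept": (187.91927, 17.45409),
    "sex": (1.48640, 3.79640),
    "typeA": (0.74839, 4.01841),
    "typeB": (10.14482, 6.97339),
    "typeAB": (-19.90314, 10.45833),
    "Age_old": (-11.61798, 3.85823),
    "RBC": (0.00264, 0.00191),
    "WBC": (0.20512, 1.88816),
}

# t = estimate / standard error, rounded as in the SAS listing.
t_values = {name: round(est / se, 2) for name, (est, se) in estimates.items()}

for name, t in t_values.items():
    print(f"{name:10s} {t:7.2f}")
```

The two-tailed p-values come from the t distribution with the error degrees of freedom (655) and are left to SAS.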

Logistic regression

• For binary response models, the response, Y, of an individual or an experimental unit can take on one of two possible values, denoted for convenience by 1 and 0 (for example, Y=1 if a disease is present, otherwise Y=0). Suppose x is a vector of explanatory variables and p = P(Y=1 | x) is the response probability to be modeled. The logistic regression model has the form

logit(p) = log[ p / (1-p) ] = β0 + β1*x1 + ... + βk*xk
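The logit maps a probability to the log-odds scale, and its inverse maps a linear predictor back to a probability. A minimal Python sketch of the two functions (the names logit and inv_logit are illustrative, not SAS functions):

```python
import math

def logit(p):
    """Log-odds of a probability p, with 0 < p < 1."""
    return math.log(p / (1 - p))

def inv_logit(eta):
    """Probability corresponding to a linear predictor eta."""
    return 1 / (1 + math.exp(-eta))

# A probability of 0.5 has log-odds 0, and the two functions undo each other.
print(logit(0.5), inv_logit(0.0), inv_logit(logit(0.8)))
```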

Proc logistic

The following statements are available in PROC LOGISTIC:

PROC LOGISTIC < options >;
   BY variables ;
   CLASS variable ;
   MODEL response = < effects > < / options >;
   MODEL events/trials = < effects > < / options >;
   OUTPUT < OUT=SAS-data-set > < keyword=name...keyword=name > / < option >;

The PROC LOGISTIC and MODEL statements are required; only one MODEL statement can be specified. The CLASS statement (if used) must precede the MODEL statement.

High school data

• The data were collected on 200 high school students, with measurements on various tests, including science, math, reading and social studies.

• The response variable is high writing test score (high_write), where a writing score greater than or equal to 60 is considered high and a score below 60 is considered low.

• We explore its relationship with gender, reading test score (read), and science test score (science).

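/* Create the binary outcome: 1 if write >= 60, else 0. */
data new;
   set d.hsb2;
   if write>=60 then high_write=1; else high_write=0;
   keep ID female math read science write high_write;
run;

/* descending: model the probability that high_write = 1. */
proc logistic data=new descending;
   model high_write = female read science;
run;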

Logistic output

Model Information

Data Set                      WORK.NEW
Response Variable             high_write
Number of Response Levels     2
Model                         binary logit
Optimization Technique        Fisher's scoring

Number of Observations Read   200
Number of Observations Used   200

Data Set - This is the data set used in this procedure.

Model - This is the type of regression model that was fit to our data. The terms logit and logistic are interchangeable.

Response Profile

Ordered                    Total
  Value   high_write   Frequency
      1            1          53
      2            0         147

Probability modeled is high_write=1.

Ordered Value - This refers to how SAS orders the levels of the dependent variable. By default SAS models the lower level. Because we specified the descending option, SAS treats the levels in descending order (high to low), so that when the regression coefficients are estimated, a positive coefficient corresponds to a positive relationship with high write status.

Probability modeled is high_write=1 - This is a note informing us which level of the response variable we are modeling.

Model Convergence Status

Convergence criterion (GCONV=1E-8) satisfied.

Model Fit Statistics

                Intercept    Intercept and
Criterion            Only       Covariates
AIC               233.289          168.236
SC                236.587          181.430
-2 Log L          231.289          160.236

Model Convergence Status - This describes whether the maximum-likelihood algorithm has converged, and which convergence criterion is used to assess convergence.

Intercept Only - The model with no predictors, just the intercept term.

Intercept and Covariates - The fitted model.

Model Fit Statistics - These are various measurements used to assess the model fit. The smaller the value, the better the fit.
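The three criteria are linked through -2 Log L. The Python sketch below (a calculator only) rebuilds AIC and SC using the usual definitions, AIC = -2 Log L + 2k and SC = -2 Log L + k*log(n), where k is the number of estimated parameters and n the number of observations. These definitions are not spelled out on the slides, so take them as an assumption that the printed values happen to confirm. The drop in -2 Log L between the two columns is the likelihood ratio chi-square of the global test.

```python
import math

n_obs = 200
neg2ll_null, neg2ll_full = 231.289, 160.236   # -2 Log L from the table
k_null, k_full = 1, 4                         # intercept only; intercept + 3 predictors

def aic(neg2ll, k):
    return neg2ll + 2 * k

def sc(neg2ll, k, n):
    return neg2ll + k * math.log(n)

aic_null, aic_full = aic(neg2ll_null, k_null), aic(neg2ll_full, k_full)
sc_null, sc_full = sc(neg2ll_null, k_null, n_obs), sc(neg2ll_full, k_full, n_obs)
lr_chi_square = neg2ll_null - neg2ll_full

print(round(aic_null, 3), round(aic_full, 3))
print(round(sc_null, 3), round(sc_full, 3))
print(round(lr_chi_square, 3))
```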

Testing Global Null Hypothesis: BETA=0

Test               Chi-Square   DF   Pr > ChiSq
Likelihood Ratio      71.0525    3       <.0001
Score                 58.6092    3       <.0001
Wald                  39.8751    3       <.0001

These are three asymptotically equivalent Chi-Square tests. They test the null hypothesis that all of the predictors' regression coefficients are equal to zero in the model. With p < 0.0001, we reject Ho and conclude that at least one of the predictors' regression coefficients is not equal to zero.

Analysis of Maximum Likelihood Estimates

                              Standard         Wald
Parameter   DF   Estimate       Error   Chi-Square   Pr > ChiSq
Intercept    1   -12.7772      1.9759      41.8176       <.0001
female       1     1.4825      0.4474      10.9799       0.0009
read         1     0.1035      0.0258      16.1467       <.0001
science      1     0.0948      0.0305       9.6883       0.0019

Here are the parameter estimates along with their p-values. Based on the estimates, our model is log[ p / (1-p) ] = -12.78 + 1.48*female + 0.10*read + 0.09*science.
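To see what the fitted equation implies, it can be evaluated for a particular student. The Python sketch below uses the unrounded estimates from the table; the student (female, read = 60, science = 60) is a made-up illustration, not a case from the data set.

```python
import math

# Estimates copied from the Analysis of Maximum Likelihood Estimates table.
coef = {"Intercept": -12.7772, "female": 1.4825, "read": 0.1035, "science": 0.0948}

def predicted_probability(female, read, science):
    """P(high_write = 1) under the fitted model."""
    eta = (coef["Intercept"] + coef["female"] * female
           + coef["read"] * read + coef["science"] * science)
    return 1 / (1 + math.exp(-eta))

# Hypothetical students, chosen only for illustration.
p_female = predicted_probability(female=1, read=60, science=60)
p_male = predicted_probability(female=0, read=60, science=60)
print(round(p_female, 3), round(p_male, 3))
```

With the same test scores, the two predicted probabilities differ only through the female coefficient.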

Odds Ratio Estimates

             Point        95% Wald
Effect    Estimate   Confidence Limits
female       4.404     1.832    10.584
read         1.109     1.054     1.167
science      1.099     1.036     1.167

The odds ratio is obtained by exponentiating the Estimate, exp[Estimate]. We can interpret the odds ratio as follows: for a one-unit increase in the predictor variable, the odds of a positive outcome are multiplied by the odds ratio, given that the other variables in the model are held constant. For example, each additional point on the reading test multiplies the odds of a high writing score by 1.109.
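The odds ratios and their Wald limits can be rebuilt from the estimates and standard errors reported for the model. The Python sketch below assumes the usual Wald construction, exp(estimate ± z * SE) with z ≈ 1.96; the last digit can differ slightly from SAS because the inputs are rounded to four decimals.

```python
import math

# (estimate, standard error) copied from the maximum likelihood estimates.
estimates = {
    "female": (1.4825, 0.4474),
    "read": (0.1035, 0.0258),
    "science": (0.0948, 0.0305),
}
Z_95 = 1.959964  # standard normal quantile for a two-sided 95% interval

odds_ratios = {}
for name, (est, se) in estimates.items():
    point = math.exp(est)
    lower = math.exp(est - Z_95 * se)
    upper = math.exp(est + Z_95 * se)
    odds_ratios[name] = (point, lower, upper)
    print(f"{name:8s} {point:7.3f} {lower:7.3f} {upper:7.3f}")
```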

If the 95% confidence interval does not include 1, the effect is statistically significant at the 0.05 level. Here none of the three intervals includes 1.

Weighted Example

• Just as with linear regression, logistic regression allows you to look at the effect of multiple predictors on an outcome.

• Consider the following example: 15- and 16-year-old adolescents were asked if they have ever had sexual intercourse.
   – The outcome of interest is intercourse.
   – The predictors are race (white and black) and gender (male and female).

Example from Agresti, A. Categorical Data Analysis, 2nd ed. 2002.

Here is a table of the data:

                  Intercourse
Race    Gender    Yes     No
White   Male       43    134
        Female     26    149
Black   Male       29     23
        Female     22     36

Raul Cruz-Cano, HLTH653 Spring 2013

Data Set Intercourse

DATA intercourse;
   INPUT white male intercourse count;
   DATALINES;
1 1 1 43
1 1 0 134
1 0 1 26
1 0 0 149
0 1 1 29
0 1 0 23
0 0 1 22
0 0 0 36
;
RUN;

SAS:

• “descending” models the probability that intercourse = 1 (yes) rather than = 0 (no).

• “rsquare” requests a generalized R2 value from SAS. It is a likelihood-based analogue of the R2 from linear regression, not literally the proportion of variance explained.

• “lackfit” requests the Hosmer and Lemeshow Goodness-of-Fit Test. This tells you if the model you have created is a good fit for the data.

PROC LOGISTIC DATA = intercourse descending;
   weight count;
   MODEL intercourse = white male / rsquare lackfit;
RUN;

SAS Output: R2

Interpreting the R2 value

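The R2 value is 0.9907. Read at face value, this would mean that 99.07% of the variability in our outcome (intercourse) is explained by including gender and race in our model. Treat that reading with caution: this is a generalized R2 computed from the likelihood, and because the cell counts enter through WEIGHT, SAS takes the sample size to be the 8 data rows rather than the 462 adolescents, which pushes the value toward 1.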

PROC LOGISTIC Output

The odds of having had intercourse are 1.911 times as high for males as for females, holding race constant.
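As a cross-check outside SAS, the same main-effects model can be fitted to the eight count rows with a few Newton-Raphson steps. The Python sketch below is an independent illustration, not PROC LOGISTIC's algorithm; all function names are made up. If the fit agrees with SAS, the male odds ratio should land close to the 1.911 quoted above. The last lines evaluate the generalized R2, assumed here to be 1 - exp(-LR/n), once with n set to the 8 data rows and once with n set to the 462 respondents, to show how strongly it depends on what is counted as the sample size.

```python
import math

# (white, male, intercourse, count), copied from the DATALINES block.
rows = [
    (1, 1, 1, 43), (1, 1, 0, 134),
    (1, 0, 1, 26), (1, 0, 0, 149),
    (0, 1, 1, 29), (0, 1, 0, 23),
    (0, 0, 1, 22), (0, 0, 0, 36),
]

def solve(a, b):
    """Solve the small linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def log_lik(beta, design):
    """Weighted Bernoulli log-likelihood."""
    total = 0.0
    for x, y, w in design:
        eta = sum(b * xi for b, xi in zip(beta, x))
        total += w * (y * eta - math.log(1 + math.exp(eta)))
    return total

def fit(design, k, iters=25):
    """Newton-Raphson for weighted logistic regression, starting from zero."""
    beta = [0.0] * k
    for _ in range(iters):
        grad = [0.0] * k
        info = [[0.0] * k for _ in range(k)]
        for x, y, w in design:
            eta = sum(b * xi for b, xi in zip(beta, x))
            p = 1 / (1 + math.exp(-eta))
            for i in range(k):
                grad[i] += w * (y - p) * x[i]
                for j in range(k):
                    info[i][j] += w * p * (1 - p) * x[i] * x[j]
        beta = [b + s for b, s in zip(beta, solve(info, grad))]
    return beta

full = [((1, white, male), y, count) for white, male, y, count in rows]
null = [((1,), y, count) for _, _, y, count in rows]

beta = fit(full, 3)        # intercept, white, male
beta_null = fit(null, 1)   # intercept only
or_male = math.exp(beta[2])

lr = 2 * (log_lik(beta, full) - log_lik(beta_null, null))
n_rows = len(rows)
n_people = sum(count for *_, count in rows)
r2_rows = 1 - math.exp(-lr / n_rows)
r2_people = 1 - math.exp(-lr / n_people)

print("odds ratio, male vs female:", round(or_male, 3))
print("generalized R2 with n = rows:", round(r2_rows, 4))
print("generalized R2 with n = respondents:", round(r2_people, 4))
```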

Hosmer and Lemeshow GOF Test

H-L GOF Test

The Hosmer and Lemeshow Goodness-of-Fit Test tests the hypotheses:

Ho: the model is a good fit, vs.
Ha: the model is NOT a good fit

With this test, we want to FAIL to reject the null hypothesis, because that means we have no evidence that the model fits poorly (this is different from most of the hypothesis testing you have seen).

Look for a p-value > 0.10 in the H-L GOF test. This indicates the model is a good fit.

In this case, the p-value = 0.2419, so we do NOT reject the null hypothesis, and we conclude the model is a good fit.

Model Selection in SAS

• Can be applied to both Linear and Logistic Models

• Often, if you have multiple predictors and interactions in your model, SAS can systematically select significant predictors using forward selection, backwards selection, or stepwise selection.

• In forward selection, SAS starts with no predictors in the model. It then selects the predictor with the smallest p-value and adds it to the model. It then selects another predictor from the remaining variables with the smallest p-value and adds it to the model. It continues doing this until no more predictors have p-values less than 0.05.

• In backwards selection, SAS starts with all of the predictors in the model and eliminates the non-significant predictors one at a time, refitting the model between each elimination. It stops once all the predictors remaining in the model are statistically significant.
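The forward selection loop described above can be sketched in a few lines. The Python below is a schematic of the idea only, not SAS's implementation: the function forward_select and the p_value callback are hypothetical, and the p-values in the toy table are invented purely to exercise the loop (they are not SAS output for any data set in this lecture).

```python
def forward_select(candidates, p_value, alpha=0.05):
    """Greedy forward selection.

    p_value(selected, candidate) is a caller-supplied function returning the
    p-value for adding `candidate` to a model that already holds `selected`.
    """
    selected = []
    remaining = list(candidates)
    while remaining:
        # Pick the remaining predictor with the smallest p-value.
        best = min(remaining, key=lambda c: p_value(selected, c))
        if p_value(selected, best) >= alpha:
            break  # nothing left that is significant at alpha
        selected.append(best)
        remaining.remove(best)
    return selected

# Invented p-values keyed by (predictors already in the model, candidate).
toy_p_values = {
    ((), "x1"): 0.001, ((), "x2"): 0.020, ((), "x3"): 0.400,
    (("x1",), "x2"): 0.010, (("x1",), "x3"): 0.300,
    (("x1", "x2"), "x3"): 0.250,
}

chosen = forward_select(
    ["x1", "x2", "x3"],
    lambda selected, cand: toy_p_values[(tuple(selected), cand)],
)
print(chosen)
```

Backwards selection runs the same loop in reverse: start with every predictor and repeatedly drop the one with the largest p-value until all that remain are significant.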

Forward Selection in SAS

We will let SAS select a model for us out of the three predictors: white, male, white*male. Type the following code into SAS:

PROC LOGISTIC DATA = intercourse descending;
   weight count;
   MODEL intercourse = white male white*male / selection = forward lackfit;
RUN;

Output from Forward Selection: “white” is added to the model

“male” is added to the model

No more predictors are found to be statistically significant

The Final Model:

Hosmer and Lemeshow GOF Test: The model is a good fit