6. Multiple regression - PROC GLMpublicifsv.sund.ku.dk/~kach/SAS/6. The general linear...Multiple...
Transcript of 6. Multiple regression - PROC GLMpublicifsv.sund.ku.dk/~kach/SAS/6. The general linear...Multiple...
6. Multiple regression - PROC GLM
Karl B Christensenhttp://192.38.117.59/~kach/SAS
Karl B Christensenhttp://192.38.117.59/~kach/SAS 6. Multiple regression - PROC GLM
Contents
Analysis of covariance (ANCOVA)
the general linear model
Interaction
Multiple regression
Automatic variable selection
Karl B Christensenhttp://192.38.117.59/~kach/SAS 6. Multiple regression - PROC GLM
Data example: lung capacity
Data from 32 patients subject to a heart/lung transplantation.TLC (Total Lung Capacity) is determined from whole-bodyplethysmography. Are men and women different with respect tototal lung capacity?
OBS SEX AGE HEIGHT TLC
1 F 35 149 3.40
2 F 11 138 3.41
3 M 12 148 3.80
. . . . .
. . . . .
29 F 20 162 8.05
30 M 25 180 8.10
31 M 22 173 8.70
32 M 25 171 9.45
Karl B Christensenhttp://192.38.117.59/~kach/SAS 6. Multiple regression - PROC GLM
Box plots for comparison of sex groups
PROC GPLOT DATA=TLCdata;
PLOT tlc*sex / HAXIS=AXIS1 VAXIS=AXIS2;
AXIS1 LABEL=(H=3) VALUE=(H=2) OFFSET =(6,6)CM;
AXIS2 LABEL=(H=3 A=90) VALUE =(H=2);
SYMBOL1 V=CIRCLE H=2 I=BOX10TJ W=3;
RUN; QUIT;
Karl B Christensenhttp://192.38.117.59/~kach/SAS 6. Multiple regression - PROC GLM
Box plots for comparison of sex groups
Karl B Christensenhttp://192.38.117.59/~kach/SAS 6. Multiple regression - PROC GLM
Group comparisons
Using t-tests1
PROC TTEST DATA=tlc;
CLASS sex;
VAR tlc height;
RUN;
Output
T-Tests
Variable Method Variances DF t Value Pr > |t|
TLC Pooled Equal 30 -3.67 0.0009
TLC Satterthwaite Unequal 29.7 -3.67 0.0009
Height Pooled Equal 30 -3.73 0.0008
Height Satterthwaite Unequal 29.5 -3.73 0.0008
Obvious sex difference for TLC as well as for Height
1Note that we can specify more than one variable in the VAR statement.Karl B Christensenhttp://192.38.117.59/~kach/SAS 6. Multiple regression - PROC GLM
Confounding when comparing groups
Occurs if the distributions of some other relevant explanatoryvariables differ between the groups. Here “relevant” meansthings we would have liked to be the same (or at least verysimilar) for everybody, because we think of it as noise ordistortion.
Can be reduced by performing a regression analysis with therelevant variables as covariates.
Confounding could be a problem in the current example, if weintended to compare the lung function between men andwomen of similar height
Karl B Christensenhttp://192.38.117.59/~kach/SAS 6. Multiple regression - PROC GLM
Relation between TLC and HEIGHT
PROC GPLOT DATA=TLCdata;
PLOT tlc*height=sex / HAXIS=AXIS1 VAXIS=AXIS2;
AXIS1 LABEL=(H=4) VALUE =(H=3) MINOR=NONE;
AXIS2 LABEL=(A=90 H=4) VALUE=(H=3) ORDER =(3 TO 10) MINOR=NONE;
SYMBOL1 C=RED V=DOT H=2 I=SM75S L=1 W=3 MODE=INCLUDE;
SYMBOL2 C=BLUE V=CIRCLE H=2 I=SM75S L=41 W=3 MODE=INCLUDE;
LEGEND1 LABEL =(H=2.5) VALUE =(H=2 JUSTIFY=LEFT);
RUN; QUIT;
Karl B Christensenhttp://192.38.117.59/~kach/SAS 6. Multiple regression - PROC GLM
Relation between TLC and HEIGHT
���
������
������
�� �� �� �� � � ��
��� � �
���
������
������
�� �� �� �� � � ��
��� � �
(Plotted using I=RL)
Karl B Christensenhttp://192.38.117.59/~kach/SAS 6. Multiple regression - PROC GLM
Analysis of covariance
Comparison of parallel regression lines
Model: ygi = αg + βxgi + εgi g = 1, 2; i = 1, · · · , ngHere α2 − α1 is the expected difference in the responsebetween the two groups for fixed value of the covariate, thatis, when comparing any two subjects who have the same valueof (match on) the covariate x (“adjusted for x”).
Karl B Christensenhttp://192.38.117.59/~kach/SAS 6. Multiple regression - PROC GLM
But what if the lines are not parallel?More general model: ygi = αg + βgxgi + εgi
If β1 6= β2 there is an interaction between Height and Sex
Karl B Christensenhttp://192.38.117.59/~kach/SAS 6. Multiple regression - PROC GLM
Interaction
Interaction between Height and Sex
The effect of height depends on sex
The difference between men and women depends on height
Karl B Christensenhttp://192.38.117.59/~kach/SAS 6. Multiple regression - PROC GLM
Model with interaction
PROC REG only works for linear covariates. Group variables can behandled directly in PROC GLM by specifying the group variable as aCLASS variable.
PROC GLM DATA=TLCdata;
CLASS sex;
MODEL tlc=sex height sex*height / SOLUTION;
RUN; QUIT;
The option SOLUTION is needed if we want to see the regressionparameter estimates.
Karl B Christensenhttp://192.38.117.59/~kach/SAS 6. Multiple regression - PROC GLM
PROC GLM output
The GLM Procedure
Class Level Information
Class Levels Values
Sex 2 F M
Number of Observations Read 32
Number of Observations Used 32
Dependent Variable: TLC
Sum of
Source DF Squares Mean Square F Value Pr > F
Model 3 42.81845030 14.27281677 10.28 <.0001
Error 28 38.89354970 1.38905535
Corrected Total 31 81.71200000
R-Square Coeff Var Root MSE TLC Mean
0.524017 19.36069 1.178582 6.087500
Karl B Christensenhttp://192.38.117.59/~kach/SAS 6. Multiple regression - PROC GLM
PROC GLM output
Source DF Type I SS Mean Square F Value Pr > F
Sex 1 25.31161250 25.31161250 18.22 0.0002
Height 1 17.48233164 17.48233164 12.59 0.0014
Height*Sex 1 0.02450616 0.02450616 0.02 0.8953
Source DF Type III SS Mean Square F Value Pr > F
Sex 1 0.07951043 0.07951043 0.06 0.8127
Height 1 17.36061701 17.36061701 12.50 0.0014
Height*Sex 1 0.02450616 0.02450616 0.02 0.8953
The interaction is not significant.
The Type III p-values for the two main effects should neverbe used for anything in a model including the interaction!
Karl B Christensenhttp://192.38.117.59/~kach/SAS 6. Multiple regression - PROC GLM
PROC GLM output
Standard
Parameter Estimate Error t Value Pr > |t|
Intercept -5.827971333 B 4.97706299 -1.17 0.2515
Sex F -1.727664141 B 7.22116113 -0.24 0.8127
Sex M 0.000000000 B . . .
Height 0.073564647 B 0.02854339 2.58 0.0155
Height*Sex F 0.005743619 B 0.04324220 0.13 0.8953
Height*Sex M 0.000000000 B . . .
These are the regression parameters
Karl B Christensenhttp://192.38.117.59/~kach/SAS 6. Multiple regression - PROC GLM
Where are the two lines in the output?
Line for males (the reference group):
TLC = -5.828 + 0.07356 × Height
Line for females:
TLC = −5.828 + (−1.727) + (0.07356 + 0.00574)× Height
= −7.555 + 0.07930× Height
Karl B Christensenhttp://192.38.117.59/~kach/SAS 6. Multiple regression - PROC GLM
Same model, new parameterization
PROC GLM DATA=TLCdata;
CLASS sex;
MODEL tlc=sex sex*height / NOINT SOLUTION;
RUN; QUIT;
Output (edited)
Standard
Parameter Estimate Error t Value Pr > |t|
Sex F -7.555635475 5.23201797 -1.44 0.1598
Sex M -5.827971333 4.97706299 -1.17 0.2515
Height*Sex F 0.079308266 0.03248326 2.44 0.0212
Height*Sex M 0.073564647 0.02854339 2.58 0.0155
Karl B Christensenhttp://192.38.117.59/~kach/SAS 6. Multiple regression - PROC GLM
Two different parameterizations
PROC GLM DATA=TLCdata;
CLASS sex;
MODEL tlc=sex height sex*height / SOLUTION;
RUN; QUIT;
(extrapolated) level at Height=0 for reference group
(extrapolated) difference between groups at Height=0
An effect of Height (slope) for the reference group
The difference between the slopes for the two sexes
PROC GLM DATA=TLCdata;
CLASS sex;
MODEL tlc=sex sex*height / NOINT SOLUTION;
RUN; QUIT;
The (extrapolated) level at Height=0 for each group
The effect of Height (the slope) for each group
Karl B Christensenhttp://192.38.117.59/~kach/SAS 6. Multiple regression - PROC GLM
Model without interaction
No indication of interaction, we omit the term
PROC GLM DATA=TLCdata;
CLASS sex;
MODEL tlc=sex height / SOLUTION CLPARM;
RUN; QUIT;
Are there also other possible parameterizations in this model? (andwhich one should we use?)
Karl B Christensenhttp://192.38.117.59/~kach/SAS 6. Multiple regression - PROC GLM
Source DF Type I SS Mean Square F Value Pr > F
Sex 1 25.31161250 25.31161250 18.86 0.0002
Height 1 17.48233164 17.48233164 13.03 0.0011
Source DF Type III SS Mean Square F Value Pr > F
Sex 1 3.24523555 3.24523555 2.42 0.1308
Height 1 17.48233164 17.48233164 13.03 0.0011
Note: The effect of sex seen in the group comparison hasdisappeared!!
Karl B Christensenhttp://192.38.117.59/~kach/SAS 6. Multiple regression - PROC GLM
Model without interaction - results
Standard
Parameter Estimate Error t Value Pr > |t|
Intercept -6.263569903 B 3.67983781 -1.70 0.0994
Sex F -0.770859760 B 0.49571132 -1.56 0.1308
Sex M 0.000000000 B . . .
Height 0.076067188 0.02107532 3.61 0.0011
Parameter 95% Confidence Limits
Intercept -13.78968328 1.262543472
Sex F -1.784703236 0.242983716
Sex M . .
Height 0.032963311 0.119171065
Karl B Christensenhttp://192.38.117.59/~kach/SAS 6. Multiple regression - PROC GLM
Confounding?
In this example it seems that
1 The observed difference in lung capacity between men andwomen can be explained by height differences
2 However, there may still be a sex difference for persons of thesame height (women vs. men), estimated as−0.77± 2× 0.50 = (−1.78, 0.24)
Karl B Christensenhttp://192.38.117.59/~kach/SAS 6. Multiple regression - PROC GLM
But. . .
what if we did not have the two very short men to pull the line forthe men? Let us look at the subjects above 152 cm (using thestatement WHERE height>152; in PROC GLM). Test of interaction:
Source DF Type I SS Mean Square F Value Pr > F
Sex 1 22.10748067 22.10748067 19.92 0.0002
Height 1 0.25519165 0.25519165 0.23 0.6361
Height*Sex 1 2.76108429 2.76108429 2.49 0.1284
Estimated additive effects
Parameter Estimate 95% Confidence Limits Pr > |t|
Intercept 10.53318707 B -3.47974795 24.54612210 0.1339
Sex F -2.04829071 B -3.40956535 -0.68701607 0.0048
Sex M 0.00000000 B . . .
Height -0.01778053 -0.09665451 0.06109345 0.6459
A somewhat different conclusion. . .
Karl B Christensenhttp://192.38.117.59/~kach/SAS 6. Multiple regression - PROC GLM
Plots for model checking in the HTML output:
ODS GRAPHICS ON;
PROC GLM DATA=TLCdata PLOTS=( DIAGNOSTICS RESIDUALS(SMOOTH ));
CLASS sex;
MODEL tlc=sex height sex*height / SOLUTION;
OUTPUT OUT=WithResid RSTUDENT=NormResidWithoutCurrent;
RUN; QUIT;
PROC GPLOT DATA=WithResid;
PLOT NormResidWithoutCurrent * sex;
SYMBOL1 V=CIRCLE H=2 I=BOX10TJ W=3;
RUN; QUIT;
In addition to the ODS GRAPHICS plots for PROC GLM, residualsshould be plotted against each of the CLASS variables (here sex) inorder to check variance homogeneity
Karl B Christensenhttp://192.38.117.59/~kach/SAS 6. Multiple regression - PROC GLM
Exercise: Another look the Juul data
1 Get the data into SAS using a libname statement.
2 Create a new data set including only individuals above 25years, and make a new variable with log-transformed SIGF1.
3 Use PROC GPLOT to plot the relationship between age andlog-transformed SIGF-I.
4 Make separate regression lines for men and women.
5 Do a regression analysis to explore if slopes are equal in menand women.
6 Give an estimate for the difference in slopes, with 95%confidence interval.
Karl B Christensenhttp://192.38.117.59/~kach/SAS 6. Multiple regression - PROC GLM
Multiple regression. General linear model (GLM).
Data: n sets of observations, made on the same ’unit’:
unit x1....xp y
1 x11....x1p y12 x21....x2p y23 x31....x3p y3. . . . . . . .n xn1....xnp yn
The linear regression model with p explanatory variables(covariates) is written:
y = β0 + β1x1 + · · ·+ βpxp + ε
Karl B Christensenhttp://192.38.117.59/~kach/SAS 6. Multiple regression - PROC GLM
Interpretation of regression coefficients
ModelYi = β0 + β1Xi1 + β2Xi2 + ...+ βpXip + ε
where ε ∼ N(0, σ2). Consider two subjects:A has covariate values (X1,X2, . . . ,Xp)B has covariate values (X1 + 1,X2, . . . ,Xp)Expected difference in the response (B − A)
[β0 + β1(X1 + 1) + β2Xi2 + ...]− [β0 + β1X1 + β2Xi2 + ...] = β1
This means that β1 is the effect of one unit’s difference in X1 forfixed levels of the other variables (X2, . . . ,Xp)
Karl B Christensenhttp://192.38.117.59/~kach/SAS 6. Multiple regression - PROC GLM
School-age obesity data
School-age obesity score versus height and weight measured at 1year of age
Obs Obesity Height1 Weight1
1 -0.06967 79 11.70
2 -0.79982 72 9.55
3 2.67337 76 9.95
. . . .
. . . .
196 0.47968 78 10.60
197 -0.61818 77 10.10
Karl B Christensenhttp://192.38.117.59/~kach/SAS 6. Multiple regression - PROC GLM
School-age obesity data
Karl B Christensenhttp://192.38.117.59/~kach/SAS 6. Multiple regression - PROC GLM
SAS code
PROC REG DATA=SchoolObesity;
MODEL Obesity = Height1 Weight1 / CLB;
RUN; QUIT;
(part of the) output
Parameter Estimates
Parameter Standard
Variable DF Estimate Error t Value Pr > |t|
Intercept 1 0.18668 1.16769 0.16 0.8731
Height1 1 -0.06644 0.02163 -3.07 0.0024
Weight1 1 0.47653 0.07097 6.71 <.0001
Parameter Estimates
Variable DF 95% Confidence Limits
Intercept 1 -2.11631 2.48967
Height1 1 -0.10910 -0.02379
Weight1 1 0.33656 0.61650
Karl B Christensenhttp://192.38.117.59/~kach/SAS 6. Multiple regression - PROC GLM
Interpretation of regression parameters
Remember that βj is the effect of the j’th explanatory variable,corrected for the effect of the other explanatory variables, that is,when comparing any two subject who match on all the othervariables.
The effect of Height1 corrected for the effect of Weight1 isfound to be β̂1 = −0.066 (95% CI: −0.109 to −0.024),p = 0.0024
In the univariate model without correction for Weight1 we gotβ̂1 = +0.048 (95% CI: +0.019 to +0.077), p = 0.0014
Karl B Christensenhttp://192.38.117.59/~kach/SAS 6. Multiple regression - PROC GLM
Interpretation of regression parameters
The parameter for height answers two different questionsdepending on whether or not adjusted for weight:
Unadj. ’Are big 1-year-old children generally fatter during schoolage?’
Adj. ’Are slim 1-year-old children generally slimmer during schoolage?’
Both questions are relevant and both answers are valid!
Karl B Christensenhttp://192.38.117.59/~kach/SAS 6. Multiple regression - PROC GLM
Relative effects and products or ratios of covariates
Both issues are solved by log-transforming the covariate(s)!Example: BMI = Weight/Height2 is a ratio measure. Logarithmicrules give
log(BMI) = log(Weight) − 2·log(Height)
so β·log(BMI) = β·log(Weight) −2β·log(Height)Choice of log-transformation of covariates
Use of log10 means that the regression parameter shows theeffect of two subjects differing by a factor 10. Do not uselog10 unless it is likely for two subjects to differ by a factor 10!
Use log2 [SAS code: LOG2(·)] when doubling is likely.
Use a covariate calculated as
XX=LOG(·)/LOG(1.1)
if 10% differences are likely.
Karl B Christensenhttp://192.38.117.59/~kach/SAS 6. Multiple regression - PROC GLM
BMI at age 1 appropriate predictor for school-age obesity?
1 BMI is a ratio measure involving weight and height, so weshould investigate log-transformed weight and height
2 Doubling is not a realistic difference, so we look at “per 10%”
DATA School1;
SET SchoolObesity;
HeightPer10pct = LOG(Height1 )/LOG (1.1);
WeightPer10pct = LOG(Weight1 )/LOG (1.1);
RUN;
PROC REG DATA=School1;
MODEL Obesity = HeightPer10pct WeightPer10pct / CLB;
TestBMI: TEST HeightPer10pct = -2* WeightPer10pct;
RUN; QUIT;
Karl B Christensenhttp://192.38.117.59/~kach/SAS 6. Multiple regression - PROC GLM
Part of output
Parameter Estimates
Parameter Standard
Variable DF Estimate Error t Value Pr > |t| 95% Conf. Limits
Intercept 1 9.80370 5.95742 1.65 0.1015 -1.94593 21.55334
HeightPer10pct 1 -0.45993 0.15673 -2.93 0.0037 -0.76904 -0.15082
WeightPer10pct 1 0.45679 0.06714 6.80 <.0001 0.32437 0.58922
Test TestBMI Results for Dependent Variable Obesity
Mean
Source DF Square F Value Pr > F
Numerator 1 15.72218 20.34 <.0001
Denominator 194 0.77312
Karl B Christensenhttp://192.38.117.59/~kach/SAS 6. Multiple regression - PROC GLM
Conclusion:
1 10% higher weight increases the expected school-age obesityscore by 0.456 (95% CI: 0.324 - 0.589),
2 10% lower height increases the expected school-age obesityscore by 0.460 (95% CI: 0.151 - 0.769),
3 BMI at age 1 year is not an appropriate choice (p < 0.0001).
4 Since the regression parameters for the log-transformed weightand height are of the same size, but with opposite signs, anappropriate predictor would be the ratio weight/height
Karl B Christensenhttp://192.38.117.59/~kach/SAS 6. Multiple regression - PROC GLM
Model selection
Lung function - 25 patients with cystic fibrosis2
2O’Neill et al (1983).Karl B Christensenhttp://192.38.117.59/~kach/SAS 6. Multiple regression - PROC GLM
Which covariates have a univariate effect on the outcome PEmax?
Are these the variables to be included in the model?
Karl B Christensenhttp://192.38.117.59/~kach/SAS 6. Multiple regression - PROC GLM
Model with all covariates
PROC REG DATA=pemax;
MODEL pemax=age sex height weight bmp fev1 rv frc tlc;
RUN; QUIT;
output
Parameter Standard
Variable DF Estimate Error t Value Pr > |t|
Intercept 1 176.05821 225.89116 0.78 0.4479
age 1 -2.54196 4.80170 -0.53 0.6043
sex 1 -3.73678 15.45982 -0.24 0.8123
height 1 -0.44625 0.90335 -0.49 0.6285
weight 1 2.99282 2.00796 1.49 0.1568
bmp 1 -1.74494 1.15524 -1.51 0.1517
fev1 1 1.08070 1.08095 1.00 0.3333
rv 1 0.19697 0.19621 1.00 0.3314
frc 1 -0.30843 0.49239 -0.63 0.5405
tlc 1 0.18860 0.49974 0.38 0.7112
No significant effects. . .
Karl B Christensenhttp://192.38.117.59/~kach/SAS 6. Multiple regression - PROC GLM
Automatic variable selection: Forward selection
Start with no covariates. In every step, add the most significantvariable
PROC REG DATA=pemax;
MODEL pemax=age sex height weight bmp fev1 rv frc tlc
/ SELECTION=FORWARD;
RUN; QUIT;
Final model: Weight BMP FEV1
Karl B Christensenhttp://192.38.117.59/~kach/SAS 6. Multiple regression - PROC GLM
Automatic variable selection: Backward elimination
Start with all covariates. At each step, omit the least significantvariable
PROC REG DATA=pemax;
MODEL pemax=age sex height weight bmp fev1 rv frc tlc
/ SELECTION=BACKWARD;
RUN; QUIT;
Final model: Weight BMP FEV1
Karl B Christensenhttp://192.38.117.59/~kach/SAS 6. Multiple regression - PROC GLM
But. . .
There is no guarantee that these automatic methods will give usthe same result:
Had observation no. 25 not been in the data set, backwardelimination would have excluded Height as the first variable,while forward selection would have included Height as the firstvariable!
A ’best’ automatic method has not been identified, butbackward elimination is often recommended over forwardselection.
WARNING: Output from selected model does not take modelselection uncertainty into account: The output (regressioncoefficients and p-values) is identical to what would have beenobtained had we fitted the final model with out doing anymodel selection. The importance of the selected covariates isover-estimated!
Karl B Christensenhttp://192.38.117.59/~kach/SAS 6. Multiple regression - PROC GLM