Analysis of variance and regression
Other types of regression models
Other types of regression 1
Response with only two categories
• Odds ratio and risk ratio
• Quantitative explanatory variable
• More than one variable
• Logistic regression
• Case-control designs
Other types of regression 2
Other types of regression models
• Generalised linear models (not the same as General
Linear Models)
• Ordinal data: Proportional odds model (ordinal
regression)
• Counts: Poisson model
• Survival analysis (censored, time-to-event data): Cox
proportional hazards model
• (Other types of censored data)
Other types of regression 3
Example
Use of contraception at first intercourse
Data collected in a study of risk factors for cervical neoplasia (Susanne Kruger Kjær)
Cohort of 11088 women aged 20 to 29, inhabitants of Copenhagen in 1993.
Among other things, the women were asked which type of contraception, if any, they used at their first sexual intercourse. A sub-study (Edith Svare et al., 2002) investigated predictors for no use of contraception at the first intercourse among 10839 women (244 virgins and 5 non-responders excluded).
Other types of regression 4
Contraception at 1st intercourse and smoking
Does smoking status at the age at first intercourse matter?
              Use of contraception
Smoker       No      Yes    Total
Ever       1199     2609     3808
Never      1505     5526     7031
Total      2704     8135    10839
Chi-square test: χ2(1)=134, p<.0001
Quantification of the effect:
Risk ratio: (1199/3808) / (1505/7031) = 0.315/0.214 = 1.47
Odds ratio: (1199/2609) / (1505/5526) = (1199·5526) / (2609·1505) = 1.69
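The two ratios can be recomputed directly from the 2×2 table. The following is a minimal Python sketch, not part of the original SAS material; the dictionary and function names are illustrative only.

```python
# Counts from the slide: women who did NOT / did use contraception,
# split by smoking status at first intercourse.
no_contra = {"ever": 1199, "never": 1505}
contra = {"ever": 2609, "never": 5526}


def risk(group):
    """Proportion without contraception in the given smoking group."""
    return no_contra[group] / (no_contra[group] + contra[group])


def odds(group):
    """Odds of no contraception (no/yes) in the given smoking group."""
    return no_contra[group] / contra[group]


risk_ratio = risk("ever") / risk("never")
odds_ratio = odds("ever") / odds("never")
print(f"RR = {risk_ratio:.2f}, OR = {odds_ratio:.2f}")
```

Both ratios compare ever smokers with never smokers, with "no contraception" as the event.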
Other types of regression 5
Odds ratio (OR)
• Odds are the ratio between the number of events of the two types, e.g., sick/healthy
In probabilities:
odds = P(sick) / P(healthy) = P(sick) / (1 − P(sick))
• OR=2 implies a doubling of the number of sick individuals for very-low-risk populations (for rare diseases OR ≈ RR) and a halving of the number of healthy individuals for very-high-risk populations [RR doesn’t work well for high-risk populations]
Other types of regression 6
Odds ratio (OR) and risk ratio (RR)
• – RR is limited by the fact that the risk (probability) cannot exceed 1. This may lead to problems if both high-risk and low-risk populations exist.
– OR may vary freely among all positive numbers, no matter the proportion of diseased in the reference group.
• RR is always between 1 and OR.
• – OR is symmetric. If the opposite event is in focus, OR is simply inverted:
OR(without contraception) = 1/OR(with contraception)
– RR is asymmetric, so which outcome to model must be decided based on substantive arguments (impossible for the present example)
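To make the relation between OR and RR concrete, the sketch below converts a fixed OR into the corresponding RR at different baseline risks. It is an illustration only: the function name and the baseline risks 0.01 and 0.9 are chosen here and do not come from the study.

```python
def rr_from_or(odds_ratio, baseline_risk):
    """Risk ratio implied by an odds ratio at a given reference-group risk."""
    baseline_odds = baseline_risk / (1 - baseline_risk)
    exposed_odds = odds_ratio * baseline_odds
    exposed_risk = exposed_odds / (1 + exposed_odds)
    return exposed_risk / baseline_risk


# Same OR = 2 in a very-low-risk and a very-high-risk population.
rr_low_risk = rr_from_or(2, 0.01)
rr_high_risk = rr_from_or(2, 0.9)
print(rr_low_risk, rr_high_risk)
```

In the low-risk population the RR is close to the OR of 2; in the high-risk population it is close to 1, because the risk cannot exceed 1. In both cases the RR lies between 1 and the OR.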
Other types of regression 7
Odds ratio (OR)
Interpretation – beyond the doubling/halving aspect – well . . .
Deeper insight into the exact value of the OR probably requires familiarity with odds; perhaps some gamblers find it trivial . . .
BUT: every statistical analysis intends to describe reality in the simplest possible way. The approximation to reality will never be perfect, but the measure of association should preferably apply acceptably to as many people as possible.
OR has the advantage that “the same effect” may apply to both high-risk and low-risk populations. Thus it is more likely that a single number can be used to estimate the association for all the relevant individuals
Other types of regression 8
Use of contraception at the first intercourse
Birth Contraception
year No Yes Total
1961 36 81 117
1962 292 594 886
1963 348 902 1250
1964 342 861 1203
1965 313 871 1184
1966 317 889 1206
1967 313 864 1177
1968 246 875 1121
1969 200 849 1049
1970 145 694 839
1971 110 484 594
1972 42 171 213
Total 2704 8135 10839
Association with year of birth:
χ2(11) test statistic = 117.435, p<0.001.
Other types of regression 9
Odds of unprotected intercourse
Pattern: Straight line with logarithmic vertical axis.
Mathematically:
• logarithmic axis: ln(odds)
• straight line: ln(odds for birth year X) = a + b·X
ln(OR year x+1 relative to year x) = ln([odds year x+1]/[odds year x]) = ln(odds year x+1) − ln(odds year x) = a + b(x+1) − (a + bx) = b, so OR = exp(b).
P(first intercourse unprotected | birth year X) = exp(a + bX) / (1 + exp(a + bX))
Other types of regression 10
Combining the variables
Pattern: Parallel course of two linear curves with logarithmic vertical axis.
Mathematically:
• linear:
ln(odds for a never smoker born year X) = a[never smoker] + b·X
• Parallel:
ln(odds for an ever smoker born year X) = ln(odds for a never smoker born year X) + c = a[never smoker] + b·X + c
An ever smoker compared to a never smoker born in the same year:
ln(OR) = ln(odds for the ever smoker) − ln(odds for the never smoker) = c,
so OR = exp(c).
Other types of regression 11
Logistic regression model
The response Y is always either 0 or 1. We model the mean, which equals the probability P(Y = 1).
X1, . . . , Xk are the explanatory variables, the “exposures”.
The usual linear approach is not appropriate because probabilities are between 0 and 1, and straight lines will cross these limits (unless they are horizontal ∼ no association). One solution to that problem is to use a linear model after transforming the probability using the logit transformation:
logit(p) = ln(p / (1 − p))
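The logit transformation and its inverse can be sketched in a few lines of Python. This is an illustration with names chosen here (logit, inv_logit); it is not taken from the SAS material.

```python
import math


def logit(p):
    """Map a probability in (0, 1) to the whole real line: ln(p / (1 - p))."""
    return math.log(p / (1 - p))


def inv_logit(eta):
    """Map a linear predictor back to a probability in (0, 1)."""
    return 1 / (1 + math.exp(-eta))


# Any value of the linear predictor gives a valid probability.
for eta in (-5.0, 0.0, 5.0):
    print(eta, inv_logit(eta))
```

Because inv_logit always returns a value strictly between 0 and 1, a straight line on the logit scale can never cross the limits for a probability.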
Other types of regression 12
Logistic regression model cont.
Response and exposure are “linked” using logit:
logit(P(Y = 1 | X1 = x1, . . . , Xk = xk)) = b0 + b1x1 + b2x2 + . . . + bkxk
ln(odds for unprotected 1st intercourse for a never/ever smoker born year X) = a[never smoker] + b·X + c·1{smoker}
Estimation using SAS
DATA praevent; SET praevent; NumberOfTrials=1; RUN;
PROC GENMOD DATA=praevent;
CLASS smoker;
MODEL NoContra/NumberOfTrials = smoker byear
/ DIST=BIN LINK=LOGIT TYPE1 TYPE3 ;
RUN;
[SAS manual: http://support.sas.com/onlinedoc/913/docMainpage.jsp]
Other types of regression 13
Output from PROC GENMOD
The GENMOD Procedure
Model Information
Data Set WORK.PRAEVENT
Distribution Binomial
Link Function Logit
Response Variable (Events) NoContra
Response Variable (Trials) NumberOfTrials
Observations Used 10839
Number Of Events 2704
Number Of Trials 10839
Class Level Information
Class Levels Values
smoker 2 Yes ~No
Criteria For Assessing Goodness Of Fit
Criterion DF Value Value/DF
Deviance 11E3 11936.3513 1.1015
: : : :
: : : :
Log Likelihood -5968.1756
Algorithm converged.
Other types of regression 14
Output from PROC GENMOD
Analysis Of Parameter Estimates
Standard Wald 95% Chi-
Parameter DF Estimate Error Confidence Limits Square Pr>ChiSq
Intercept 1 165.1749 16.0237 133.7691 196.5807 106.26 <.0001
smoker Yes 1 0.5390 0.0457 0.4494 0.6286 139.06 <.0001
smoker ~No 0 0.0000 0.0000 0.0000 0.0000 . .
byear 1 -0.0847 0.0082 -0.1007 -0.0687 107.92 <.0001
Scale 0 1.0000 0.0000 1.0000 1.0000
NOTE: The scale parameter was held fixed.
LR Statistics For Type 1 Analysis
Chi-
Source Deviance DF Square Pr > ChiSq
Intercept 12177.6509
smoker 12046.3475 1 131.30 <.0001
byear 11936.3513 1 110.00 <.0001
LR Statistics For Type 3 Analysis
Chi-
Source DF Square Pr > ChiSq
smoker 1 137.70 <.0001
byear 1 110.00 <.0001
Other types of regression 15
“Translation” of parameter estimates
Association with ever smoking:
OR for ever smoker versus never smoker = exp(0.5390) = 1.71, 95% confidence limits: exp(0.4494) = 1.57 to exp(0.6286) = 1.87.
Association with birth year:
OR per year = exp(-0.0847) = 0.919, 95% confidence limits: exp(-0.1007) = 0.904 to exp(-0.0687) = 0.934
OR per 5 years = exp(5 · (-0.0847)) = exp(-0.4235) = 0.65, 95% confidence limits: exp(5 · (-0.1007)) = exp(-0.5035) = 0.60 to exp(5 · (-0.0687)) = exp(-0.3435) = 0.71
Note: First multiply by 5, then take the exponential
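The translation from the log-odds scale to odds ratios can be checked with a short Python sketch. The estimates and confidence limits are copied from the GENMOD output; the helper name odds_ratio and its units argument are illustrative.

```python
import math


def odds_ratio(estimate, lower, upper, units=1):
    """Exponentiate a log-odds estimate and its confidence limits.

    The scaling by `units` is done on the log scale, before exponentiating.
    """
    return tuple(math.exp(units * value) for value in (estimate, lower, upper))


smoker = odds_ratio(0.5390, 0.4494, 0.6286)
per_year = odds_ratio(-0.0847, -0.1007, -0.0687)
per_5_years = odds_ratio(-0.0847, -0.1007, -0.0687, units=5)

for label, (est, lo, hi) in [("ever smoker", smoker),
                             ("per birth year", per_year),
                             ("per 5 birth years", per_5_years)]:
    print(f"{label}: OR = {est:.3f} ({lo:.3f} to {hi:.3f})")
```

Exponentiating first and then multiplying by 5 would give a different, wrong answer.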
Other types of regression 16
Case-control design
OR = (# diseased exposed / # healthy exposed) / (# diseased unexposed / # healthy unexposed)
= (# diseased exposed · # healthy unexposed) / (# healthy exposed · # diseased unexposed)
= (# diseased exposed / # diseased unexposed) / (# healthy exposed / # healthy unexposed)
is symmetric in exposure and outcome. This is the basis for the case-control studies examining exposure among “cases” and suitable “controls”.
Other types of regression 17
Matched case-control designs
In a frequency-matched case-control design, the matching variable must be included in the model, in the same way as the matching was done (e.g., frequency-matching in two-year intervals → class variable grouping in two-year intervals)
Individually matched case-control sampling designs must be analyzed using conditional logistic regression.
These designs cannot be analyzed using PROC GENMOD; the analyses require programs suitable for survival or “event time” analysis.
Other types of regression 18
Individually matched case-control designs
DATA matched; SET rawdata;
if case=1 then dummy_time=1; * cases ;
if case=0 then dummy_time=2; * controls ;
RUN;
PROC PHREG DATA=matched NOSUMMARY;
MODEL dummy_time*case(0)=exposure;
STRATA matchgrp;
RUN;
Here, the variable dummy_time is set to 1 for cases and 2 for controls to ensure that the “event time” for the controls is later than the “event time” for the case. NOSUMMARY is not necessary, but is included to avoid a print-out of number of cases (“Events”) and number of controls (“Censored”) for each single matched pair/group.
Other types of regression 19
Other types of regression models:
Until now, we have been looking at
• regression for normally distributed data,
where parameters describe
– differences between groups
– expected difference in outcome for one unit’s
difference in an explanatory variable
• regression for binary data, logistic regression,
where parameters describe
– odds ratios for one unit’s difference in an
explanatory variable
Other types of regression 20
Generalised linear models:
Multiple regression models, on a scale suitable for the data:
Mean: M
Link function: g(M) linear in covariates, that is,
g(M) = b0 + b1x1 + · · ·+ bkxk
Some standard distributions (and link functions):
• Normal distribution (link=IDENTITY): the general linear model
• Binomial distribution (link=LOGIT): logistic regression
Other types of regression 21
What about something ’in between’?
• ordered categorical variable with more than 2
categories
(ordinal regression (link=LOGIT))
– degree of pain (none/mild/moderate/serious)
– degree of liver fibrosis
• counts (Poisson distribution (link=LOG))
– number of cancer cases in each municipality per year
– number of positive pneumococcal swabs
Other types of regression 22
Ordinal data, for example, level of pain
• data on a rank (ordered) scale
• distance between response categories is not known / is
undefined
• often an imaginary underlying continuous scale
Covariates are intended to describe the probability for
each response category, and the effect of each covariate is
likely to be a general shift in upwards/downwards
direction in contrast to, for example, decreasing
probabilities of both extremes simultaneously (like a
treatment that stabilizes a condition)
Other types of regression 23
Possibilities based on knowledge so far:
• We can pretend that we are dealing with normally
distributed data
– of course most reasonable, when there are many
response categories, and no floor or ceiling effects
• We may reduce to a two-category outcome and use
logistic regression
– but there are several possible cut points/thresholds
Alternative: Proportional odds
Other types of regression 24
Example on liver fibrosis (degree 0,1,2 or 3),
(Julia Johansen, KKHH)
3 blood markers related to fibrosis:
• ha
• ykl40
• pIIInp
Problem:
What can we say about the degree of fibrosis from the
knowledge of these 3 blood markers?
Other types of regression 25
We start out simple,
with one single blood marker xi for the i’th patient (here: i = 1, . . . , 126).
Proportional odds model, model for ’cumulative logits’:
logit(qik) = log(qik / (1 − qik)) = ak + b × xi,
or, on the original probability scale:
qik = qk(xi) = exp(ak + b xi) / (1 + exp(ak + b xi)), k = 1, 2, 3
Other types of regression 26
Properties of the proportional odds model:
• the odds ratio does not depend on the cut point, only
on the covariates
log( [qk(x1)/(1 − qk(x1))] / [qk(x2)/(1 − qk(x2))] ) = b × (x1 − x2)
• reversing the ordering of the categories only implies
a change of sign for the log odds parameters
Other types of regression 27
The MEANS Procedure
Variable N Mean Std Dev Minimum Maximum
------------------------------------------------------------------
degree_fibr 129 1.4263566 0.9903850 0 3.0000000
ha 128 318.4531250 658.9499624 21.0000000 4730.00
ykl40 129 533.5116279 602.2934049 50.0000000 4850.00
pIIInp 127 13.4149606 12.4887192 1.7000000 70.0000000
------------------------------------------------------------------
Other types of regression 28
We start out using
only the marker HA
Very skewed distributions,
– but we do not demand
anything about these!?
Other types of regression 29
Proportional odds model in SAS:
DATA fibrosis;
INFILE "julia.tal" FIRSTOBS=2;
INPUT id degree_fibr ykl40 pIIInp ha;
IF degree_fibr<0 THEN DELETE;
RUN;
PROC LOGISTIC DATA=fibrosis DESCENDING;
MODEL degree_fibr=ha
/ LINK=LOGIT CLODDS=PL;
RUN;
Other types of regression 30
The LOGISTIC Procedure
Model Information
Data Set WORK.FIBROSIS
Response Variable degree_fibr
Number of Response Levels 4
Number of Observations 128
Model cumulative logit
Optimization Technique Fisher’s scoring
Response Profile
Ordered Total
Value degree_fibr Frequency
1 3 20
2 2 42
3 1 40
4 0 26
Probabilities modeled are cumulated over the lower Ordered Values.
Other types of regression 31
Score Test for the Proportional Odds Assumption
Chi-Square DF Pr > ChiSq
5.1766 2 0.0751
Analysis of Maximum Likelihood Estimates
Standard Wald
Parameter DF Estimate Error Chi-Square Pr > ChiSq
Intercept 3 1 -2.3175 0.3113 55.4296 <.0001
Intercept 2 1 -0.4597 0.2029 5.1349 0.0234
Intercept 1 1 1.0945 0.2334 21.9935 <.0001
ha 1 0.00140 0.000383 13.3099 0.0003
Odds Ratio Estimates
Point 95% Wald
Effect Estimate Confidence Limits
ha 1.001 1.001 1.002
Profile Likelihood Confidence Interval for Adjusted Odds Ratios
Effect Unit Estimate 95% Confidence Limits
ha 1.0000 1.001 1.001 1.002
Other types of regression 32
• The proportional odds assumption is just acceptable
• The scale of the covariate is no good
• Logarithmic transformation?
– We may have influential observations
Other types of regression 33
With a view towards easy interpretation,
we use logarithms with base 2:
DATA fibrosis;
SET fibrosis;
l2ha=LOG2(ha);
RUN;
PROC LOGISTIC DATA=fibrosis DESCENDING;
MODEL degree_fibr=l2ha
/ LINK=LOGIT CLODDS=PL;
RUN;
Other types of regression 34
Score Test for the Proportional Odds Assumption
Chi-Square DF Pr > ChiSq
8.3209 2 0.0156
Standard Wald
Parameter DF Estimate Error Chi-Square Pr > ChiSq
Intercept 3 1 -8.3978 1.0057 69.7251 <.0001
Intercept 2 1 -5.9352 0.8215 52.1932 <.0001
Intercept 1 1 -3.7936 0.7213 27.6594 <.0001
l2ha 1 0.8646 0.1188 52.9974 <.0001
Odds Ratio Estimates
Point 95% Wald
Effect Estimate Confidence Limits
l2ha 2.374 1.881 2.996
Profile Likelihood Confidence Interval for Adjusted Odds Ratios
Effect Unit Estimate 95% Confidence Limits
l2ha 1.0000 2.374 1.899 3.038
Other types of regression 35
Logarithms, yes or no? Results when using both:
PROC LOGISTIC DATA=fibrosis DESCENDING;
MODEL degree_fibr=l2ha ha
/ LINK=LOGIT;
RUN;
Analysis of Maximum Likelihood Estimates
Standard Wald
Parameter DF Estimate Error Chi-Square Pr > ChiSq
Intercept 3 1 -10.6147 1.3029 66.3681 <.0001
Intercept 2 1 -8.1095 1.1415 50.4743 <.0001
Intercept 1 1 -5.7256 0.9818 34.0116 <.0001
l2ha 1 1.2368 0.1766 49.0723 <.0001
ha 1 -0.00141 0.000419 11.2724 0.0008
Other types of regression 36
PRO logarithm:
• the logarithmic transformation gives the strongest significance
• the logarithmic transformation presumably also gives fewer ’influential observations’ – because of the less skewed distribution
Other types of regression 37
PRO logarithm:
• using ha still adds information, so the model is not satisfactory, but the small and negative coefficient for ha shows that the untransformed ha-variable serves to flatten the effect in the upper end of ha even more than the log-transformation of ha does!
[computational examples: log(OR) comparing ha=200 with ha=100 is
1.2368·(log2(200) − log2(100)) − 0.00141·(200 − 100) = 1.2368 − 0.141 = 1.1,
while log(OR) comparing ha=2000 with ha=1000 is
1.2368·(log2(2000) − log2(1000)) − 0.00141·(2000 − 1000) = 1.2368 − 1.41 = −0.17]
CONTRA logarithm:
• the assumption of proportional odds gets worse
Conclusion:
• Log-transformation is more appropriate, but not perfect!
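The two computational examples for the model containing both l2ha and ha can be reproduced with the sketch below. The coefficients are copied from the output for that model; the function name log_or is chosen here for illustration.

```python
import math

# Estimates from the model with both log2(ha) and ha as covariates.
B_L2HA = 1.2368
B_HA = -0.00141


def log_or(ha_high, ha_low):
    """log(OR) comparing two ha values in the combined model."""
    return (B_L2HA * (math.log2(ha_high) - math.log2(ha_low))
            + B_HA * (ha_high - ha_low))


doubling_low = log_or(200, 100)     # a doubling at the low end of ha
doubling_high = log_or(2000, 1000)  # a doubling at the high end of ha
print(round(doubling_low, 2), round(doubling_high, 2))
```

A doubling of ha contributes the same amount through the log2 term in both cases; only the linear term differs, and it is what pulls the effect down at high ha values.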
Other types of regression 38
Inclusion of all covariates:
DATA fibrosis;
SET fibrosis;
l2ha=LOG2(ha);
l2ykl40=LOG2(ykl40);
l2pIIInp=LOG2(pIIInp);
RUN;
PROC LOGISTIC DATA=fibrosis DESCENDING;
MODEL degree_fibr=l2ha l2ykl40 l2pIIInp
/ LINK=LOGIT CLODDS=PL;
RUN;
Other types of regression 39
Score Test for the Proportional Odds Assumption
Chi-Square DF Pr > ChiSq
9.6967 6 0.1380
Analysis of Maximum Likelihood Estimates
Standard Wald
Parameter DF Estimate Error Chi-Square Pr > ChiSq
Intercept 3 1 -12.7767 1.6959 56.7592 <.0001
Intercept 2 1 -10.0117 1.5171 43.5506 <.0001
Intercept 1 1 -7.5922 1.3748 30.4975 <.0001
l2ha 1 0.3889 0.1600 5.9055 0.0151
l2ykl40 1 0.5430 0.1700 10.2031 0.0014
l2pIIInp 1 0.8225 0.2524 10.6158 0.0011
Other types of regression 40
Odds Ratio Estimates
Point 95% Wald
Effect Estimate Confidence Limits
l2ha 1.475 1.078 2.019
l2ykl40 1.721 1.233 2.402
l2pIIInp 2.276 1.388 3.733
Profile Likelihood Confidence Interval for Adjusted Odds Ratios
Effect Unit Estimate 95% Confidence Limits
l2ha 1.0000 1.475 1.073 2.062
l2ykl40 1.0000 1.721 1.246 2.403
l2pIIInp 1.0000 2.276 1.375 3.829
Other types of regression 41
Model control for proportional odds model
1. Check of linearity
• include both x and log(x) (or a quadratic term or
linear spline or ...)
2. Check the assumption of identical slopes (bk)
for each choice of threshold (k)
(a) formal test in LOGISTIC output
(b) make separate logistic regressions for each choice of
threshold and compare the estimated coefficients
3. General fit of separate logistic regressions
• use option LACKFIT in MODEL statement in LOGISTIC
Other types of regression 42
Separate outcome-variable definition for each
possible threshold:
DATA fibrosis;
INFILE "julia.tal";
INPUT id degree_fibr ykl40 pIIInp ha;
IF degree_fibr<0 THEN DELETE;
l2ykl40=LOG2(ykl40);
l2pIIInp=LOG2(pIIInp);
l2ha=LOG2(ha);
fibrosis3=(degree_fibr=3);
fibrosis23=(degree_fibr>=2);
fibrosis123=(degree_fibr>=1);
RUN;
Other types of regression 43
Example of analysis with extract of the output (cut point between 1 and 2):
PROC LOGISTIC DATA=fibrosis DESCENDING;
MODEL fibrosis23=l2ha l2ykl40 l2pIIInp
/ LINK=LOGIT CLODDS=PL LACKFIT;
RUN;
Response Profile
Ordered Total
Value fibrosis23 Frequency
1 1 62
2 0 64
Probability modeled is fibrosis23=1.
Analysis of Maximum Likelihood Estimates
Standard Wald
Parameter DF Estimate Error Chi-Square Pr > ChiSq
Intercept 1 -12.5746 2.4701 25.9150 <.0001
l2ha 1 0.5842 0.2654 4.8446 0.0277
l2ykl40 1 0.5262 0.2595 4.1122 0.0426
l2pIIInp 1 1.2716 0.4256 8.9265 0.0028
Other types of regression 44
Check of general fit for standard logistic regression,
the LACKFIT-option:
• Splits the observations into 10 groups,
sorted according to increasing predicted probability
• compares observed and expected number of 1’s
• adds up to a χ2 (chi-square) statistic
Other types of regression 45
LACKFIT for threshold between 1 and 2:
Partition for the Hosmer and Lemeshow Test
fibrosis23 = 1 fibrosis23 = 0
Group Total Observed Expected Observed Expected
1 13 1 0.25 12 12.75
2 13 0 0.53 13 12.47
3 13 1 1.01 12 11.99
4 13 0 2.04 13 10.96
5 13 8 5.99 5 7.01
6 13 8 8.38 5 4.62
7 13 11 10.39 2 2.61
8 13 12 11.84 1 1.16
9 13 12 12.63 1 0.37
10 9 9 8.95 0 0.05
Hosmer and Lemeshow Goodness-of-Fit Test
Chi-Square DF Pr > ChiSq
7.8455 8 0.4487
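The test statistic can be approximately rebuilt from the partition table by summing (observed − expected)² / expected over both outcome columns. The sketch below does this in Python; because the expected counts in the printed table are rounded to two decimals, the result is close to, but not exactly equal to, the value 7.8455 reported by SAS. The names are illustrative.

```python
# (observed 1's, expected 1's, observed 0's, expected 0's) for the 10 groups,
# copied from the partition table.
groups = [
    (1, 0.25, 12, 12.75),
    (0, 0.53, 13, 12.47),
    (1, 1.01, 12, 11.99),
    (0, 2.04, 13, 10.96),
    (8, 5.99, 5, 7.01),
    (8, 8.38, 5, 4.62),
    (11, 10.39, 2, 2.61),
    (12, 11.84, 1, 1.16),
    (12, 12.63, 1, 0.37),
    (9, 8.95, 0, 0.05),
]


def hosmer_lemeshow(rows):
    """Sum of (observed - expected)^2 / expected over both outcome columns."""
    statistic = 0.0
    for obs1, exp1, obs0, exp0 in rows:
        statistic += (obs1 - exp1) ** 2 / exp1 + (obs0 - exp0) ** 2 / exp0
    return statistic


hl_statistic = hosmer_lemeshow(groups)
degrees_of_freedom = len(groups) - 2  # number of groups minus 2
print(round(hl_statistic, 2), degrees_of_freedom)
```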
Other types of regression 46
What about the individual patient’s probability of having
a specific degree of fibrosis?
Yi: the observed degree of fibrosis for the i’th patient.
We wish to specify the probabilities
pik = P (Yi = k), k = 0, 1, 2, 3
Since pi0 + pi1 + pi2 + pi3 = 1,
we have a total of 3 free parameters for each individual.
Other types of regression 47
The relation between the probabilities qi from the ordinal
regression and the probabilities for each degree of fibrosis,
from the top:
• split between 2 and 3: model for qi3 = pi3
• split between 1 and 2: model for qi2 = pi2 + pi3
• split between 0 and 1: model for qi1 = pi1 + pi2 + pi3
Probabilities for each degree of fibrosis (k) can be
calculated as successive differences:
p3(x) = q3(x)
pk(x) = qk(x) − qk+1(x), k = 0, 1, 2, with q0(x) = 1
Other types of regression 48
Calculation of probabilities for each single degree of fibrosis:
PROC LOGISTIC DATA=fibrosis DESCENDING;
MODEL degree_fibr=l2ha
/ LINK=LOGIT;
OUTPUT OUT=new PRED=q_hat;
RUN;
Part of the SAS data set ’new’:
degree_
Obs id fibr ykl40 pIIInp ha _LEVEL_ q_hat
1 58 0 105 4.2 25 3 0.01234
2 58 0 105 4.2 25 2 0.12783
3 58 0 105 4.2 25 1 0.55512
4 79 0 111 3.5 25 3 0.01234
5 79 0 111 3.5 25 2 0.12783
6 79 0 111 3.5 25 1 0.55512
7 140 0 125 3.0 25 3 0.01234
8 140 0 125 3.0 25 2 0.12783
9 140 0 125 3.0 25 1 0.55512
Other types of regression 49
Additional data manipulations are necessary for the calculation of the probabilities for each single degree of fibrosis:
DATA b3;
SET new; IF _LEVEL_=3;
pred3=q_hat;
RUN;
DATA b2;
SET new; IF _LEVEL_=2;
pred2=q_hat;
RUN;
DATA b1;
SET new; IF _LEVEL_=1;
pred1=q_hat;
RUN;
DATA b123;
MERGE b1 b2 b3;
prob3=pred3;
prob2=pred2-pred3;
prob1=pred1-pred2;
prob0=1-pred1;
RUN;
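The same successive differences can be sketched outside SAS. The Python example below first rebuilds the cumulative probabilities for a patient with ha=25 from the estimates of the l2ha model (intercepts and slope copied from the earlier output) and then takes differences. The function and variable names are illustrative, and small deviations from the q_hat values in the data set 'new' are due to rounding of the printed estimates.

```python
import math

# Estimates from the proportional odds model with log2(ha) as covariate.
INTERCEPTS = {3: -8.3978, 2: -5.9352, 1: -3.7936}
SLOPE_L2HA = 0.8646


def cumulative_probs(ha):
    """q_k = P(degree >= k) for k = 1, 2, 3 at a given ha value."""
    eta = SLOPE_L2HA * math.log2(ha)
    return {k: 1 / (1 + math.exp(-(a + eta))) for k, a in INTERCEPTS.items()}


def category_probs(q):
    """Probability of each single degree as successive differences."""
    return {3: q[3], 2: q[2] - q[3], 1: q[1] - q[2], 0: 1 - q[1]}


q = cumulative_probs(25)
p = category_probs(q)
for degree in (0, 1, 2, 3):
    print(degree, round(p[degree], 4))
```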
Other types of regression 50
N
degree_fibr Obs Variable Mean Minimum Maximum
--------------------------------------------------------------------------
0 27 prob0 0.3726241 0.0963218 0.4990271
prob1 0.4435401 0.3794058 0.4893529
prob2 0.1632555 0.0955353 0.4384231
prob3 0.0205803 0.0099489 0.0858492
1 40 prob0 0.2747253 0.0021096 0.4448836
prob1 0.4076629 0.0155693 0.4893813
prob2 0.2453258 0.1154979 0.5440290
prob3 0.0722860 0.0123361 0.8256314
2 42 prob0 0.0807921 0.0019901 0.4448836
prob1 0.2552589 0.0147024 0.4775774
prob2 0.4264182 0.1154979 0.5473816
prob3 0.2375308 0.0123361 0.8338815
3 20 prob0 0.0473404 0.0011570 0.1180147
prob1 0.2170934 0.0086076 0.4145010
prob2 0.4300113 0.0939507 0.5479358
prob3 0.3055550 0.0696023 0.8962847
--------------------------------------------------------------------------
Other types of regression 51
Poisson distribution:
• distribution on the numbers 0, 1, 2, 3, . . .
• limit of binomial distribution for N large, p small,
mean: M = Np
– e.g., CNS cancer cases among registered cell phone
users
• probability of k events: P(Y = k) = e^(−M) · M^k / k!
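The limit statement can be illustrated numerically: for large N and small p, binomial probabilities are close to Poisson probabilities with mean M = Np. The numbers below (M = 3, N = 100000) are hypothetical and chosen only for the illustration.

```python
import math


def poisson_pmf(k, mean):
    """P(Y = k) for a Poisson distribution with the given mean."""
    return math.exp(-mean) * mean ** k / math.factorial(k)


def binomial_pmf(k, n, p):
    """P(Y = k) for a binomial distribution with n trials and probability p."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)


mean = 3.0
n = 100_000
largest_gap = max(abs(poisson_pmf(k, mean) - binomial_pmf(k, n, mean / n))
                  for k in range(15))
print(largest_gap)
```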
Example: Positive swabs for 90 individuals from 18 families
Other types of regression 52
Other types of regression 53
Illustration of family profiles
[Figure: family profiles; the plot uses the symbols O, C and U]
Other types of regression 54
We observe counts (we ignore the grouping of families here)
Yfn ∼ Poisson(Mfn)
Additive model,
corresponding to two-way ANOVA in family and name:
log(Mfn) = M + af + bn
PROC GENMOD;
CLASS family name;
MODEL swabs=family name /
DIST=POISSON LINK=LOG CL;
RUN;
Other types of regression 55
The GENMOD Procedure
Model Information
Data Set WORK.A0
Distribution Poisson
Link Function Log
Dependent Variable swabs
Observations Used 90
Missing Values 1
Class Level Information
Class Levels Values
family 18 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
name 5 child1 child2 child3 father mother
Other types of regression 56
Analysis Of Parameter Estimates
Standard Wald 95% Chi-
Parameter DF Estimate Error Confidence Limits Square Pr > ChiSq
Intercept 1 1.5263 0.1845 1.1647 1.8879 68.43 <.0001
family 1 1 0.4636 0.2044 0.0630 0.8641 5.14 0.0233
family 2 1 0.9214 0.1893 0.5503 1.2925 23.68 <.0001
family 3 1 0.4473 0.2050 0.0455 0.8492 4.76 0.0291
. . . . . . . . .
. . . . . . . . .
family 16 1 0.2283 0.2146 -0.1923 0.6488 1.13 0.2875
family 17 1 -0.5725 0.2666 -1.0951 -0.0499 4.61 0.0318
family 18 0 0.0000 0.0000 0.0000 0.0000 . .
name child1 1 0.3228 0.1281 0.0716 0.5739 6.34 0.0118
name child2 1 0.8990 0.1158 0.6721 1.1259 60.31 <.0001
name child3 1 0.9664 0.1147 0.7417 1.1912 71.04 <.0001
name father 1 0.0095 0.1377 -0.2604 0.2793 0.00 0.9451
name mother 0 0.0000 0.0000 0.0000 0.0000 . .
Scale 0 1.0000 0.0000 1.0000 1.0000
NOTE: The scale parameter was held fixed.
Other types of regression 57
Interpretation of Poisson analysis:
• The family-parameters are uninteresting
• The name-parameters are interesting
• The mothers serve as the reference group
• The model is additive on a logarithmic scale, that is,
multiplicative on the original scale
Other types of regression 58
Parameter estimates:
name estimate (CI) ratio (CI)
child1 0.3228 (0.0716, 0.5739) 1.38 (1.07, 1.78)
child2 0.8990 (0.6721, 1.1259) 2.46 (1.96, 3.08)
child3 0.9664 (0.7417, 1.1912) 2.63 (2.10, 3.29)
father 0.0095 (-0.2604, 0.2793) 1.01 (0.77, 1.32)
mother - -
Interpretation:
The youngest children have a 2-3 fold higher expected number of positive swabs, compared to their mother
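Since the model is multiplicative on the original scale, the ratios in the table are simply the exponentials of the estimates and their confidence limits. A minimal Python check, with the numbers copied from the GENMOD output and illustrative variable names:

```python
import math

# name: (estimate, lower limit, upper limit) on the log scale; mother = reference
log_estimates = {
    "child1": (0.3228, 0.0716, 0.5739),
    "child2": (0.8990, 0.6721, 1.1259),
    "child3": (0.9664, 0.7417, 1.1912),
    "father": (0.0095, -0.2604, 0.2793),
}

# Ratio of expected counts relative to the mother, with confidence limits.
ratios = {name: tuple(math.exp(v) for v in values)
          for name, values in log_estimates.items()}

for name, (ratio, lower, upper) in ratios.items():
    print(f"{name}: {ratio:.2f} ({lower:.2f}, {upper:.2f})")
```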
Other types of regression 59
Survival analysis methods
Time-to-event data (censored “survival” data)
Examples:
• Time from diagnosis/start of treatment to death
• Time from first job to retirement
• Time from start of fertility treatment to pregnancy
Unlike most other types of outcome, there is a natural
focus on the probability that the outcome equals T,
given that the outcome is at least T (hazard or rate)
Other types of regression 60
Special issues with these data are:
• No specific idea about the distribution of the event times
• Time-to-event data are very often censored, that is, for some individuals we only know that the event happens after a specific time point:
– when evaluating the results, the relevant event had not yet occurred
– patients withdraw from the study due to, for example, moving away (or other causes unrelated to the event under study)
• Possibly delayed entry – some are not at risk for being observed with the event in the study from the start
Other types of regression 61
Consequences of censoring:
• Descriptive statistics:
– We cannot use histograms, averages etc. (perhaps medians)
– Use instead the Kaplan-Meier estimator, a non-parametric estimator of the entire distribution of “survival” times,
S(t) = prob(T > t)
the probability of “surviving” (= not yet having experienced the event) at least until time t
• Statistical inference
– analysis of variance corresponds to the log rank test
– normal regression models correspond to Cox’s proportional hazards regression models
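To show how the Kaplan-Meier estimator handles censoring, here is a minimal Python sketch. The five follow-up times are hypothetical and are not taken from any study in these slides; the function name is illustrative, and delayed entry is not handled.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct time where an event occurred.

    times  : follow-up time for each individual
    events : 1 if the event was observed at that time, 0 if censored
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        n_events = 0
        n_leaving = 0
        # Collect everybody whose follow-up ends at time t.
        while i < len(data) and data[i][0] == t:
            n_events += data[i][1]
            n_leaving += 1
            i += 1
        if n_events > 0:
            survival *= 1 - n_events / at_risk  # step down at an event time
            curve.append((t, survival))
        at_risk -= n_leaving  # censored individuals leave the risk set too
    return curve


# Hypothetical data: events at times 2, 3 and 5; censoring at times 3 and 8.
km_curve = kaplan_meier([2, 3, 3, 5, 8], [1, 1, 0, 1, 0])
print(km_curve)
```

Censored individuals contribute to the number at risk up to their censoring time but never cause a step, which is what a plain average or histogram of the observed times cannot account for.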
Other types of regression 62
Step curve: a step down each time an event occurs.
The mathematical relation between “survival” probability and the cumulative rate:
Sg(T) = exp(−Rg(T))
Rg(T) = −ln(Sg(T))
(Piecewise) constant rate gives (piecewise) linear cumulative rate, and then a Poisson model would make better use of the available data
Other types of regression 63
Calculations of survival curve and cumulative rate
Other types of regression 64
Proportional hazards
The hazard (instantaneous rate) function is defined as:
r(t) ≈ P(the event happens immediately after time t | at risk at time t)
When comparing two groups, the hazard ratio (rate ratio) = rA(t)/rB(t) is usually assumed to be constant over time, that is, the effect of the treatment is the same just after treatment as it is later on in life.
Other types of regression 65
Cox’s proportional hazards regression model
’Treatment vs. control’ may be considered as a binary explanatory variable:
x1 = 1 for the active treatment group
x1 = 0 for the control group
log r(t) = log r0(t) + b1x1
If we have several additional explanatory variables, we simply generalize our regression model accordingly:
log r(t) = b0(t) + b1x1 + b2x2 + · · ·+ bkxk.
Here, b0(t) describes how the (log-transformed) rate depends on time for
all values of the explanatory variables in the model
Other types of regression 66
Choose a relevant time scale!
• The advantage of the Cox model is that it allows for any kind of relation between the rate and the underlying time scale, but the ratio between the rates for any two patients at any particular time point is only allowed to depend upon the covariates.
• Characteristic of a relevant time scale: There must be a good reason to assume that time since time=0 has a large (and “identical”) effect on the rate for all patients; otherwise a constant underlying rate is the only meaningful possibility, and in that case, the data can be better utilized by performing a Poisson regression.
Other time scales may enter as covariates in the Cox model. If the dependence on another time scale cannot be assumed to follow the pattern “one year more always means the same thing”, then you must use time-dependent covariates or stratify.
Other types of regression 67
Time scales
Examples of time scales
• age
• calendar time
• time since beginning of a disease
• time from some other event of great importance for the rate (here time from termination of latest bleeding)
• time from randomization (problematic if one of the treatments is placebo)
The only difference for the single individual is the definition of time=0, but it can make a big difference for the results, because it has an influence on which individuals are considered “at risk” when something happens.
Other types of regression 68
Example of survival data (Altman, 1991).
Other types of regression 69
Example of survival data (Altman, 1991).
Other types of regression 70
Example: Randomized study of the effect of sclerotherapy
An investigation of 187 patients with bleeding oesophageal varices caused by
cirrhosis of the liver (EVASP study). During the hospital admission for the
first variceal bleeding, the patients were randomized into one of two groups:
1. standard medical treatment (n=94)
2. standard treatment supplemented with sclerotherapy (n=93)
• We want to investigate whether sclerotherapy changes the risk of
re-bleeding (after cessation of first bleeding, by definition)
• Delayed entry at time of randomization because time=0 when first
bleeding ceases, which may be before randomisation. Patients
rebleeding before randomization cannot be entered into the study [so a
rebleeding before randomisation cannot be observed in the study]
• We also have an important covariate bilirubin (measures liver function)
Other types of regression 71
PROC PHREG DATA=scl;
MODEL tnotbld*bld(0) = log2bili sclero
/ ENTRYTIME=t_entry RISKLIMITS;
RUN;
Model Information
Data Set WORK.SCL
Entry Time Variable t_entry
Dependent Variable tnotbld
Censoring Variable bld
Censoring Value(s) 0
Ties Handling BRESLOW
Percent
Total Event Censored Censored
149 86 63 42.28
:
Analysis of Maximum Likelihood Estimates
Parameter Standard Hazard 95% Hazard Ratio
Variable Estimate Error Chi-Sq. Pr>ChiSq Ratio Confidence Limits
log2bili 0.43431 0.09580 20.5534 <.0001 1.544 1.280 1.863
sclero -0.16470 0.21682 0.5770 0.4475 0.848 0.555 1.297
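The hazard ratios and their confidence limits in the PHREG output follow from the estimates and standard errors. The sketch below assumes the usual Wald construction, exp(estimate ± 1.96 · standard error); the function name is illustrative.

```python
import math

Z_975 = 1.959964  # 97.5% quantile of the standard normal distribution


def hazard_ratio(estimate, std_error):
    """Hazard ratio with Wald 95% confidence limits (assumed construction)."""
    lower = math.exp(estimate - Z_975 * std_error)
    upper = math.exp(estimate + Z_975 * std_error)
    return math.exp(estimate), lower, upper


hr_log2bili = hazard_ratio(0.43431, 0.09580)  # per doubling of bilirubin
hr_sclero = hazard_ratio(-0.16470, 0.21682)   # sclerotherapy vs. standard
print([round(v, 3) for v in hr_log2bili])
print([round(v, 3) for v in hr_sclero])
```

Because bilirubin enters as log2bili, its hazard ratio refers to a doubling of bilirubin. The interval for sclero contains 1, in line with the p-value of 0.4475.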
Other types of regression 72
Model control in the Cox model
The Cox model is based on the assumption of proportional rates, so
R(t;X) = R0(t)·exp(bX)
and
ln(R(t;X)) = ln(R0(t)) + bX
Graphical check of proportional rates: Stratify for each variable separately and plot ln(Rstratum(t)) = ln(−ln(Sstratum(t))); the curves should be approximately parallel.
Other types of regression 73
Other types of censored data: Detection limit
Measurements of NO2 indoor and outdoor
85 pairs of measurements of NO2
1. outside front door
2. in the bedroom
with a detection limit of 0.75. (Raaschou-Nielsen et al., 1997).
How does indoor concentration depend on outdoor concentration?
Other types of regression 74
For the concentrations below the detection limit, we only know the upper limit (the upper limit is used in the plot).
If the upper limits are used as the observations in the statistical analysis, the estimated association will be biased!
With SAS, it is possible to obtain correct estimates of the association.
Other types of regression 75
Example of SAS programming statements
DATA no2; SET no2;
upper_limit = indoor;
IF indoor=0.75 THEN lower_limit = .;
ELSE lower_limit = indoor;
* No outdoor measurement below detection limit ;
outdoor_25=outdoor-2.5; * median(outdoor)=2.5 ;
RUN;
PROC LIFEREG DATA=no2;
MODEL (lower_limit, upper_limit) = outdoor_25
/ DIST=NORMAL NOLOG;
RUN;
(CLASS-statement can be used)
Other types of regression 76
The LIFEREG Procedure
Model Information
Data Set WORK.NO2
Dependent Variable lower_limit
Dependent Variable upper_limit
Number of Observations 85
Noncensored Values 60
Right Censored Values 0
Left Censored Values 25
Interval Censored Values 0
Name of Distribution Normal
Log Likelihood -35.88065877
Algorithm converged.
Type III Analysis of Effects
Wald
Effect DF Chi-Square Pr > ChiSq
outdoor_25 1 177.8626 <.0001
Analysis of Parameter Estimates
Standard 95% Confidence
Parameter DF Estimate Error Limits Chi-Square Pr > ChiSq
Intercept 1 1.5203 0.0431 1.4359 1.6047 1245.07 <.0001
outdoor_25 1 0.7845 0.0588 0.6692 0.8997 177.86 <.0001
Scale 1 0.3403 0.0320 0.2830 0.4092
Other types of regression 77
Estimation of standard deviation
scale=maximum likelihood estimate of the standard deviation (SD)
To obtain a statistic comparable to the usual estimate (“ROOT MSE” in SAS output) some adjustment for the degrees of freedom is necessary:
SD = scale · sqrt(n / (n − k − 1))
where n = number of observations, and k = number of estimated parameters (not counting the intercept or the scale parameter).
In the example SD = 0.340 · sqrt(85/83) = 0.344.
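The degrees-of-freedom adjustment is a one-line calculation. A minimal Python sketch, using the scale estimate from the LIFEREG output and an illustrative function name:

```python
import math


def adjusted_sd(scale, n, k):
    """Rescale the ML scale estimate to be comparable to ROOT MSE.

    n : number of observations
    k : number of estimated regression parameters, not counting the
        intercept or the scale parameter
    """
    return scale * math.sqrt(n / (n - k - 1))


# NO2 example: 85 pairs of measurements, one covariate (outdoor_25).
sd = adjusted_sd(0.3403, n=85, k=1)
print(round(sd, 3))
```

With 85 observations the adjustment is small; it matters more in small samples or in models with many covariates.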