De nihilo nihil - Christian-Albrechts-Universität zu … nihilo nihil Statistical Modelling Causal...

Post on 15-Jul-2019


De nihilo nihil

Statistical Modelling

Causal Relationships

response: blood pressure

explanatory: caffeine intake

disturbing factor: body weight

disturbing factor: cigarette smoke

causation vs. association

Statistical Modelling

... entails the analysis of the functional relationship between response (or dependent) variables and explanatory (or independent) variables, including adjustment for uncontrollable disturbing factors.

Experimental Modelling: experimental evaluation of the effects of given explanatory variables upon a response variable, involving either randomisation or matching (or 'control') for known disturbing factors (e.g. temperature and humidity as determinants of the adhesion of dental prostheses)

Observational Modelling: observation-based analysis of the relationship between a response variable and several explanatory variables (e.g. birth weight and gestational age)

Statistical Modelling Basic Approaches

Linear Models

Y: response variable
X1,...,Xk: explanatory variables
Ε: random error

$$Y = a + b_1 x_1 + b_2 x_2 + \dots + b_k x_k + Ε$$

Ε is generally assumed to be N(0,σ²) with unknown σ.

Use of multiple linear (and other) models allows the regression coefficients bi to be estimated while taking the influence of disturbing factors into account ('adjustment').

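Linear Models

[Figure: number line marking 0, E(Y) and Y; ypred=a+b1x1+...+bkxk estimates E(Y), and the random error Ε is the deviation of Y from E(Y)]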

Miss America Body Features 1984 - 2002

[Figure: scatter plot of body weight (pounds, 90 to 150) against body height (inches, 62 to 72) with fitted regression line]

y: body weight (pounds), x1: body height (inches)

ypred=-111.29+3.44⋅x1
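A minimal sketch of how the fitted line is used for prediction. The function name `predict_weight` and the example heights are assumptions chosen for illustration; the two coefficients are those quoted on the slide.

```python
def predict_weight(height_inches: float) -> float:
    """Predicted body weight (pounds) from ypred = -111.29 + 3.44 * x1."""
    intercept = -111.29  # a, as quoted on the slide
    slope = 3.44         # b1, as quoted on the slide
    return intercept + slope * height_inches


# Illustrative heights inside the plotted range of 62 to 72 inches
for height in (64, 66, 68):
    print(f"height {height} in: predicted weight {predict_weight(height):.2f} lb")
```

The prediction is only meaningful within the observed height range; extrapolating the line far outside it is not supported by the data.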

Statistical Modelling Procedure

1. Data Exploration: isolated assessment of the possible relevance of each explanatory variable

2. Model Formulation: mathematical modelling of the multifaceted relationship between explanatory and response variables, invoking scientific plausibility

3. Model Selection: parameter estimation ('regression') and hypothesis testing (e.g. likelihood ratio, p value, coefficient of determination)

4. Model Checking: comparison between model predictions and observations ('residual diagnostics')

Prediction of Body Fat Percentage

Body fat percentage can be determined by dual energy X-ray absorptiometry (DXA), a fairly accurate but time-consuming and expensive technique. On the other hand, measurements of triceps skin fold thickness and of thigh and mid arm circumference may not be as accurate as DXA, but they are quicker and cheaper to perform.

from: J. Neter, W. Wasserman, M.H.Kutner (1997) Applied Linear Statistical Models

response: body fat
explanatory: skin fold, thigh, mid arm

body fat (%)   skin fold (mm)   thigh (cm)   mid arm (cm)
Y              X1               X2           X3
11.9           19.5             43.1         29.1
22.8           24.7             49.8         28.2
18.7           30.7             51.9         37.0
20.1           29.8             54.3         31.1
12.9           19.1             42.2         30.9
21.7           25.6             53.9         23.7
27.1           31.4             58.5         27.6
...            ...              ...          ...

Variables Y, X1,...,X3 were measured simultaneously in 20 individuals.

Multiple Linear Regression: Data Exploration

pair-wise Pearson correlation coefficients r (upper right half) and two-sided p values for r=0 (lower left half)

      Y        X1       X2       X3
Y     -        0.843    0.878    0.142
X1    <0.001   -        0.924    0.458
X2    <0.001   <0.001   -        0.085
X3    0.549    0.042    0.723    -

Multiple Linear Regression: Data Exploration

[Figure: scatter plot of body fat (%) against skin fold thickness (mm) with fitted regression line]

y: body fat (%), x1: skin fold thickness (mm)

ypred=-1.496+0.857⋅x1

R2=0.711

Multiple Linear Regression: Data Exploration

[Figure: scatter plot of body fat (%) against thigh circumference (cm) with fitted regression line]

y: body fat (%), x2: thigh circumference (cm)

ypred=-23.634+0.857⋅x2

R2=0.771

Multiple Linear Regression: Data Exploration

[Figure: scatter plot of body fat (%) against mid arm circumference (cm) with fitted regression line]

y: body fat (%), x3: mid arm circumference (cm)

ypred=14.687+0.199⋅x3

R2=0.020
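For a regression on a single explanatory variable, the coefficient of determination equals the squared Pearson correlation. The short check below, a sketch using only the rounded values quoted in the correlation table and in the three single-variable fits, illustrates this; the variable and function names are assumptions.

```python
# Pearson correlations of each explanatory variable with body fat (Y),
# as quoted in the correlation table
pearson_r = {"skin fold": 0.843, "thigh": 0.878, "mid arm": 0.142}

# R2 of each single-variable regression, as quoted with the fitted lines
reported_r2 = {"skin fold": 0.711, "thigh": 0.771, "mid arm": 0.020}


def r_squared_from_r(r: float) -> float:
    """R2 of a simple linear regression is the square of Pearson's r."""
    return r * r


for name, r in pearson_r.items():
    computed = r_squared_from_r(r)
    print(f"{name}: r^2 = {computed:.3f}, reported R2 = {reported_r2[name]:.3f}")
```

Small differences in the last digit can arise because the inputs are rounded to three decimals.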

Multiple Linear Regression: Model Formulation

$$Y = a + b_1 x_1 + b_2 x_2 + b_3 x_3 + Ε$$

linear model with normal error Ε

Model Selection

Backward Selection: stepwise reduction of the number of explanatory variables, starting with the "full" model

Forward Selection: stepwise inclusion of explanatory variables, starting with the best variable (e.g. that with the smallest p value)

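Multiple Linear Regression: Model (Backward) Selection

Parameter estimation from the model equations using maximum likelihood or least squares methods

$$
\begin{aligned}
y_1 &= a + b_1 x_{1,1} + b_2 x_{1,2} + b_3 x_{1,3} + \varepsilon_1 \\
y_2 &= a + b_1 x_{2,1} + b_2 x_{2,2} + b_3 x_{2,3} + \varepsilon_2 \\
&\;\;\vdots \\
y_{20} &= a + b_1 x_{20,1} + b_2 x_{20,2} + b_3 x_{20,3} + \varepsilon_{20}
\end{aligned}
$$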
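The least squares estimates can be obtained by solving the normal equations. The sketch below is one possible pure-Python implementation, not the method used to produce the slides; the function names and the noise-free toy data are assumptions made for illustration.

```python
from typing import List, Sequence


def solve_linear_system(matrix: List[List[float]], rhs: List[float]) -> List[float]:
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    # Work on an augmented copy so the inputs are not modified.
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular system: explanatory variables are collinear")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    # Back substitution
    solution = [0.0] * n
    for i in range(n - 1, -1, -1):
        known = sum(aug[i][j] * solution[j] for j in range(i + 1, n))
        solution[i] = (aug[i][n] - known) / aug[i][i]
    return solution


def least_squares(xs: Sequence[Sequence[float]], ys: Sequence[float]) -> List[float]:
    """Return [a, b1, ..., bk] minimising the sum of squared errors."""
    design = [[1.0, *row] for row in xs]  # leading 1 carries the intercept a
    p = len(design[0])
    # Normal equations: (X^T X) beta = X^T y
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * y for r, y in zip(design, ys)) for i in range(p)]
    return solve_linear_system(xtx, xty)


# Hypothetical noise-free toy data generated from y = 2 + 3*x1 - 1*x2
xs = [(1.0, 2.0), (2.0, 1.0), (3.0, 5.0), (4.0, 3.0), (5.0, 8.0)]
ys = [2.0 + 3.0 * x1 - 1.0 * x2 for x1, x2 in xs]
coefficients = least_squares(xs, ys)
print([round(c, 6) for c in coefficients])
```

With a normal error, the least squares and maximum likelihood estimates of the regression coefficients coincide, which is why the slide names both methods.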

Multiple Linear Regression: Full Model

term              estimate   s.e.
a (intercept)     117.085    99.782
b1 (skin fold)      4.334     3.016
b2 (thigh)         -2.857     2.582
b3 (mid arm)       -2.186     1.595

ypred=117.085+4.334⋅x1-2.857⋅x2-2.186⋅x3

R2=0.895

Multiple Linear Regression: Model (Backward) Selection

For each regression coefficient bi, test the null hypothesis Hi,0: bi=0 against the alternative Hi,A: bi≠0 using, for example, a Wald test.

$$W_i = \frac{\hat{b}_i}{\mathrm{s.e.}(\hat{b}_i)}$$

Since Wi∼N(0,1) under Hi,0, reject Hi,0 if |Wi| > z1-α/2.
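A sketch of the Wald statistic for the full body fat model, using the estimates and standard errors quoted in the full model table. The p values here use the N(0,1) reference distribution stated above; with only 20 observations they come out somewhat smaller than the tabulated p values, which would be consistent with a t reference distribution having been used for the table (an assumption, since the slides do not say). The function name `wald_test` is illustrative.

```python
import math
from typing import Tuple


def wald_test(estimate: float, std_error: float) -> Tuple[float, float]:
    """Return (W, two-sided p value) under the N(0,1) reference distribution."""
    w = estimate / std_error
    # Standard normal CDF at |W|, written with the error function
    phi = 0.5 * (1.0 + math.erf(abs(w) / math.sqrt(2.0)))
    return w, 2.0 * (1.0 - phi)


# (estimate, s.e.) pairs as quoted for the full model
full_model = {
    "a (intercept)": (117.085, 99.782),
    "b1 (skin fold)": (4.334, 3.016),
    "b2 (thigh)": (-2.857, 2.582),
    "b3 (mid arm)": (-2.186, 1.595),
}

for term, (estimate, std_error) in full_model.items():
    w, p = wald_test(estimate, std_error)
    print(f"{term}: W = {w:.3f}, normal-approximation p = {p:.3f}")
```

In backward selection, the term with the largest p value (here b2, thigh) is the first candidate for removal.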

Multiple Linear Regression: Model (Backward) Selection

term              W        p
a (intercept)      1.173   0.258
b1 (skin fold)     1.437   0.170
b2 (thigh)        -1.106   0.285
b3 (mid arm)      -1.370   0.190

Multiple Linear Regression: Final Model

term              estimate   s.e.
a (intercept)      6.792     4.488
b1 (skin fold)     1.001     0.128
b3 (mid arm)      -0.431     0.177

ypred=6.792+1.001⋅x1-0.431⋅x3

R2=0.887

Multiple Linear Regression: Final Model

term              W        p
a (intercept)      1.513   0.149
b1 (skin fold)     7.803   <0.001
b3 (mid arm)      -2.442   0.026

Multiple Linear Regression: Model Checking

verification whether the (random) error Ε is N(0,σ²)

'standardized residuals':

$$\varepsilon_i = \frac{y_i - y_{\mathrm{pred},i}}{s_{y - y_{\mathrm{pred}}}}$$

[Figure: standardized residuals (-2.0 to 2.0) plotted against body fat (%) (10 to 30)]
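A sketch of how standardized residuals could be computed for the final model ypred=6.792+1.001⋅x1-0.431⋅x3. Only the seven individuals listed in the data table are used, so the residual standard deviation `s` estimated here (with n minus 3 degrees of freedom) is an assumption for illustration and will differ from the value based on all 20 individuals.

```python
import math

# The seven listed individuals: (Y, X1, X2, X3)
rows = [
    (11.9, 19.5, 43.1, 29.1),
    (22.8, 24.7, 49.8, 28.2),
    (18.7, 30.7, 51.9, 37.0),
    (20.1, 29.8, 54.3, 31.1),
    (12.9, 19.1, 42.2, 30.9),
    (21.7, 25.6, 53.9, 23.7),
    (27.1, 31.4, 58.5, 27.6),
]


def predict(x1: float, x3: float) -> float:
    """Final model: ypred = 6.792 + 1.001*x1 - 0.431*x3."""
    return 6.792 + 1.001 * x1 - 0.431 * x3


# Raw residuals y - ypred (thigh, X2, is not part of the final model)
residuals = [y - predict(x1, x3) for (y, x1, _x2, x3) in rows]

# ASSUMPTION: s is estimated from these seven residuals only,
# with three estimated parameters (a, b1, b3).
n_params = 3
s = math.sqrt(sum(e * e for e in residuals) / (len(residuals) - n_params))

standardized = [e / s for e in residuals]
for (y, *_), z in zip(rows, standardized):
    print(f"body fat {y:5.1f}%: standardized residual {z:+.2f}")
```

If the normal error assumption holds, most standardized residuals should fall between about -2 and 2 and show no systematic pattern against the response.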

Multiple Linear Regression: Model Checking

[Figure: four schematic plots, (a) to (d), of residuals (scattered around 0) against the response variable]

Other (Normal) Linear Models

Analysis of Variance (ANOVA): explanatory variables are either qualitative or quantitative, but discrete

Analysis of Covariance (ANCOVA): some explanatory variables are continuous, some are discrete (multiple regression)

Linear Models

Y: response variable
X1,...,Xk: explanatory variables
Ε: N(0,σ²) with unknown σ

$$Y = a + b_1 x_1 + b_2 x_2 + \dots + b_k x_k + Ε$$

$$E(Y) = a + b_1 x_1 + b_2 x_2 + \dots + b_k x_k + E(Ε)$$

and, since E(Ε) = 0,

$$E(Y) = a + b_1 x_1 + b_2 x_2 + \dots + b_k x_k$$

Generalised Linear Models

Y: response variable
X1,...,Xk: explanatory variables
G: link function

$$G[E(Y)] = a + b_1 x_1 + b_2 x_2 + \dots + b_k x_k$$

for a dichotomous response variable Y:
E(Y) = 0⋅P(Y=0)+1⋅P(Y=1) = P(Y=1) = π

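Logistic Regression

Generalised Linear Model with the 'logit' as the link function

$$\mathrm{logit}(\pi) = a + b_1 x_1 + b_2 x_2 + \dots + b_k x_k$$

$$\mathrm{logit}(x) = \ln\left(\frac{x}{1-x}\right)$$

[Figure: logit(x) plotted for x between 0.0 and 1.0, with values ranging from -6 to 6]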
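A minimal sketch of the logit link and its inverse, the logistic function. The function names `logit` and `inv_logit` are assumptions chosen for illustration.

```python
import math


def logit(p: float) -> float:
    """Log-odds ln(p / (1 - p)); defined only for 0 < p < 1."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    return math.log(p / (1.0 - p))


def inv_logit(x: float) -> float:
    """Inverse of the logit: 1 / (1 + exp(-x)), always between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))


for p in (0.1, 0.5, 0.9):
    x = logit(p)
    print(f"p = {p}: logit = {x:+.3f}, back-transformed = {inv_logit(x):.3f}")
```

The logit maps probabilities onto the whole real line, so the linear predictor a+b1x1+...+bkxk can take any value without producing a probability outside the interval from 0 to 1.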

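Logistic Regression: Adjusted Odds Ratio

Let X1 be a dichotomous explanatory variable (e.g. 1:"exposed", 0:"not exposed")

$$\mathrm{logit}(\pi_e) = a + b_1 \cdot 1 + b_2 x_2 + \dots + b_k x_k$$

$$\mathrm{logit}(\pi_n) = a + b_1 \cdot 0 + b_2 x_2 + \dots + b_k x_k$$

$$b_1 = \mathrm{logit}(\pi_e) - \mathrm{logit}(\pi_n) = \ln\left(\frac{\pi_e}{1-\pi_e}\right) - \ln\left(\frac{\pi_n}{1-\pi_n}\right) = \ln\left(\frac{\pi_e/(1-\pi_e)}{\pi_n/(1-\pi_n)}\right) = \ln(OR)$$

$$OR = \exp(b_1)$$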

The Evans County Heart Study

In 1960, the entire population of Evans County, Georgia, aged 40 and over was given a complete cardiovascular examination. Some 609 white males were followed for 9 years to determine their coronary heart disease (CHD) status.

Hames C (1971) Arch Intern Med 128: 883-886.

The Evans County Heart Study

Y: CHD status (dichotomous) 0:"no", 1:"yes"

x1: catecholamine level (CAT; dichotomous) 0:"low", 1:"high"

x2: age (years)

x3: cholesterol (CHL; mg/dL)

x4: smoking status (dichotomous) 0:"never smoker", 1:"ever smoker"

x5: hypertension (dichotomous) 0:"no", 1:"yes"

x6: ECG abnormalities (dichotomous) 0:"no", 1:"yes"

from: Kleinbaum DG (1994) Logistic Regression - A Self-Learning Text. Springer, New York

Logistic Regression: Data Exploration

                       CHD
explanatory variable   no (n=538)   yes (n=71)   p
CAT (%)                95 (18%)     27 (38%)     <0.001
age                    53 ± 9       57 ± 10      0.002
CHL                    210 ± 39     222 ± 39     0.021
smoking (%)            333 (62%)    54 (76%)     0.025
hypertension (%)       212 (39%)    43 (60%)     <0.001
ECG (%)                137 (26%)    29 (41%)     0.010

number and percentage, or mean±s.e., with p values from χ2-test or t-test, respectively

The Evans County Heart Study

Unadjusted Odds Ratios

CAT               CHD   no CHD
high              27    95
low               44    443

Smoking           CHD   no CHD
yes               54    333
no                17    205

Hypertension      CHD   no CHD
yes               43    212
no                28    326

ECG abnormality   CHD   no CHD
yes               29    137
no                42    401

ORCAT = (27⋅443)/(95⋅44) = 2.86

ORsmoke = (54⋅205)/(333⋅17) = 1.96

ORhyp = (43⋅326)/(212⋅28) = 2.36

ORECG = (29⋅401)/(137⋅42) = 2.02
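The unadjusted odds ratios follow directly from the four 2x2 tables as cross-product ratios. A short sketch, with the function name `odds_ratio` and the tuple layout chosen as assumptions for illustration:

```python
def odds_ratio(exposed_cases: int, exposed_noncases: int,
               unexposed_cases: int, unexposed_noncases: int) -> float:
    """Cross-product ratio of a 2x2 table: (a * d) / (b * c)."""
    return (exposed_cases * unexposed_noncases) / (exposed_noncases * unexposed_cases)


# (exposed CHD, exposed no CHD, unexposed CHD, unexposed no CHD)
tables = {
    "CAT": (27, 95, 44, 443),
    "smoking": (54, 333, 17, 205),
    "hypertension": (43, 212, 28, 326),
    "ECG": (29, 137, 42, 401),
}

unadjusted = {name: odds_ratio(*cells) for name, cells in tables.items()}
for name, value in unadjusted.items():
    print(f"OR {name}: {value:.2f}")
```

Each of these ratios ignores the other explanatory variables, which is why they are called unadjusted.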

Logistic Regression: Model Formulation

$$\mathrm{logit}(\pi) = a + b_1 x_1 + b_2 x_2 + \dots + b_6 x_6$$

logistic model with π=E(Y) equal to the 9-year incidence proportion (or "risk") of CHD

Logistic Regression: The Full Model

term                estimate   s.e.
a (intercept)       -6.772     1.140
b1 (CAT)             0.598     0.352
b2 (age)             0.032     0.015
b3 (CHL)             0.009     0.003
b4 (smoking)         0.834     0.305
b5 (hypertension)    0.439     0.291
b6 (ECG)             0.369     0.294

The Evans County Heart Study

Adjusted versus Unadjusted Odds Ratios

                               odds ratio
term                estimate   adjusted   unadjusted
b1 (CAT)            0.598      1.82       2.86
b4 (smoking)        0.834      2.30       1.96
b5 (hypertension)   0.439      1.55       2.36
b6 (ECG)            0.369      1.49       2.02
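The adjusted odds ratios are obtained by exponentiating the corresponding coefficients of the full model, OR = exp(b). A minimal check using the quoted estimates follows; the dictionary names are assumptions. For ECG, exp(0.369) evaluates to about 1.45, slightly below the tabulated 1.49, whereas the other three values reproduce the table.

```python
import math

# Full-model coefficients of the four dichotomous explanatory variables
coefficients = {
    "CAT": 0.598,
    "smoking": 0.834,
    "hypertension": 0.439,
    "ECG": 0.369,
}

# Adjusted odds ratio of each variable: OR = exp(b)
adjusted = {name: math.exp(b) for name, b in coefficients.items()}

for name, value in adjusted.items():
    print(f"adjusted OR {name}: {value:.2f}")
```

Comparing these with the unadjusted values shows how strongly the crude association of each variable with CHD is influenced by the other variables in the model.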

Logistic Regression: Model (Backward) Selection

term                W        p
a (intercept)       -5.940   <0.001
b1 (CAT)             1.698   0.089
b2 (age)             2.123   0.034
b3 (CHL)             2.680   0.007
b4 (smoking)         2.734   0.006
b5 (hypertension)    1.509   0.131
b6 (ECG)             1.258   0.208

Logistic Regression: The Final Model

term             estimate   s.e.
a (intercept)    -7.027     1.107
b2 (age)          0.051     0.014
b3 (CHL)          0.007     0.003
b4 (smoking)      0.851     0.301

logit(π) = -7.027+0.051⋅x2+0.007⋅x3+0.851⋅x4

ORsmoke unadjusted: 1.96, adjusted: 2.34

Logistic Function (logit-1)

$$\mathrm{logit}^{-1}(x) = \frac{1}{1+\exp(-x)}$$

$$\pi = \frac{1}{1+\exp(-a - b_1 x_1 - b_2 x_2 - \dots - b_k x_k)}$$

[Figure: logit-1(x) plotted for x between -10 and 10, rising from 0.0 to 1.0]

The Evans County Heart Study

What is the 9-year CHD risk of a 45-year-old ever-smoker with a cholesterol level of 260 mg/dL?

x2=45, x3=260, x4=1

$$\pi = \frac{1}{1+\exp(7.027 - 0.051\cdot 45 - 0.007\cdot 260 - 0.851\cdot 1)} = \frac{1}{1+\exp(2.061)} = 0.113$$
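The same calculation as a short sketch, using the coefficients of the final model. The function name `chd_risk` and its parameter names are assumptions for illustration.

```python
import math


def chd_risk(age_years: float, cholesterol_mg_dl: float, ever_smoker: int) -> float:
    """9-year CHD risk from logit(pi) = -7.027 + 0.051*x2 + 0.007*x3 + 0.851*x4."""
    linear_predictor = (
        -7.027
        + 0.051 * age_years          # x2: age (years)
        + 0.007 * cholesterol_mg_dl  # x3: cholesterol (mg/dL)
        + 0.851 * ever_smoker        # x4: 1 = ever smoker, 0 = never smoker
    )
    # Logistic function applied to the linear predictor
    return 1.0 / (1.0 + math.exp(-linear_predictor))


risk = chd_risk(45, 260, 1)
print(f"9-year CHD risk: {risk:.3f}")
```

Calling the function with `ever_smoker=0` and the same age and cholesterol gives the corresponding risk for a never-smoker, which makes the effect of the smoking term visible on the probability scale.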

Logistic Regression: Screening Test

The comparison of the individual risk, π, with a given threshold, ρ, provides a "screening" test for the disease.

π > ρ: test positive

π ≤ ρ: test negative

Logistic Regression: Screening Test (ROC Curve)

[Figure: ROC curve with the operating point for ρ = 0.11 marked, where 1-sensitivity = 0.32 and specificity = 0.61]

ρ: 0.11
sensitivity: 0.68
specificity: 0.61
Youden's index: 0.29

baseline risk: 71/(71+538)=0.12
PPV: 0.19
NPV: 0.93

AUC: 0.68

"Triple Test" for Down Syndrome

The triple test is done between the 16th and 18th weeks of pregnancy. The test measures three substances, or markers, that are passed from the fetus and the placenta into the mother's bloodstream - AFP, human chorionic gonadotropin and unconjugated estriol. [...] A method was found to combine results of the three tests with a mother's age to identify women at increased risk for having a baby with Down's syndrome. Since that time, a number of studies have shown that the triple test can detect 60 to 70 percent of Down's syndrome cases. Because it is a screening test, the triple test identifies pregnancies that are at increased risk, or "screen-positive" for Down's syndrome. A positive result does not necessarily mean the baby is affected, but is only a signal for further testing.

American Society of Clinical Pathology (www.ascp.org)

Summary

- Statistical modelling entails the analysis of the functional relationship between response and explanatory variables.

- Experimental modelling is based upon prospective trials, addressing controlled explanatory variables. Observational modelling makes use of uncontrolled, observational data.

- Statistical modelling proceeds in multiple steps, including data exploration, followed by model formulation, selection and checking.

- The most commonly used class of statistical models is that of generalised linear models, encompassing (multiple) linear regression, analysis of variance and logistic regression.

- Multiple models 'adjust' the effect of explanatory variables for any bias introduced by disturbing factors.