Lecture 9: Hypothesis Testing

One-sample tests and tests for two or more samples

Hypothesis Testing for One-Sample

• Standard set-up

• What is θ?
• Common approach
– Assume distribution is exponential
– Test that distribution is exponential with θ = θ0

$$H_0: \theta = \theta_0 \qquad H_A: \theta \neq \theta_0$$

Pretty Stringent

• Actually, as long as the hazard is specified for the range of t, tests can be performed

$$H_0: h(t) = h_0(t) \qquad H_A: h(t) \neq h_0(t)$$

General Form of Test

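$$Z(\tau) = O(\tau) - E(\tau) = \sum_{i=1}^{D} W(t_i)\,\frac{d_i}{Y(t_i)} - \int_0^{\tau} W(s)\,h_0(s)\,ds$$

$$V[Z(\tau)] = \int_0^{\tau} W^2(s)\,\frac{h_0(s)}{Y(s)}\,ds$$

$$\frac{Z(\tau)}{\sqrt{V[Z(\tau)]}} \sim N(0,1)$$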

Log-Rank

• W(ti) = Y(ti)

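$$Z(\tau) = O(\tau) - E(\tau) = \sum_{i=1}^{D} d_i - \sum_{j=1}^{n} H_0(T_j)$$

$$V[Z(\tau)] = E(\tau) = \sum_{j=1}^{n} H_0(T_j)$$

$$\frac{\sum_{i=1}^{D} d_i - \sum_{j=1}^{n} H_0(T_j)}{\sqrt{\sum_{j=1}^{n} H_0(T_j)}} \sim N(0,1)$$

where H_0 is the cumulative hazard under the null and T_j is the observed time for subject j.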
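A minimal sketch of the one-sample log-rank statistic for the exponential case, where the null cumulative hazard is H_0(t) = λ0·t, so that E = V = Σ λ0·T_j and Z = (O − E)/√E. The function name `one_sample_logrank`, the data, and the value `lambda0 = 0.05` are all hypothetical and chosen only for illustration.

```r
# One-sample log-rank test against an exponential null, H0: h(t) = lambda0.
# Under H0 the cumulative hazard is H0(t) = lambda0 * t.
one_sample_logrank <- function(time, status, lambda0) {
  O <- sum(status)            # observed number of events
  E <- sum(lambda0 * time)    # expected events: sum of H0(T_j); also the variance
  Z <- (O - E) / sqrt(E)      # standardized statistic, approximately N(0, 1) under H0
  p <- 2 * pnorm(-abs(Z))     # two-sided p-value
  list(observed = O, expected = E, Z = Z, p = p)
}

# Hypothetical data, for illustration only (not from the lecture)
time   <- c(5, 8, 12, 3, 20, 7)
status <- c(1, 0, 1, 1, 0, 1)   # 1 = event, 0 = right-censored

res <- one_sample_logrank(time, status, lambda0 = 0.05)
print(res)
```

Because W(t) = Y(t), the expected count and the variance coincide, which is why only one sum of cumulative hazards is needed.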

Accounting for Left-Truncation

• Choice of weights is still W(t) = Y(t)

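$$O(\tau) = \sum_{i=1}^{D} d_i$$

$$E(\tau) = V[Z(\tau)] = \sum_{j=1}^{n} \left[H_0(T_j) - H_0(L_j)\right]$$

$$\frac{\sum_{i=1}^{D} d_i - \sum_{j=1}^{n} \left[H_0(T_j) - H_0(L_j)\right]}{\sqrt{\sum_{j=1}^{n} \left[H_0(T_j) - H_0(L_j)\right]}} \sim N(0,1)$$

where L_j is the entry (left-truncation) time for subject j.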

Other Options

• Harrington and Fleming
– Allows user to have flexibility in weighting
– Can choose early or late departures to be more influential
– Special case: Gehan-Wilcoxon
– Harrington DP and Fleming TR (1982). A class of rank test procedures for censored survival data. Biometrika 69, 553-566.
• Gatsonis
• Interesting aside
– Log-rank first introduced for one-sample testing by Breslow (1975)
– Extended to left-truncation by Hyde (1977) and Woolson (1981)

Notes

• An estimator of the variance, V, can be the empirical estimate rather than the hypothesized value

• When the alternative h(t) > h0(t) is true, this variance estimator is expected to be larger and the test less powerful

• If h(t) < h0(t) then this variance will be smaller and the test more powerful

Example: Rheumatoid Arthritis

• 10 white males with RA followed for up to 18 years

• Objective:
– Determine if men with RA are at greater risk of mortality

Entry Time   Exit Time   di
43           51          0
44           54          0
45           51          0
45           60          0
48           61          0
49           55          0
50           59          1
51           69          1
53           68          0
54           70          0
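The left-truncated one-sample log-rank test can be sketched on these ten subjects as follows. The slides do not give the reference hazard here, so the constant rate `h0 = 0.01` is a purely hypothetical placeholder. In practice H_0 would come from a population life table. The one-sided p-value reflects the stated objective (greater mortality than the reference).

```r
# RA data from the table: entry time L_j, exit time T_j, event indicator d_j
entry <- c(43, 44, 45, 45, 48, 49, 50, 51, 53, 54)
exit  <- c(51, 54, 51, 60, 61, 55, 59, 69, 68, 70)
d     <- c(0, 0, 0, 0, 0, 0, 1, 1, 0, 0)

# ASSUMPTION: constant reference hazard (hypothetical value, not from the lecture)
h0 <- 0.01
H0 <- function(t) h0 * t                   # cumulative hazard under the null

O <- sum(d)                                # observed deaths
E <- sum(H0(exit) - H0(entry))             # expected deaths; also the variance
Z <- (O - E) / sqrt(E)
p_upper <- pnorm(Z, lower.tail = FALSE)    # one-sided: excess mortality

cat("O =", O, " E =", E, " Z =", round(Z, 3), " one-sided p =", round(p_upper, 3), "\n")
```

Each subject contributes H_0(T_j) − H_0(L_j), the cumulative hazard accrued only over the interval in which that subject was actually under observation.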

Bone Marrow Transplant for Leukemia

• Patients undergoing bone marrow transplant (BMT) for acute leukemia

• Three types of leukemia
– ALL
– AML low risk
– AML high risk

• What if we are interested in the overall incidence rate (i.e., either relapse or death)?

BMT Example

• Want to test whether or not survival in BMT patients follows an exponential distribution
– What does this mean we are asking?

• Can estimate λ from the data (recall the MLE for an exponential distribution)

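R Code
### BMT example
library(survival)
data<-read.csv("H:\\BMTRY_722_Summer2013\\BMT_1_3.csv")

# Failure time: TTR by default; TTD when the patient died and TTD <= TTR
failtime<-ifelse((data$Relapse==0 & data$Death==0) | data$Relapse==1, data$TTR, NA)
failtime<-ifelse(data$Death==1 & data$TTR>=data$TTD, data$TTD, failtime)
# Event indicator: 1 if relapse or death, 0 otherwise
event<-ifelse(data$Relapse==1 | data$Death==1, 1, 0)

# Kaplan-Meier estimate of S(t)
st<-Surv(failtime, event)
fit<-survfit(st~1)
plot(fit, xlab="Time", ylab="S(t)", lwd=2)

# Calculating lambda hat (exponential MLE: events / total follow-up time)
lambda.hat<-sum(event)/sum(failtime)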

“survdiff” Function

Description
Tests if there is a difference between two or more curves using the G-rho family of tests, or for a single curve against a known alternative

Usage
survdiff(formula, data, subset, na.action, rho=0)

Arguments
formula: a formula expression as for other survival models, of the form Surv(time, status)~predictors. For a one-sample test, the predictors must consist of a single offset(sp) term, where sp is a vector giving the survival probability for each subject

“survdiff” Function

Method
This function implements the G-rho family of Harrington and Fleming (1982), with weights on each death of S(t)^rho, where S is the Kaplan-Meier estimate of survival. With rho=0 this is the log-rank or Mantel-Haenszel test, and with rho=1 it is equivalent to the Peto & Peto modification of the Gehan-Wilcoxon test.

If the right hand side of the formula consists only of an offset term, then a one-sample test is done. To cause the missing values in the predictors to be treated as a separate group, rather than being omitted, use a factor function with its exclude argument.

R code
# Estimating lambda
> lambda.hat<-sum(event)/sum(failtime)
# Expected S(t) = exp(-lambda.hat*t)
> S.exp<-exp(-lambda.hat*failtime)

> one.sample.test1<-survdiff(st~offset(S.exp))
> one.sample.test1
Observed Expected Z p
      83       83 0 1
> one.sample.test2<-survdiff(st~offset(S.exp), rho=1)
> one.sample.test2
Observed Expected Z       p
      83       83 0 0.00521
# Comparing hypothesized dist'n to empirical dist'n
> plot(fit, conf.int=FALSE, lwd=2)
> lines(sort(failtime), rev(sort(S.exp)), col=2, lwd=2, type="s")

R code
# Estimating lambda for failure times <800
> fail2<-failtime[which(failtime<800)]
> event2<-event[which(failtime<800)]
> lambda.hat2<-sum(event2)/sum(fail2)
# Expected S(t) = exp(-.004*t)
> S.exp2<-exp(-lambda.hat2*fail2)
> st2<-Surv(fail2, event2); fit2<-survfit(st2~1)

> one.sample.testa<-survdiff(st2~offset(S.exp2))
> one.sample.testa
Observed Expected Z p
      80       80 0 1
> one.sample.testb<-survdiff(st2~offset(S.exp2), rho=1)
> one.sample.testb
Observed Expected     Z     p
      80       80 0.000 0.477

R code
# Estimating lambda for failure times >=800
> fail3<-failtime[which(failtime>=800)]
> event3<-event[which(failtime>=800)]
> lambda.hat3<-sum(event3)/sum(fail3)
# Expected S(t) = exp(-.004*t)
> S.exp3<-exp(-lambda.hat3*fail3)
> st3<-Surv(fail3, event3); fit3<-survfit(st3~1)

> one.sample.testc<-survdiff(st3~offset(S.exp3))
> one.sample.testc
Observed Expected         Z p
       3        3 -2.56e-16 1
> one.sample.testd<-survdiff(st3~offset(S.exp3), rho=1)
> one.sample.testd
Observed Expected      Z      p
       3        3 -0.035 0.9730
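One way to read the repeated "Observed = Expected, Z = 0" results for rho=0: when the exponential rate is estimated from the same data by its MLE and then plugged into the one-sample log-rank formula from the earlier slides, the observed and expected counts agree by construction. The short derivation below is a sketch under that assumption (that the rho=0 test uses E = Σ H_0(T_j) with H_0(t) = λ̂t); it is not a statement about the internals of survdiff.

```latex
\hat\lambda = \frac{\sum_{j=1}^{n} d_j}{\sum_{j=1}^{n} T_j}
\quad\Longrightarrow\quad
E = \sum_{j=1}^{n} H_0(T_j)
  = \hat\lambda \sum_{j=1}^{n} T_j
  = \sum_{j=1}^{n} d_j
  = O ,
\qquad
Z = \frac{O - E}{\sqrt{E}} = 0 .
```

So with the plug-in MLE the log-rank version carries no information about fit, and only a differently weighted version (here rho=1) can depart from the null.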

Conclusions

• So what can we conclude about our original hypothesis?

Relevance

• Becoming more common
• Phase II cancer studies with TTE outcomes instead of response
• But
– Often more interested in median or 1-year survival
• Yet
– Very important for sample size considerations
– Most often assume study data will have exponential distribution for sample size

On to something more interesting… comparing two or more samples

Comparing two or more samples

• ANOVA-type approach

$$H_0: h_1(t) = h_2(t) = \cdots = h_K(t), \text{ for all } t \le \tau$$

$$H_A: \text{at least one of the } h_j(t) \text{ is different for some } t \le \tau$$

– Where τ is the largest time for which all groups have at least one subject at risk

• Data can be right-censored (and left-truncated) for the tests we will discuss

Notation

• Let t1 < t2 < … < tD be distinct death times in all samples being compared

• At time ti , let dij be the number of events in group j out of Yij individuals at risk (j = 1,2,…,K)

• Define

$$d_i = \sum_{j=1}^{K} d_{ij} \qquad Y_i = \sum_{j=1}^{K} Y_{ij}$$

Rationale

• Weighted comparisons of the estimated hazard of the jth population under the null hypothesis and alternative hypothesis

• Based on Nelson-Aalen estimator
• If the null is true, the pooled estimate of h(t) should be an estimator for hj(t)

Applying the Test

• Let Wj(t) be a positive weight function s.t. Wj(ti) = 0 if Yij = 0

$$Z_j(\tau) = \sum_{i=1}^{D} W_j(t_i)\left[\frac{d_{ij}}{Y_{ij}} - \frac{d_i}{Y_i}\right], \quad \text{for } j = 1, 2, \ldots, K$$

• If all Zj(τ)'s are close to zero, then there is little evidence to reject the null

Common Form for Weight Functions

• All commonly used tests choose weight functions s.t.

$$W_j(t_i) = Y_{ij}\,W(t_i)$$

• Note that the weight W(ti) is common across all j
• Can redefine Z:

$$Z_j(\tau) = \sum_{i=1}^{D} W(t_i)\left[d_{ij} - Y_{ij}\,\frac{d_i}{Y_i}\right]$$

Test Statistic

• Variance and covariance of Zj(τ) (K&M p. 207)

• Z1(τ), Z2(τ), ..., ZK(τ) are linearly dependent because their sum is 0

• For test statistic, choose K – 1 components
• Chi-square test with K – 1 d.f., where Σ is the variance-covariance matrix of the chosen components

$$\chi^2 = \left(Z_1(\tau), Z_2(\tau), \ldots, Z_{K-1}(\tau)\right)\,\Sigma^{-1}\,\left(Z_1(\tau), Z_2(\tau), \ldots, Z_{K-1}(\tau)\right)'$$
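For reference, the entries of the variance-covariance matrix for the common-weight form W_j(t_i) = Y_ij W(t_i) are usually written as below. This is a sketch from the standard presentation, not copied from the slide, so it should be checked against K&M p. 207. It is consistent with the two-group variance, which is the j = 1 diagonal term.

```latex
\hat\sigma_{jj} = \sum_{i=1}^{D} W(t_i)^2\,
  \frac{Y_{ij}}{Y_i}\left(1 - \frac{Y_{ij}}{Y_i}\right)
  \left(\frac{Y_i - d_i}{Y_i - 1}\right) d_i ,
  \qquad j = 1, \ldots, K

\hat\sigma_{jg} = -\sum_{i=1}^{D} W(t_i)^2\,
  \frac{Y_{ij}}{Y_i}\,\frac{Y_{ig}}{Y_i}
  \left(\frac{Y_i - d_i}{Y_i - 1}\right) d_i ,
  \qquad g \neq j
```

Dropping any one group leaves a (K − 1) × (K − 1) matrix, and the resulting chi-square value does not depend on which group is dropped.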

Log-Rank Test for 2 Groups

• For log-rank W(ti) = 1
• Have 2 groups and want to test if survival is the same in the groups

$$\text{Group 0}: T_i \sim F_0 \qquad \text{Group 1}: T_i \sim F_1$$

• We want to develop a nonparametric test of

$$H_0: F_0 = F_1 \iff H_0: S_0 = S_1 \iff H_0: h_0 = h_1$$

$$H_A: F_0 \neq F_1 \iff H_A: S_0 \neq S_1 \iff H_A: h_0 \neq h_1$$

Log-Rank Test for 2 Groups

• If $F_0$ and $F_1$ follow some parametric distribution and are in the same family, this is easy
• For example, assume

$$S_j(t) = e^{-\lambda_j t}$$

• But we need a test whose validity doesn't depend on parametric assumptions

Constructing the Log-Rank Test

• Recall our notation
– t1 < t2 < … < tD are D distinct ordered event times
– Yij = # of people in group j at risk at ti
– Yi = # of people at risk across groups at ti
– dij = # of people in group j that fail at ti
– di = # of people across groups that fail at ti

• We can summarize the information at time ti in a 2x2 table

            Fail     Don't Fail
Group 0     di0      Yi0 − di0
Group 1     di1      Yi1 − di1

Toy Example

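• Say we have the following data on two groups (superscript c marks a censored time):

$$\text{Group 0}: 3,\ 6^c,\ 9,\ 9,\ 11^c,\ 16$$

$$\text{Group 1}: 8,\ 9,\ 10^c,\ 12^c,\ 19,\ 23^c$$

• We want to test the hypothesis

$$H_0: h_0(t) = h_1(t) \qquad H_A: h_0(t) \neq h_1(t)$$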

Toy Example
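The log-rank calculation for the toy data can be sketched by hand in base R, building one row per distinct event time from the 2x2 table quantities and summing. The vectors are the ones used in the lecture's own R code (group 1 here is the slide's Group 0). The variable names `tab`, `O1`, `E1` and `V` are illustrative choices.

```r
# Toy data: time, event indicator (1 = event, 0 = censored), group label
time <- c(3, 6, 9, 9, 11, 16, 8, 9, 10, 12, 19, 23)
cens <- c(1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 0)
grp  <- c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2)

# Distinct ordered event times t_1 < ... < t_D
event_times <- sort(unique(time[cens == 1]))

# One row per event time: risk sets, events, expected events and variance term
tab <- t(sapply(event_times, function(ti) {
  Yi  <- sum(time >= ti)                          # at risk, both groups
  Yi1 <- sum(time >= ti & grp == 1)               # at risk, group 1
  di  <- sum(time == ti & cens == 1)              # events, both groups
  di1 <- sum(time == ti & cens == 1 & grp == 1)   # events, group 1
  Ei1 <- Yi1 * di / Yi                            # expected events in group 1
  Vi  <- if (Yi > 1) {
    (Yi1 / Yi) * (1 - Yi1 / Yi) * (Yi - di) / (Yi - 1) * di
  } else 0
  c(t = ti, Y = Yi, Y1 = Yi1, d = di, d1 = di1, E1 = Ei1, V = Vi)
}))
print(round(tab, 4))

O1    <- sum(tab[, "d1"])
E1    <- sum(tab[, "E1"])
V     <- sum(tab[, "V"])
chisq <- (O1 - E1)^2 / V
p     <- pchisq(chisq, df = 1, lower.tail = FALSE)
cat("O1 =", O1, " E1 =", round(E1, 6), " V =", round(V, 6),
    " chisq =", round(chisq, 6), " p =", round(p, 3), "\n")
```

The totals agree with the survdiff output reported in the lecture (observed 4, expected 2.566667, variance 1.267778, chi-square 1.620508).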

Same Test in R
> time<-c(3,6,9,9,11,16,8,9,10,12,19,23)
> cens<-c(1,0,1,1,0,1,1,1,0,0,1,0)
> grp<-c(1,1,1,1,1,1,2,2,2,2,2,2)
> grp<-as.factor(grp)
>
> sdat<-Surv(time, cens)
> survdiff(sdat~grp)
Call:
survdiff(formula = sdat ~ grp)

      N Observed Expected (O-E)^2/E (O-E)^2/V
grp=1 6        4     2.57     0.800      1.62
grp=2 6        3     4.43     0.463      1.62

 Chisq= 1.6  on 1 degrees of freedom, p= 0.203

Same Test in R
> toy<-survdiff(sdat~grp)
> names(toy)
[1] "n" "obs" "exp" "var" "chisq" "call"

> toy$obs
[1] 4 3

> toy$exp
[1] 2.566667 4.433333

> toy$var
          [,1]      [,2]
[1,]  1.267778 -1.267778
[2,] -1.267778  1.267778

> toy$chisq
[1] 1.620508

UMP Tests

More general: 2 samples

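• We can change the weight function
• For K = 2, can use Z-score or χ²

$$Z = \frac{\sum_{i=1}^{D} W(t_i)\left[d_{i1} - Y_{i1}\,\frac{d_i}{Y_i}\right]}{\sqrt{\sum_{i=1}^{D} W(t_i)^2\,\frac{Y_{i1}}{Y_i}\left(1 - \frac{Y_{i1}}{Y_i}\right)\left(\frac{Y_i - d_i}{Y_i - 1}\right) d_i}} \sim N(0,1)$$

– The factor (Yi − di)/(Yi − 1) corrects for ties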

Choice for Weight Functions

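• W(ti) = 1
– Log-rank test
– Optimal power for detecting differences when hazards are proportional

$$Z_j(\tau) = \sum_{i=1}^{D}\left[d_{ij} - Y_{ij}\,\frac{d_i}{Y_i}\right]$$

• W(ti) = Yi
– Gehan test
– Generalization of the 2-sample Mann-Whitney-Wilcoxon test

$$Z_j(\tau) = \sum_{i=1}^{D} Y_i\left[d_{ij} - Y_{ij}\,\frac{d_i}{Y_i}\right]$$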

Choices for Weight Functions

• Fleming-Harrington
– General case

$$W_{p,q}(t_i) = \hat S(t_{i-1})^{p}\left[1 - \hat S(t_{i-1})\right]^{q}$$

– Special cases
  • Log-rank: p = q = 0
  • Mann-Whitney-Wilcoxon: p = 1, q = 0
  • q = 0, p > 0: gives greater weight to early departures
  • p = 0, q > 0: gives greater weight to late departures
– Allows specific choice of influence (for better or worse!)
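A sketch of how the weight choices plug into the two-sample Z statistic, using the toy data from earlier in the lecture. The function `weighted_two_sample` and its argument names are hypothetical. It is written directly from the slide formulas, with the pooled Kaplan-Meier estimate evaluated just before each event time for the Fleming-Harrington weights, and it is not a wrapper around any package routine.

```r
# Two-sample weighted log-rank-type test (K = 2), following the slide formulas.
# weight = "logrank": W = 1; "gehan": W = Y_i; "fh": W = S(t_{i-1})^p * (1 - S(t_{i-1}))^q
weighted_two_sample <- function(time, status, group,
                                weight = c("logrank", "gehan", "fh"),
                                p = 0, q = 0) {
  weight <- match.arg(weight)
  g1 <- group == sort(unique(group))[1]     # indicator for the first group
  event_times <- sort(unique(time[status == 1]))
  S_prev <- 1                               # pooled KM estimate just before t_i
  num <- 0
  V <- 0
  for (ti in event_times) {
    Yi  <- sum(time >= ti)
    Yi1 <- sum(time >= ti & g1)
    di  <- sum(time == ti & status == 1)
    di1 <- sum(time == ti & status == 1 & g1)
    W <- switch(weight,
                logrank = 1,
                gehan   = Yi,
                fh      = S_prev^p * (1 - S_prev)^q)
    num <- num + W * (di1 - Yi1 * di / Yi)
    if (Yi > 1) {                           # variance term, with the tie correction
      V <- V + W^2 * (Yi1 / Yi) * (1 - Yi1 / Yi) * (Yi - di) / (Yi - 1) * di
    }
    S_prev <- S_prev * (1 - di / Yi)        # update pooled KM for the next event time
  }
  Z <- num / sqrt(V)
  c(Z = Z, chisq = Z^2, p.value = 2 * pnorm(-abs(Z)))
}

# Toy data from the lecture
time <- c(3, 6, 9, 9, 11, 16, 8, 9, 10, 12, 19, 23)
cens <- c(1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 0)
grp  <- c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2)

lr    <- weighted_two_sample(time, cens, grp, "logrank")
gehan <- weighted_two_sample(time, cens, grp, "gehan")
fh00  <- weighted_two_sample(time, cens, grp, "fh", p = 0, q = 0)
fh10  <- weighted_two_sample(time, cens, grp, "fh", p = 1, q = 0)
print(rbind(logrank = lr, gehan = gehan, fh_p0_q0 = fh00, fh_p1_q0 = fh10))
```

With p = q = 0 the Fleming-Harrington weight is 1 at every event time, so that row reproduces the log-rank result. Comparing the rows is one way to see how much the inference depends on the weighting.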

Others?

• Many
• Not all available in all software (e.g., Gehan not in R)
• Worth trying a few in each situation to compare inferences

Caveat

• Note we are interested in the average difference (consider log-rank specifically)

• What if hazards cross?
• Could have a significant difference prior to some t, and another significant difference after t: but what if the direction differs?

Next time

• More on different weight functions
• Tests for trends
