JHU Job Talk

A General Framework for Multiple Testing Dependence

Jeffrey Leek Johns Hopkins University School of Medicine

High-dimensional multiple hypothesis testing is common.

Problem: Dependence between tests can result in incorrect statistical and scientific results.

A solution: Define and address multiple testing dependence at the level of the data – not the P-values.

Big Picture Ideas

High-Dimensional Multiple Testing Is Common

[Figure: example applications: Spatial Epidemiology, Brain Imaging, Molecular Biology]

Inflammation and the Host Response to Injury

mRNA expression: ~50,000 genes

Clinical data: >150 clinical variables

Patients: Patient 1, Patient 2, …, Patient 166

Data at initial time point

Multiple Organ Failure (MOF): measures severity of injury

Simple Analysis

1. Fit the model to the data, $x_i$, for gene i:

$x_i = a_i + b_i \mathrm{MOF} + e_i$

2. Calculate P-values for testing the hypotheses:

$H_0: b_i = 0$ vs. $H_1: b_i \neq 0$
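The two steps above can be sketched in a few lines. This is a minimal illustration on simulated data, not the analysis behind the talk: the function name `gene_pvalues`, the simulated MOF scores, and the normal approximation to the t reference distribution are all assumptions made here.

```python
import math
import numpy as np

def gene_pvalues(X, mof):
    """Two-sided P-values for H0: b_i = 0, one per row (gene) of X."""
    n = X.shape[1]
    design = np.column_stack([np.ones(n), mof])            # n x 2: intercept, MOF
    coef, *_ = np.linalg.lstsq(design, X.T, rcond=None)    # 2 x m: rows a_i, b_i
    resid = X.T - design @ coef                            # n x m residuals
    sigma2 = (resid ** 2).sum(axis=0) / (n - 2)            # residual variance per gene
    se_b = np.sqrt(sigma2 * np.linalg.inv(design.T @ design)[1, 1])
    t = coef[1] / se_b
    # Assumption: normal approximation to the t distribution (reasonable for large n).
    return np.array([math.erfc(abs(v) / math.sqrt(2.0)) for v in t])

rng = np.random.default_rng(0)
mof = rng.normal(size=100)           # hypothetical MOF scores for 100 patients
X = rng.normal(size=(1000, 100))     # 1000 simulated genes, all null so far
X[:50] += 1.0 * mof                  # make the first 50 genes depend on MOF
p = gene_pvalues(X, mof)
```

Each gene is tested on its own here, which is exactly the step where dependence between genes is ignored.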


Four “Replicated” Studies

[Figure: P-value histograms (Frequency vs. P-value) for Phase 1, Phase 2, Phase 3, and Phase 4]

• Data for test i: $x_i = (x_{i1}, x_{i2}, \ldots, x_{in})$

• “Primary variable(s)”: $Y = (y_1, y_2, \ldots, y_n)$

• Model: $x_{ij} = a_i + \sum_{k=1}^{d} b_{ik} s_k(y_j) + e_{ij}$

• Hypothesis test i: $H_{0i}: b_i \in \Omega_0$ vs. $H_{1i}: b_i \in \Omega_1$

{m hypothesis tests, n observations per test}

Start With The Whole Data

Underlying model: $X = B\,S(Y) + E$, where the rows of X are tests and the columns are observations.

A Simple Simulated Example

[Figure: heatmaps of simulated data (Genes × Arrays) for Independent E and Dependent E]

Null P-Value Distributions

[Figure: null P-value histograms (Frequency vs. P-value) for Independent E and Dependent E]


Null P-Value Distributions

[Figure: null P-value histograms (Frequency vs. P-value) for Independent E and Dependent E; Dependent E panels labeled by correlation: |ρ| = 0.40, |ρ| = 0.31, |ρ| = 0.10, |ρ| = 0.00]
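The behavior in these histograms can be reproduced with a small simulation. The sketch below is an assumption-laden illustration rather than the simulation used in the talk: dependence is induced by a single shared latent factor, the helper names are invented here, and P-values use a normal approximation. It tracks the fraction of null P-values below 0.05 across repeated data sets.

```python
import math
import numpy as np

def null_pvalues(E, y):
    """P-values from regressing each row of pure-noise data E on y (normal approximation)."""
    n = E.shape[1]
    yc = y - y.mean()
    Ec = E - E.mean(axis=1, keepdims=True)
    b = Ec @ yc / (yc @ yc)                       # per-row slope estimates
    resid = Ec - np.outer(b, yc)
    se = np.sqrt((resid ** 2).sum(axis=1) / (n - 2) / (yc @ yc))
    return np.array([math.erfc(abs(t) / math.sqrt(2.0)) for t in b / se])

def fraction_below(rng, dependent, m=1000, n=20, alpha=0.05):
    """Fraction of null P-values below alpha in one simulated data set."""
    y = np.repeat([0.0, 1.0], n // 2)             # two-group primary variable
    E = rng.normal(size=(m, n))
    if dependent:
        g = rng.normal(size=n)                    # shared latent factor (assumed form of dependence)
        gamma = rng.normal(size=(m, 1))           # per-test loadings on the factor
        E = E + gamma * g
    return np.mean(null_pvalues(E, y) < alpha)

rng = np.random.default_rng(1)
indep = np.array([fraction_below(rng, False) for _ in range(200)])
dep = np.array([fraction_below(rng, True) for _ in range(200)])
```

Under independence the fraction stays close to its nominal level from data set to data set; under dependence it swings widely, depending on how the latent factor happens to line up with the primary variable in each data set.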



Null Distribution Behavior

[Figure panels: Dependent E; Independent E]

False Discovery Rate Estimates


Ranking Estimates


When To Address Dependence?

Data X → Fit model $X = BS + E$ → Obtain $\hat{B}$ and R → Form test statistics and null distribution → Calculate P-values → Form P-value threshold

Existing Approaches

[Diagram: Data X → Fit model $X = BS + E$ → Obtain $\hat{B}$ and R → Null distribution → Calculate P-values]

Empirical null approaches modify the null distribution at the test-statistic level

Dependence adjustments conservatively modify the P-value threshold

Examples of Existing Approaches

• Empirical Null
– Devlin and Roeder, Biometrics (1999)
– Efron, JASA (2004)
– Schwartzman, AOAS (2008)

• Error Rate Adjustments
– Benjamini and Yekutieli, Annals of Statistics (2001)
– Romano, Shaikh, and Wolf, Test (2001)
– Dudoit, Gilbert, and van der Laan, Biometrical Journal (2008)

Our Approach

[Diagram: Data X → Fit model $X = BS + E$ → Obtain $\hat{B}$ and R → Null distribution → Calculate P-values]

Fit the model: X = BS + ΓG + U

where G is a valid dependence kernel

Dependence and bias are no longer present at any of these steps; standard methods can be used.


New Dependence Definitions

Definition – Data X are population-level multiple testing dependent if:

Definition - Data X are estimation-level multiple testing dependent if:

Leek and Storey (2008)

Structure in E

[Figure: heatmaps (Genes × Array, with MOF1) of Signal + Dependent Noise, Dependent Noise, and Independent Noise]

$X = BS + E$, where X is the data (tests × observations), S holds the primary variables, and E is the random variation.

Decomposing E

$X = BS + H + U$, where X is the data (tests × observations), S holds the primary variables, H is the dependent variation, and U is the independent variation.

Decomposing E

$X = BS + \Gamma G + U$, where $H = \Gamma G$ and G is the dependence kernel.

Decomposing E

Theorem. Let the data be distributed according to the model:

$X = BS + E$

Suppose that for each $e_i$ there is no Borel measurable function, g, such that $e_i = g(e_1, \ldots, e_{i-1}, e_{i+1}, \ldots, e_m)$ almost surely. Then there exist matrices $\Gamma$ (m×r), G (r×n) (r ≤ n) and U (m×n) such that:

$X = BS + \Gamma G + U$

where the rows of U are independent, $u_i \neq 0$, and $u_i = h_i(e_i)$ for a non-random Borel measurable function $h_i$.


Dependence Kernel


Definition – Dependence Kernel. An $r \times n$ matrix G forms a dependence kernel for the data X if the following equality holds:

$X = BS + E = BS + \Gamma G + U$, where the rows of U are independent.

Fitting S & G Results In Independent Tests


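Theorem. Let G be any valid dependence kernel for the data X. Suppose that the model:

$x_i = b_i S + \gamma_i G + u_i$

is fit by least squares resulting in residuals:

$r_i = x_i - \hat{b}_i S - \hat{\gamma}_i G$

If the row space jointly spanned by S and G has dimension less than n, then the $r_i$ and the $\hat{b}_i$ are jointly independent given S and G and:

$\Pr(r_1, \ldots, r_m \mid S, G) = \prod_{i=1}^{m} \Pr(r_i \mid S, G)$, $\Pr(\hat{b}_1, \ldots, \hat{b}_m \mid S, G) = \prod_{i=1}^{m} \Pr(\hat{b}_i \mid S, G)$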
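A numerical illustration of this result, under assumptions chosen here (one simulated kernel row, Gaussian noise, and the true G treated as known): residuals from fitting S alone stay correlated across tests, while residuals from fitting S and G together do not. The helper names are hypothetical.

```python
import numpy as np

def residuals(X, design_rows):
    """Least-squares residuals of each row of X on the rows of design_rows (k x n)."""
    coef, *_ = np.linalg.lstsq(design_rows.T, X.T, rcond=None)
    return X - (design_rows.T @ coef).T

def mean_abs_corr(R):
    """Average absolute correlation between distinct rows (tests) of R."""
    C = np.corrcoef(R)
    off_diagonal = C[~np.eye(C.shape[0], dtype=bool)]
    return np.abs(off_diagonal).mean()

rng = np.random.default_rng(2)
m, n = 200, 50
S = np.vstack([np.ones(n), np.repeat([0.0, 1.0], n // 2)])   # intercept + primary variable
G = rng.normal(size=(1, n))                                  # simulated dependence kernel
B = rng.normal(size=(m, 2))
Gamma = 2.0 * rng.normal(size=(m, 1))
U = rng.normal(size=(m, n))                                  # independent rows
X = B @ S + Gamma @ G + U

corr_s_only = mean_abs_corr(residuals(X, S))                 # G left in the residuals
corr_s_and_g = mean_abs_corr(residuals(X, np.vstack([S, G])))
```

With only 50 observations the sample correlations never reach exactly zero, so `corr_s_and_g` reflects sampling noise alone, while `corr_s_only` also carries the shared factor.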

A “Blessing” of Dimensionality

$X = BS + \Gamma G + U$ (rows: tests; columns: observations; G: dependence kernel)

Iteratively Reweighted Surrogate Variable Analysis

Whole data: $X = BS + \Gamma G + U$

Test i data: $x_i = b_i S + \gamma_i G + u_i$

1. Estimate the row dimension, $\hat{r}$, of G.

2. Form an initial estimate $\hat{G}^{(0)}$ equal to the first $\hat{r}$ right singular vectors of $R = X - \hat{B}S$.

Iterate for b = 0, …, B:

3. Estimate the probability weight, Pr(G & !S), for each test i.

4. Weight the ith row of X by its estimated probability weight and set $\hat{G}^{(b+1)}$ to be the first $\hat{r}$ right singular vectors of the weighted matrix.
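The loop structure of the algorithm can be sketched as follows. This is only a skeleton under loud assumptions: the weights come from a crude partial-R² stand-in (`standin_weights`, invented here) rather than the posterior probabilities used in the talk, the row dimension is taken as known, and the weighted rows are centered before the SVD so the intercept does not dominate.

```python
import numpy as np

def residual_ss(X, design_rows):
    """Residuals and row-wise residual sums of squares after least squares on design_rows."""
    coef, *_ = np.linalg.lstsq(design_rows.T, X.T, rcond=None)
    resid = X - (design_rows.T @ coef).T
    return resid, (resid ** 2).sum(axis=1)

def top_right_singular_vectors(M, r):
    _, _, vt = np.linalg.svd(M, full_matrices=False)
    return vt[:r]

def standin_weights(X, S, S0, G_hat):
    """Hypothetical stand-in for Pr(G & !S); NOT the estimator described in the talk."""
    _, ss_null = residual_ss(X, S0)
    _, ss_g = residual_ss(X, np.vstack([S0, G_hat]))
    _, ss_full = residual_ss(X, np.vstack([S, G_hat]))
    frac_g = 1.0 - ss_g / ss_null      # share of variation explained by G beyond the intercept
    frac_s = 1.0 - ss_full / ss_g      # share explained by the primary variable beyond G
    return np.clip(frac_g * (1.0 - frac_s), 0.0, 1.0)

def irw_sva_sketch(X, S, S0, r_hat, n_iter=5):
    resid, _ = residual_ss(X, S)                          # R = X - B_hat S
    G_init = top_right_singular_vectors(resid, r_hat)     # initial estimate of G
    G_hat = G_init
    for _ in range(n_iter):
        w = standin_weights(X, S, S0, G_hat)
        Xw = w[:, None] * X                               # weight the ith row of X
        Xw = Xw - Xw.mean(axis=1, keepdims=True)          # assumption: center rows first
        G_hat = top_right_singular_vectors(Xw, r_hat)
    return G_init, G_hat

rng = np.random.default_rng(3)
m, n = 500, 40
y = np.repeat([0.0, 1.0], n // 2)
S0 = np.ones((1, n))                                      # intercept-only null model
S = np.vstack([S0, y])
G = 0.6 * (y - y.mean()) / y.std() + 0.8 * rng.normal(size=n)   # kernel correlated with y
B = np.zeros((m, 2))
B[:, 0] = rng.normal(size=m)
B[:100, 1] = 2.0 * rng.normal(size=100)                   # tests 0-99 depend on y
Gamma = np.zeros((m, 1))
Gamma[100:400, 0] = 1.5 * rng.normal(size=300)            # tests 100-399 depend on G
X = B @ S + Gamma @ G[None, :] + rng.normal(size=(m, n))

G_init, G_hat = irw_sva_sketch(X, S, S0, r_hat=1)
corr_init = abs(np.corrcoef(G_init[0], G)[0, 1])
corr_final = abs(np.corrcoef(G_hat[0], G)[0, 1])
```

The initial estimate is orthogonal to S by construction, so it cannot capture the part of G that is correlated with the primary variable; the reweighting step is what lets the estimate recover it.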

An Example of the IRW-SVA Algorithm

[Figure panels: The Data; True G; Estimate of G; Pr(G & !S)]


Estimating The Row Dimension of G

1. Buja and Eyuboglu (1992) proposed a permutation approach.

2. Patterson, Price, and Reich (2006) proposed a sequential testing strategy based on Tracy-Widom theory.

3. Leek (in preparation) proposes an eigenvalue estimator that is consistent in the number of tests.

1. Assume the data follow $X = BS + \Gamma G + U$, where G and S have row dimensions r and d, with r + d < n.

2. Calculate the singular values $s_1, \ldots, s_n$ of X and choose b such that r + d < b.

3. Calculate the eigenvalues $\lambda_1, \ldots, \lambda_n$ of $\left[\frac{1}{m} R^T R - s_b^2 P\right]$, where $P = I - S^T (S S^T)^{-1} S$ and $R = XP$.

4. Set $\hat{r} = \sum_{j=1}^{n} 1\left(\lambda_j > m^{-1/3}\right)$.

Theorem. As $m \to \infty$,

$\hat{r} = \sum_{j=1}^{n} 1\left(\lambda_j > m^{-1/3}\right)$

is a consistent estimate of the row dimension of G, provided that: (1) the $u_{ij}$ are independent, (2) $E[u_{ij}] = 0$, (3) $E[u_{ij}^2] = \sigma_i^2 < M_1$, (4) $E[u_{ij}^4] < M_2$, (5) $\lim_{m \to \infty} \frac{1}{m} \Gamma^T \Gamma$ is positive definite with unique eigenvalues.

Leek (In Prep.)



Break The Estimation Into Two Components

Estimating the Probability Weights

1. Form F-statistics $F_1, \ldots, F_m$ for testing the hypotheses $H_{0i}: b_i = 0$ vs. $H_{1i}: b_i \neq 0$.

2. Bootstrap from the conditional null model to obtain null statistics $F_1^{0k}, \ldots, F_m^{0k}$, k = 1, …, K.

3. From Bayes’ Theorem:

$\Pr(b_i \neq 0 \mid F_i) = 1 - \frac{\pi_0 g_0(F_i)}{\pi_0 g_0(F_i) + (1 - \pi_0) g_1(F_i)}$

where $F_i^{0k} \sim g_0$ and $F_i \sim \pi_0 g_0 + (1 - \pi_0) g_1$.

4. Estimate the ratio of the densities with a non-parametric logistic regression where the $F_i$ are “successes” and the $F_i^{0k}$ are “failures” (Anderson and Blair 1982).

5. Estimate $\pi_0$ according to Storey (2002).

The result is an estimate of the posterior probability that $b_i \neq 0$.

SVA-Adjusted Analysis

1. Estimate G with IRW-SVA.

2. Fit $X = BS + \Gamma \hat{G} + U$.

3. Test the hypotheses $H_{0i}: b_i \in \Omega_0$ vs. $H_{1i}: b_i \in \Omega_1$.
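Steps 2 and 3 can be sketched on simulated null data. Several assumptions are made here for illustration: the true G stands in for the IRW-SVA estimate from step 1, every $b_i$ is zero, the helper `coef_pvalues` is invented for this example, and P-values use a normal approximation.

```python
import math
import numpy as np

def coef_pvalues(X, design_rows, j):
    """P-values for coefficient j when each row of X is regressed on design_rows (k x n)."""
    D = design_rows.T
    n, k = D.shape
    coef, *_ = np.linalg.lstsq(D, X.T, rcond=None)
    resid = X.T - D @ coef
    sigma2 = (resid ** 2).sum(axis=0) / (n - k)
    se = np.sqrt(sigma2 * np.linalg.inv(D.T @ D)[j, j])
    # Assumption: normal approximation in place of the exact t reference distribution.
    return np.array([math.erfc(abs(v) / math.sqrt(2.0)) for v in coef[j] / se])

rng = np.random.default_rng(4)
m, n = 1000, 40
y = np.repeat([0.0, 1.0], n // 2)
S = np.vstack([np.ones(n), y])
# Simulated kernel that is correlated with the primary variable.
G = (0.6 * (y - y.mean()) / y.std() + 0.8 * rng.normal(size=n))[None, :]
Gamma = rng.normal(size=(m, 1))
intercepts = rng.normal(size=(m, 1)) * np.ones(n)
X = intercepts + Gamma @ G + rng.normal(size=(m, n))      # every b_i = 0

p_unadjusted = coef_pvalues(X, S, j=1)                    # model X = BS + E
p_adjusted = coef_pvalues(X, np.vstack([S, G]), j=1)      # model X = BS + Gamma G + U
frac_unadjusted = np.mean(p_unadjusted < 0.05)
frac_adjusted = np.mean(p_adjusted < 0.05)
```

Because all tests are null, the fraction of P-values below 0.05 should sit near 0.05; in this setup the unadjusted fit is far above that, while the fit that includes G is close to nominal.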

A Simple Simulated Example

[Figure: heatmaps of simulated data (Genes × Arrays)]

Null Distribution Behavior

[Figure panels: Dependent E; Independent E; Dependent E + IRW-SVA]

False Discovery Rate Estimates

[Figure: Q-value vs. True False Discovery Rate; panels: Independent E, Dependent E, Dependent E + IRW-SVA]

Ranking Estimates

[Figure: Average Ranking by T-Statistic vs. Ranking by True Signal to Noise; panels: Independent E, Dependent E, Dependent E + IRW-SVA]

Inflammation and the Host Response to Injury

mRNA expression: ~50,000 genes

Clinical data: >150 clinical variables

Patients: Patient 1, Patient 2, …, Patient 166

MOF1: measures severity of injury

Four “Replicated” Studies

[Figure: P-value histograms (Frequency vs. P-value) for Phase 1, Phase 2, Phase 3, and Phase 4]

Functional Enrichment Across Phases

[Figure: percent of total significant pathways vs. number of phases in which a significant pathway appears (1 of 4, 2 of 4, 3 of 4, 4 of 4); Unadjusted vs. IRW-SVA Adjusted]

Summary

• High-dimensional hypothesis testing is common.

• Dependence between tests can result in incorrect statistical and scientific inference.

• We can define and address dependence at the level of the model using the dependence kernel.

• IRW-SVA can be used to improve inference in high-dimensional multiple hypothesis testing.

Future Work

• Multiple Testing
– Develop dependence kernel estimates for spatial data
– Develop diagnostic tests for multiple testing procedures

• High-Dimensional Asymptotics
– Extend methods for asymptotic SVD to binary data

• Feature Selection for High-Dimensional Classifiers
– Extensions of top-scoring pairs (TSP) to survival data
– Theoretical connections to LDA and SVM
– Embedding TSP in a logic regression framework

Thank You

1. Calculate the residuals $R = X - \hat{B}S$.

2. Calculate the singular values of R, $d_1, \ldots, d_n$.

For k = 1, …, K do steps 3-4:

3. Permute each row of R individually to get $R_0$.

4. Take the SVD of the residuals $R^* = R_0 - \hat{B}_0 S$ to obtain null singular values $d_i^{0k}$.

5. Compare $d_i$ to $d_i^{0k}$ for k = 1, …, K to calculate a P-value for the ith right singular vector.

Buja and Eyuboglu (1992)
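The permutation steps above can be sketched as follows. This follows the slide's wording (comparing raw singular values) on simulated data with one true kernel row; the function names, the simulated data, and the add-one P-value formula are assumptions made for illustration.

```python
import numpy as np

def residualize(X, S):
    """Residuals R = X - B_hat S from least squares on the rows of S."""
    coef, *_ = np.linalg.lstsq(S.T, X.T, rcond=None)
    return X - (S.T @ coef).T

def permutation_pvalues(X, S, K=50, seed=0):
    """Permutation P-value for each singular value of the residual matrix."""
    rng = np.random.default_rng(seed)
    R = residualize(X, S)                                   # step 1
    d = np.linalg.svd(R, compute_uv=False)                  # step 2
    exceed = np.zeros(d.shape)
    for _ in range(K):
        order = rng.random(R.shape).argsort(axis=1)         # independent permutation per row
        R0 = np.take_along_axis(R, order, axis=1)           # step 3
        d0 = np.linalg.svd(residualize(R0, S), compute_uv=False)   # step 4
        exceed += d0 >= d
    return (exceed + 1.0) / (K + 1.0)                       # step 5, add-one P-values

rng = np.random.default_rng(6)
m, n = 300, 30
S = np.vstack([np.ones(n), np.repeat([0.0, 1.0], n // 2)])
G = rng.normal(size=(1, n))                                 # one true kernel row
X = rng.normal(size=(m, 2)) @ S + 2.0 * rng.normal(size=(m, 1)) @ G + rng.normal(size=(m, n))
p_sv = permutation_pvalues(X, S)
```

Permuting within rows breaks the structure shared across tests while keeping each row's overall variability, so a singular value that stays large relative to its permuted counterparts signals a real row of G.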

Why Does This Work?

Leek and Storey (2007), Leek and Storey (2008)

Useful Fact:

$X = BS + E = BS + \Gamma G + U = BS + \Lambda H + U$ if G and H have the same row space.
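A quick numerical check of this fact, assuming H = AG for an invertible matrix A (so that the rows of H and G span the same space). The data and the helper `fit` are simulated and hypothetical.

```python
import numpy as np

def fit(X, design_rows):
    """Least-squares coefficients (one row per test) and fitted values."""
    coef, *_ = np.linalg.lstsq(design_rows.T, X.T, rcond=None)
    return coef.T, (design_rows.T @ coef).T

rng = np.random.default_rng(5)
m, n = 50, 20
S = np.vstack([np.ones(n), rng.normal(size=n)])    # 2 x n primary variables
G = rng.normal(size=(2, n))                        # a kernel with two rows
A = np.array([[2.0, 1.0], [0.0, 1.0]])             # any invertible 2 x 2 matrix
H = A @ G                                          # different rows, same span as G
X = rng.normal(size=(m, n))

coef_G, fitted_G = fit(X, np.vstack([S, G]))
coef_H, fitted_H = fit(X, np.vstack([S, H]))
```

The fitted values and the estimates of B agree between the two fits; only the kernel coefficients change. This is why G only needs to be estimated up to an invertible linear transformation.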

• References:

Benjamini Y and Hochberg Y. (1995) “Controlling the false discovery rate: a practical and powerful approach to multiple testing.” JRSSB, 57: 289-300.

De Castro MC, Monte-Mor RL, Sawyer DO, and Singer BH. (2006) “Malaria risk on the Amazon frontier.” PNAS, 103: 2452-2457.

Devlin B and Roeder K. (1999) “Genomic control for association studies.” Biometrics, 55: 997-1004.

Efron B. (2004) “Large-scale simultaneous hypothesis testing: The choice of a null hypothesis.” JASA, 99: 96-104.

Leek JT and Storey JD. (2008) “A general framework for multiple testing dependence.” PNAS, 105: 18718-18723.

Leek JT and Storey JD. (2007) “Capturing heterogeneity in gene expression studies by ‘Surrogate Variable Analysis’.” PLoS Genetics, 3: e161.

Taylor JE and Worsley KJ. (2007) “Detecting sparse signals in random fields, with applications to brain mapping.” JASA, 102: 913-928.


1. Perform each hypothesis test individually.

2. Obtain the test-statistic for each test.

3. Compare distribution of test-statistics to the theoretical null distribution.

4. Adjust theoretical null so that it matches the observed statistics in a low signal region.

[Figure: empirical null vs. theoretical null distributions; Efron (2004)]

Empirical Null Results in Incorrect Null Distribution

Dep. Kernel

With Confounding Empirical Null is Ill-Posed

• Observed statistics or observed P-values come from a mixture distribution: $\pi_0 g_0 + \pi_1 g_1$

• Dependence distorts $g_0$, and the distortion can go either way

• Must use the full data set to capture dependence
