
C2BAT: Using the same data set for screening and testing.

A testing strategy for genome-wide association studies in case/control design

Matt McQueen, Jessica Su, Nan Laird and Christoph Lange

Harvard School of Public Health

Genome-wide association studies

Limitation of linkage analysis and the potential of association analysis

=>

genome-wide association studies

(Risch & Merikangas 1996)

>100,000 SNPs and phenotypes are tested for association.

Statistical road block:

Severe multiple testing problem!!!

“Using the same data set for screening and testing”

• Testing strategy:

– Assess evidence for association for all SNPs based on S (Screening Step)

– Select a small subset of N markers (10-200)

– Compute the association test conditional upon S and adjust for N comparisons (Testing Step)

– If the screening step and the testing step are statistically independent, we can look at the data in the screening step without paying a “statistical price” for it.
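To make the benefit of testing only N selected markers concrete, the sketch below compares Bonferroni per-test thresholds for a full scan and for a screened subset. The family-wise error rate of 0.05 and the helper name `bonferroni_threshold` are assumptions for illustration; the slides do not state them.

```python
# Per-test significance thresholds under a Bonferroni correction.
ALPHA = 0.05  # assumed family-wise error rate (not stated on the slides)


def bonferroni_threshold(n_tests, alpha=ALPHA):
    """Per-test p-value threshold that keeps the family-wise error rate at alpha."""
    return alpha / n_tests


# Every SNP in a 100,000-SNP scan is tested.
genome_wide = bonferroni_threshold(100_000)

# Only the N markers kept after the screening step are tested.
after_screening = {n: bonferroni_threshold(n) for n in (10, 200)}

print(f"all 100,000 SNPs: {genome_wide:.1e}")
for n, thr in after_screening.items():
    print(f"top {n} SNPs: {thr:.1e}")
```

With N between 10 and 200, the per-test threshold is several orders of magnitude less strict than for the full scan, provided the screening step really is independent of the testing step.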

Screening technique S

Testing statistic T


General concept proposed by Laird and Lange (2006, Nat Rev Genet)

Decomposition of joint-likelihood:

P( {phenotype, genotype} ) = P( {phenotype, genotype} | S({phenotype, genotype}) ) * P( S({phenotype, genotype}) )

• S = “Summary test statistic to assess evidence for association”

• Requirements for S:

– The association test has to condition on S

– S has to contain information about the potential association as well

P( S({phenotype, genotype}) ) = Screening Step

P( {phenotype, genotype} | S({phenotype, genotype}) ) = Testing Step

• Testing strategy:

– Assess evidence for association for all SNPs based on S (Screening Step)

– Select a small subset of N markers (10-200)

– Compute the association test conditional upon S and adjust for N comparisons (Testing Step)

– The screening step and the testing step are statistically independent!


Application to family-based association tests (Van Steen et al (2005))

Decomposition of joint-likelihood:

P( {phenotype, genotype, parental genotype} ) = P( {phenotype, genotype} | {phenotype, parental genotype} ) * P( {phenotype, parental genotype} )

• S = “phenotype and parental genotype/sufficient statistic”

= Screening Step, based on the conditional mean model (Lange et al (2003))

= Testing Step, based on FBAT (Laird et al (2000))

• Properties of the testing strategy:

– Outperforms standard adjustments for multiple comparisons by factors up to 40

– Additional power boost by the use of complex phenotypes such as longitudinal data:

  – Discovery of INSIG2 in a 100K-scan in the Framingham Heart Study

  – First replicable association for BMI / obesity (Herbert et al (2006, Science))

• Alternative approach:

– Instead of using the between-component (Screening Step) and the within-component (Testing Step) in a two-stage testing strategy, one could include both components in the test statistic, e.g. QTDT (Abecasis et al (2000))

– Disadvantages:

  – Only marginal power gains (5%) over the FBAT statistic when a single SNP is tested (Abecasis et al (2001))

  – Lack of robustness against population admixture (Yu et al (2006))

= Within-family component (Fulker et al (1999))

= Between-family component (Fulker et al (1999))


Can we translate this concept to association studies in unrelated cases and controls?

χ² tests and Armitage trend tests are conditional tests that condition upon the margins =>

The data-partitioning statistic S is given by the margins of the table

COMPLETE SET

            Number of Alleles
            0      1      2      Total
Cases       125    265    110    500
Controls    173    241    86     500
Total       298    506    196    1000
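As an illustration of a test that conditions on the table margins, the sketch below computes a Cochran-Armitage trend statistic for the complete set. It is a minimal sketch, assuming additive allele-count weights (0, 1, 2) and a normal approximation for the p-value; the function name `armitage_trend_test` is hypothetical and this is not taken from the C2BAT software.

```python
import math


def armitage_trend_test(cases, controls, weights=(0, 1, 2)):
    """Cochran-Armitage trend test for a 2 x 3 genotype table.

    Returns (z, two_sided_p) using the normal approximation.
    """
    r_total, s_total = sum(cases), sum(controls)
    n = r_total + s_total
    cols = [r + s for r, s in zip(cases, controls)]

    # Weighted difference between case and control genotype counts.
    t = sum(w * (r * s_total - s * r_total)
            for w, r, s in zip(weights, cases, controls))

    # Variance of t given the row and column margins of the table.
    sq = sum(w * w * c * (n - c) for w, c in zip(weights, cols))
    cross = sum(weights[i] * weights[j] * cols[i] * cols[j]
                for i in range(len(cols)) for j in range(i + 1, len(cols)))
    var = r_total * s_total / n * (sq - 2 * cross)

    z = t / math.sqrt(var)
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal p-value
    return z, p


z, p = armitage_trend_test(cases=[125, 265, 110], controls=[173, 241, 86])
print(f"z = {z:.2f}, p = {p:.4f}")
```

The variance term depends on the data only through the margins, which is what makes the margins a natural choice for the statistic S.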

ESTIMATION SET

            Number of Alleles
            0      1      2      Total
Cases       63     133    55     250
Controls    85     120    43     250
Total       148    253    98     500

TESTING SET

            Number of Alleles
            0      1      2      Total
Cases       63     133    55     250
Controls    85     120    43     250
Total       148    253    98     500

ESTIMATION SET (= Screening Step)

            Number of Alleles
            0            1            2            Total
Cases       94           133          28           255
Controls    130          120          21           271
Total       224 (75%)    253 (50%)    49 (25%)     526

TESTING SET (= Testing Step)

            Number of Alleles
            0            1            2            Total
Cases       31           132          82           245
Controls    43           121          65           229
Total       74 (25%)     253 (50%)    147 (75%)    474

Testing strategy:

1.) Divide table into a “screening table” and a “testing table”

2.) For each SNP, use the “screening table” and the margins of the “testing table” to assess evidence for association in the screening step

3.) Select the most promising N SNPs and test them for association based on the data of the testing table.
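Step 1 of the strategy can be sketched as a random assignment of subjects to the two tables. The sketch assumes that a subject carrying g alleles goes to the screening table with probability 0.75, 0.50 or 0.25 for g = 0, 1, 2, mirroring the column shares shown in the example tables; the actual assignment rule used by C2BAT is not given on the slides, and `split_table` and `margins` are hypothetical helper names.

```python
import random

GENOTYPES = (0, 1, 2)


def split_table(cases, controls, frac_screen=(0.75, 0.50, 0.25), seed=0):
    """Randomly assign each subject to the screening or the testing table.

    frac_screen[g] is the assumed probability that a subject with g alleles
    lands in the screening table.
    """
    rng = random.Random(seed)
    screen = {"cases": [0, 0, 0], "controls": [0, 0, 0]}
    test = {"cases": [0, 0, 0], "controls": [0, 0, 0]}
    for status, counts in (("cases", cases), ("controls", controls)):
        for g in GENOTYPES:
            for _ in range(counts[g]):
                target = screen if rng.random() < frac_screen[g] else test
                target[status][g] += 1
    return screen, test


def margins(table):
    """Row totals (cases, controls) and column totals (0, 1, 2 alleles)."""
    rows = [sum(table["cases"]), sum(table["controls"])]
    cols = [c + k for c, k in zip(table["cases"], table["controls"])]
    return rows, cols


screen, test = split_table([125, 265, 110], [173, 241, 86])
print("screening table:", screen)
print("testing table:  ", test)
print("testing margins:", margins(test))
```

Only the margins of the testing table would be passed on to the screening step; its cell counts stay untouched until the selected SNPs are tested.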

How can we obtain information about an association from the margins?

NON-INFORMATIVE SET

            Number of Alleles
            0      1      2      Total
Cases       94     133    28     255
Controls    130    120    21     271
Total       224    253    49     526

TESTING SET

            Number of Alleles
            0      1      2      Total
Cases       31     132    82     245
Controls    43     121    65     229
Total       74     253    147    474

+

IMPUTED SET

            Number of Alleles
            0      1      2      Total
Cases       31     131    83     245
Controls    43     122    64     229
Total       74     253    147    N(T)

MARGINAL SET

            Number of Alleles
            0      1      2      Total
Cases       .      .      .      245
Controls    .      .      .      229
Total       74     253    147    474

SCREENING SET

            Number of Alleles
            0      1      2      Total
Cases       125    264    111    500
Controls    173    242    85     500
Total       298    506    196    1000
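The arithmetic linking these tables can be checked directly: the cell counts of the screening set are the sums of the non-informative set and the imputed set, and the imputed set reproduces the margins of the testing set. The sketch below only verifies these sums with the numbers from the tables; it does not show how the imputed cells are obtained, since the slides do not describe that step. The helper names are hypothetical.

```python
# Cell counts copied from the tables (columns: 0, 1, 2 alleles).
non_informative = {"cases": [94, 133, 28], "controls": [130, 120, 21]}
imputed = {"cases": [31, 131, 83], "controls": [43, 122, 64]}
testing_margins = {"rows": [245, 229], "cols": [74, 253, 147]}


def add_tables(a, b):
    """Cell-wise sum of two genotype tables."""
    return {k: [x + y for x, y in zip(a[k], b[k])] for k in a}


def table_margins(t):
    """Row totals (cases, controls) and column totals (0, 1, 2 alleles)."""
    return {"rows": [sum(t["cases"]), sum(t["controls"])],
            "cols": [c + k for c, k in zip(t["cases"], t["controls"])]}


screening = add_tables(non_informative, imputed)
print("screening set:", screening)
print("imputed set matches testing margins:",
      table_margins(imputed) == testing_margins)
```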

Results will depend on the actual random split-up of the tables!

Solution:

1.) Re-sampling of the tables

2.) p-value for testing set based on p(data)=p(data|S(data))*p(S(data)) and Monte-Carlo simulations

Simulation Study

Cases/Controls   OR     SNPs      Method     Allele Frequencies
                                             0.10   0.20   0.30   0.40
500              1.50   100,000   C2BAT      0.06   0.18   0.25   0.29
                                  Standard   0.01   0.09   0.14   0.15
500              1.60   100,000   C2BAT      0.14   0.36   0.49   0.46
                                  Standard   0.02   0.19   0.34   0.30
700              1.50   100,000   C2BAT      0.13   0.36   0.56   0.53
                                  Standard   0.03   0.18   0.42   0.31
700              1.60   100,000   C2BAT      0.27   0.57   0.84   0.85
                                  Standard   0.09   0.35   0.64   0.68

Can C2BAT find INSIG2 in the 100K-scan in the Framingham Heart Study again?

1400 probands in about 300 families:

Randomly select 150 unrelated cases/controls (BMI>28 = “affected”)

=> Apply standard analysis (p-value adjusted by Bonferroni correction) and C2BAT to see whether INSIG2 reaches genome-wide significance

For 1,000 replicates:

Power of standard analysis to detect INSIG2: 5%

Power of C2BAT to detect INSIG2: 17%

Future work:

1.) Extension to quantitative traits =>Expression analysis

2.) Gene-gene interactions

Software: www.c2bat.com