C2BAT: Using the same data set for screening and testing.
A testing strategy for genome-wide association studies in case/control design
Matt McQueen, Jessica Su, Nan Laird and Christoph Lange
Harvard School of Public Health
Genome-wide association studies
Limitation of linkage analysis and the potential of association analysis
=>
genome-wide association studies
(Risch & Merikangas 1997)
>100,000 SNPs and phenotypes are tested for association.
Statistical road block:
Severe multiple testing problem!!!
“Using the same data set for screening and testing”
• Testing strategy:
– Assess evidence for association for all SNPs based on S (Screening Step)
– Select a small subset of N markers (10-200)
– Compute the association test conditional upon S and adjust for N comparisons (Testing Step)
– If the screening step and the testing step are statistically independent, we can look at the data in the screening step without paying a “statistical price” for it.
Screening technique S
Testing statistic T
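A minimal sketch of the two-step strategy, assuming the screening scores and the testing p-values have already been computed and are statistically independent. The function name `screen_then_test` and all numbers are hypothetical and only illustrate why the adjustment is for N comparisons rather than for all SNPs.

```python
def screen_then_test(screen_scores, test_pvalues, n_select, alpha=0.05):
    """Pick the n_select SNPs with the best screening score, then test only those.

    screen_scores: one score per SNP from the screening step (larger = more promising).
    test_pvalues:  one p-value per SNP from a test that is independent of the screen.
    Returns (snp_index, p_value) pairs that pass the threshold alpha / n_select.
    """
    order = sorted(range(len(screen_scores)),
                   key=lambda i: screen_scores[i], reverse=True)
    selected = order[:n_select]
    threshold = alpha / n_select  # Bonferroni over N selected SNPs, not over all SNPs
    return [(i, test_pvalues[i]) for i in selected if test_pvalues[i] <= threshold]


# Toy example with 6 SNPs; all numbers are made up for illustration.
scores = [0.2, 3.1, 0.5, 2.7, 0.1, 1.9]
pvals = [0.001, 0.004, 0.5, 0.03, 1e-6, 0.2]
hits = screen_then_test(scores, pvals, n_select=3)
print(hits)
```

Note that a SNP with a small testing p-value but a poor screening score is never tested; this is the price of the reduced multiple-testing burden.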
“Using the same data set for screening and testing”
General concept proposed by Laird and Lange (2006, Nat Rev Genet)
Decomposition of joint-likelihood:
P( {phenotype, genotype} ) = P( {phenotype, genotype} | S({phenotype, genotype}) ) * P( S({phenotype, genotype}) )
• S = “Summary test statistic to assess evidence for association”
• Requirements for S:
– The association test has to condition on S
– S has to contain information about the potential association as well
= Screening Step = Testing Step
• Testing strategy:
– Assess evidence for association for all SNPs based on S (Screening Step)
– Select a small subset of N markers (10-200)
– Compute the association test conditional upon S and adjust for N comparisons (Testing Step)
– The screening step and the testing step are statistically independent !!!
“Using the same data set for screening and testing”
Application to family-based association tests (Van Steen et al (2005))
Decomposition of joint-likelihood:
P( {phenotype, genotype, parental genotype} ) = P( {phenotype, genotype} | {phenotype, parental genotype} ) * P( {phenotype, parental genotype} )
• S = “phenotype and parental genotype/sufficient statistic”
= Screening Step based on the conditional mean model (Lange et al (2003))
= Testing Step based on FBAT (Laird et al (2000))
• Properties of the testing strategy:
– Outperforms standard adjustments for multiple comparisons by factors up to 40
– Additional power boost by the use of complex phenotypes such as longitudinal data:
  Discovery of INSIG2 in a 100K-scan in the Framingham Heart Study
  First replicable association for BMI / obesity (Herbert et al (2006, Science))
• Alternative approach:
– Instead of using the between-component (Screening Step) and the within-component (Testing Step) in a 2-stage testing strategy, one could include both components in the test statistic, e.g. QTDT (Abecasis et al (2000))
– Disadvantages:
  – Only marginal power gains (5%) over the FBAT-statistic when a single SNP is tested (Abecasis et al (2001))
  – Lack of robustness against population admixture (Yu et al (2006))
= Within-family component (Fulker et al (1999))
= Between-family component (Fulker et al (1999))
“Using the same data set for screening and testing”
Can we translate this concept to association studies in unrelated cases and controls?
χ² tests and Armitage trend tests are conditional tests that condition upon the margins
=>
The data-partitioning statistic S consists of the margins of the table
COMPLETE SET
Number of Alleles
0 1 2
Cases 125 265 110 500
Controls 173 241 86 500
298 506 196 1000
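As an illustration of the two conditional tests named above, the following sketch computes the Pearson χ² statistic (2 df) and the Cochran-Armitage trend statistic (1 df, scores 0/1/2) for the complete table. The function names are hypothetical, and the trend variance uses the common N (rather than N-1) convention, which is an assumption.

```python
import math


def pearson_chi2(cases, controls):
    """Pearson chi-square for a 2 x 3 genotype table; returns (statistic, p-value)."""
    n_cases, n_controls = sum(cases), sum(controls)
    n = n_cases + n_controls
    stat = 0.0
    for r, s in zip(cases, controls):
        col = r + s
        for observed, row_total in ((r, n_cases), (s, n_controls)):
            expected = row_total * col / n  # depends on the margins only
            stat += (observed - expected) ** 2 / expected
    return stat, math.exp(-stat / 2.0)  # chi-square survival function for 2 df


def armitage_trend(cases, controls, scores=(0, 1, 2)):
    """Cochran-Armitage trend test; returns (chi-square statistic, p-value)."""
    n_cases = sum(cases)
    n = n_cases + sum(controls)
    cols = [r + s for r, s in zip(cases, controls)]
    sum_t = sum(t * c for t, c in zip(scores, cols))
    sum_t2 = sum(t * t * c for t, c in zip(scores, cols))
    # Observed minus expected score total among cases, given the margins.
    u = sum(t * r for t, r in zip(scores, cases)) - n_cases * sum_t / n
    var = n_cases * (n - n_cases) * (n * sum_t2 - sum_t ** 2) / n ** 3
    stat = u * u / var
    return stat, math.erfc(math.sqrt(stat / 2.0))  # survival function for 1 df


chi2_stat, chi2_p = pearson_chi2([125, 265, 110], [173, 241, 86])
trend_stat, trend_p = armitage_trend([125, 265, 110], [173, 241, 86])
print(round(chi2_stat, 2), round(trend_stat, 2))
```

Both expected counts and the trend variance are functions of the row and column totals only, which is what makes the margins a candidate for S.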
ESTIMATION SET
Number of Alleles
0 1 2
Cases 63 133 55 250
Controls 85 120 43 250
148 253 98 500
TESTING SET
Number of Alleles
0 1 2
Cases 63 133 55 250
Controls 85 120 43 250
148 253 98 500
ESTIMATION SET
Number of Alleles
0 1 2
Cases 94 133 28 255
Controls 130 120 21 271
224 (75%) 253 (50%) 49 (25%) 526
TESTING SET
Number of Alleles
0 1 2
Cases 31 132 82 245
Controls 43 121 65 229
74 (25%) 253 (50%) 147 (75%) 474
= Screening Step = Testing Step
Testing strategy:
1.) Divide table into a “screening table” and a “testing table”
2.) For each SNP, use the “screening table” and the margins of the “testing table” to assess evidence for association in the screening step
3.) Select the most promising N SNPs and test them for association based on the data of the testing table.
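Step 1 can be sketched as a random split of the individuals within each genotype column. The fractions 75% / 50% / 25% are read off the example tables; how C2BAT actually chooses them is not stated here, so the function `split_table` and its arguments are assumptions for illustration only.

```python
import random


def split_table(cases, controls, fractions, rng):
    """Split a 2 x 3 table into an estimation (screening) part and a testing part.

    fractions[j] is the share of genotype column j assigned to the estimation part.
    Within each column, individuals are assigned at random regardless of status.
    Row 0 of each returned table holds cases, row 1 holds controls.
    """
    estimation = [[0] * len(cases), [0] * len(cases)]
    testing = [[0] * len(cases), [0] * len(cases)]
    for j, (r, s) in enumerate(zip(cases, controls)):
        status = [0] * r + [1] * s         # 0 = case, 1 = control
        rng.shuffle(status)
        k = round(fractions[j] * (r + s))  # column total of the estimation part
        for row in status[:k]:
            estimation[row][j] += 1
        for row in status[k:]:
            testing[row][j] += 1
    return estimation, testing


rng = random.Random(1)  # fixed seed so the example is repeatable
estimation, testing = split_table([125, 265, 110], [173, 241, 86],
                                  (0.75, 0.50, 0.25), rng)
print(estimation, testing)
```

The column totals of the two parts are fixed by the fractions, while the case/control counts inside each column vary from split to split.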
How can we obtain information about an association from the margins?
NON-INFORMATIVE SET
Number of Alleles
0 1 2
Cases 94 133 28 255
Controls 130 120 21 271
224 253 49 526
TESTING SET
Number of Alleles
0 1 2
Cases 31 132 82 245
Controls 43 121 65 229
74 253 147 474
+
IMPUTED SET
Number of Alleles
0 1 2
Cases 31 131 83 245
Controls 43 122 64 229
74 253 147 N(T)
MARGINAL SET
Number of Alleles
0 1 2
Cases . . . 245
Controls . . . 229
74 253 147 474
SCREENING SET
Number of Alleles
0 1 2
Cases 125 264 111 500
Controls 173 242 85 500
298 506 196 1000
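One procedure that reproduces the imputed counts shown above is iterative proportional fitting: start from the non-informative table and rescale it until it matches the margins of the testing table. Whether C2BAT imputes the cells this way is an assumption; the sketch only shows that the margins plus the non-informative table are enough to arrive at the screening set.

```python
def impute_from_margins(seed, row_totals, col_totals, sweeps=100):
    """Iterative proportional fitting: rescale seed until it matches the margins."""
    table = [[float(x) for x in row] for row in seed]
    for _ in range(sweeps):
        for j, target in enumerate(col_totals):     # match column totals
            current = sum(row[j] for row in table)
            for row in table:
                row[j] *= target / current
        for row, target in zip(table, row_totals):  # match row totals
            current = sum(row)
            for j in range(len(row)):
                row[j] *= target / current
    return table


non_informative = [[94, 133, 28], [130, 120, 21]]  # rows: cases, controls
imputed = impute_from_margins(non_informative,
                              row_totals=[245, 229],
                              col_totals=[74, 253, 147])
rounded = [[round(x) for x in row] for row in imputed]
# Screening set = non-informative set + imputed set, cell by cell.
screening = [[a + b for a, b in zip(r1, r2)]
             for r1, r2 in zip(non_informative, rounded)]
print(rounded, screening)
```

The fitted table keeps the odds ratios of the non-informative set, so the true cell counts of the testing table are never used in the screening step.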
Results will depend on the actual random split-up of the tables!
Solution:
1.) Re-sampling of the tables
2.) p-value for testing set based on p(data)=p(data|S(data))*p(S(data)) and Monte-Carlo simulations
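The slide does not spell out the Monte-Carlo procedure, so the sketch below swaps in a standard permutation test: case/control labels are shuffled, which keeps both margins of the testing table fixed, and the trend score is recomputed each time. The function names and the number of permutations are assumptions.

```python
import random


def trend_score(status, genotype):
    """Sum of allele counts among cases (status == 1)."""
    return sum(g for s, g in zip(status, genotype) if s == 1)


def permutation_pvalue(cases, controls, n_perm, rng):
    """Two-sided Monte Carlo p-value for trend, conditional on both margins."""
    # Expand the 2 x 3 table into one (status, genotype) pair per individual.
    genotype = [g for g, (r, s) in enumerate(zip(cases, controls))
                for _ in range(r + s)]
    status = [x for r, s in zip(cases, controls) for x in [1] * r + [0] * s]
    expected = sum(status) * sum(genotype) / len(genotype)
    observed = abs(trend_score(status, genotype) - expected)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(status)  # row and column totals stay fixed
        if abs(trend_score(status, genotype) - expected) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


p_value = permutation_pvalue([31, 132, 82], [43, 121, 65],
                             n_perm=1000, rng=random.Random(0))
print(p_value)
```

Repeating the split with different seeds and recomputing the p-value each time gives a feel for how much the result depends on one particular split.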
Simulation Study
Cases/Controls  OR    SNPs     Method    Allele Frequencies
                                         0.10  0.20  0.30  0.40
500             1.50  100,000  C2BAT     0.06  0.18  0.25  0.29
                               Standard  0.01  0.09  0.14  0.15
500             1.60  100,000  C2BAT     0.14  0.36  0.49  0.46
                               Standard  0.02  0.19  0.34  0.30
700             1.50  100,000  C2BAT     0.13  0.36  0.56  0.53
                               Standard  0.03  0.18  0.42  0.31
700             1.60  100,000  C2BAT     0.27  0.57  0.84  0.85
                               Standard  0.09  0.35  0.64  0.68
Can C2BAT find INSIG2 again in the 100K-scan in the Framingham Heart Study?
1400 probands in about 300 families:
Randomly select 150 unrelated cases/controls (BMI>28 = “affected”)
=> Apply standard analysis (p-value adjusted by Bonferroni correction) and C2BAT to see whether INSIG2 reaches genome-wide significance
For 1,000 replicates:
Power of standard analysis to detect INSIG2: 5%
Power of C2BAT to detect INSIG2: 17%
Future work:
1.) Extension to quantitative traits => Expression analysis
2.) Gene-gene interactions
Software: www.c2bat.com