Transcript of presentation in 1000 Genomes Phase2 meeting
Phase2 SNP calling: Low-coverage and exome
Jin Yu 2012/05/07
Overview
• Two ideas in Phase2 SNP calling
– Use exome off-target reads for whole genome SNP site discovery
– Use exome genotype calls to improve overall genotype accuracy
• Preliminary results and plan
Review of phase1 SNP pipeline
• Low coverage (~4X) WGS BAMs: multi-sample calling, giving population SNP sites
• High coverage (~50X) WES BAMs: single-sample calling, giving individual SNP sites and genotypes
• Apply multi-center consensus strategy and merge SNP sites
• Calculate genotype likelihood on all candidate sites
• Impute genotype/haplotype
Two ideas in Phase2
• Low coverage (~4X) WGS BAMs: multi-sample calling, giving population SNP sites
• High coverage (~50X) WES BAMs: single-sample calling, giving individual SNP sites and genotypes
• Apply multi-center consensus strategy and merge SNP sites
• Calculate genotype likelihood on all candidate sites
• Impute genotype/haplotype
The first idea
• Use exome off-target reads in whole genome SNP calling
– Exome off-target reads provide significant coverage across the whole genome
– Preliminary results showed higher SNP sensitivity and reasonable quality
Exome off-target reads are significant
[Figure: Average coverage on off-target regions, per sample; y-axis 0 to 20; series: exome, lowpass; x-axis: 45 samples (NA11832 ... NA11994)]
[Figure: % off-target reads in exome capture sequencing, per sample; y-axis 0% to 100%; x-axis: 45 samples (NA11832 ... NA11994)]
• ~50% of exome capture sequencing reads are off-target
• Off-target reads add ~1X average coverage across the whole genome
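The ~1X figure can be sanity-checked with back-of-the-envelope arithmetic. The sketch below is illustrative only: the read count, read length, and genome size are assumed values and do not come from the slides; only the ~50% off-target fraction and the ~4X lowpass depth do. It also treats the off-target region as the whole genome, which is a reasonable approximation because the exome target is a small fraction of it.

```python
def mean_coverage(n_reads: int, read_length: int, region_size: int) -> float:
    """Mean depth = total sequenced bases / size of the region they cover."""
    if region_size <= 0:
        raise ValueError("region_size must be positive")
    return n_reads * read_length / region_size


# Hypothetical inputs, chosen only to illustrate the arithmetic.
GENOME_SIZE = 3_100_000_000      # approximate human genome length in bp (assumption)
READ_LENGTH = 100                # bases per read (assumption)
TOTAL_EXOME_READS = 60_000_000   # reads in one exome BAM (assumption)
OFF_TARGET_FRACTION = 0.5        # "~50% ... off-target" (from the slide)
LOWPASS_DEPTH = 4.0              # "~4X" low-coverage WGS (from the slide)

off_target_reads = round(TOTAL_EXOME_READS * OFF_TARGET_FRACTION)
added_depth = mean_coverage(off_target_reads, READ_LENGTH, GENOME_SIZE)
combined_depth = LOWPASS_DEPTH + added_depth

print(f"off-target depth: {added_depth:.2f}X, combined: {combined_depth:.2f}X")
```

With these assumed inputs the off-target reads contribute a little under 1X, so a ~4X lowpass sample would gain roughly a quarter more depth outside the capture targets.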
Exome off-target reads are evenly distributed
• Weighted read depths calculated using EBD in 5kb sliding windows across chr20
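A sliding-window depth profile of this kind might be computed as sketched below. This is a simplified stand-in, not the method used for the slide: the slides do not define EBD or the weighting, so the sketch takes a plain mean of per-base depth, and the 1 kb step between windows is an assumption (only the 5 kb window size is from the slide).

```python
from typing import List, Sequence, Tuple


def windowed_mean_depth(
    depth: Sequence[float], window: int = 5000, step: int = 1000
) -> List[Tuple[int, float]]:
    """Return (window_start, mean_depth) for every full window along a contig.

    `depth` holds one value per base. The weighting used on the slide is not
    specified, so this sketch uses an unweighted mean.
    """
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")

    # Prefix sums make each window total an O(1) lookup.
    prefix = [0.0]
    for d in depth:
        prefix.append(prefix[-1] + d)

    result = []
    for start in range(0, len(depth) - window + 1, step):
        total = prefix[start + window] - prefix[start]
        result.append((start, total / window))
    return result


# Tiny usage example with a window of 2 bases and a step of 1.
profile = windowed_mean_depth([1, 2, 3, 4], window=2, step=1)
print(profile)
```

An even distribution would show up as a flat profile; strong peaks would indicate that off-target reads pile up near capture targets instead.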
SNP calling experiment
• Using all 1449 phase2 lowpass BAMs and 1182 exome Illumina BAMs
• Calling model modified from SNPtools
– Combining reads of the same sample to estimate the variance of true variant reads
– Grouping reads of the same sequencing platform to estimate the variance of platform-specific bias
SNP calls comparison (chr20 off-target regions)
Call set                 #SNP     Ti/Tv  # in Phase1  Known Ti/Tv  % Rare (MAF<1%)  % Novel to Phase1  Novel Ti/Tv  OMNI poly sensitivity  OMNI mono false discovery
BI phase2 baseline       821,141  2.31   514,021      2.34         72.5%            37.4%              2.24         98.2% (50,195/51,126)  0.9% (12/1265)
BCM phase2 baseline      847,274  2.33   502,517      2.42         68.6%            40.7%              2.20         98.6% (50,406/51,126)  1.9% (24/1265)
BCM Phase2 experimental  911,602  2.32   521,189      2.42         69.7%            42.8%              2.19         98.8% (50,494/51,126)  2.1% (27/1265)
Additional SNPs          64,328   2.17   18,672       2.26         99.1%            71.0%              2.13         0.2% (88/51,126)       0.2% (3/1265)
• Called ~7% more SNPs in off-target regions by using exome reads
• Ti/Tv and OMNI metrics showed reasonable quality
• Additional SNPs are mostly rare in phase1 calls or novel
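The percentage columns in the comparison table can be recomputed from the raw counts. The sketch below uses only counts copied from the table; the dictionary layout and function names are illustrative. It assumes that "% Novel to Phase1" is (#SNP minus # in Phase1) divided by #SNP, which is consistent with every row of the table.

```python
# Counts copied from the comparison table (chr20 off-target regions).
CALL_SETS = {
    "BI phase2 baseline": {
        "n_snp": 821_141, "in_phase1": 514_021, "omni_poly": 50_195, "omni_mono": 12},
    "BCM phase2 baseline": {
        "n_snp": 847_274, "in_phase1": 502_517, "omni_poly": 50_406, "omni_mono": 24},
    "BCM Phase2 experimental": {
        "n_snp": 911_602, "in_phase1": 521_189, "omni_poly": 50_494, "omni_mono": 27},
}
OMNI_POLY_TOTAL = 51_126   # OMNI polymorphic sites in the region
OMNI_MONO_TOTAL = 1_265    # OMNI monomorphic sites in the region


def pct(numerator: int, denominator: int) -> float:
    """Percentage helper."""
    return 100.0 * numerator / denominator


def summarize(row: dict) -> dict:
    """Recompute the derived percentage columns for one call set."""
    return {
        "novel_pct": pct(row["n_snp"] - row["in_phase1"], row["n_snp"]),
        "sensitivity_pct": pct(row["omni_poly"], OMNI_POLY_TOTAL),
        "false_discovery_pct": pct(row["omni_mono"], OMNI_MONO_TOTAL),
    }


baseline = CALL_SETS["BCM phase2 baseline"]
experimental = CALL_SETS["BCM Phase2 experimental"]
additional_snps = experimental["n_snp"] - baseline["n_snp"]
gain_pct = pct(additional_snps, baseline["n_snp"])

for name, row in CALL_SETS.items():
    print(name, {k: round(v, 1) for k, v in summarize(row).items()})
print("additional SNPs:", additional_snps, f"({gain_pct:.1f}% over BCM baseline)")
```

The difference between the two BCM call sets reproduces the "Additional SNPs" count of 64,328, which is the basis for the ~7% gain quoted on the slide.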
MAF distribution comparison (after imputation)
Both increasing sample size and adding exome reads increase SNP discovery rate on the rare end (0.1% bin)
The second idea
• Using exome calls to refine genotype imputation
– Exome calls are of high quality and independent of AF
– Exome pipeline addressed platform/capture-specific errors
A snapshot of Phase1 exome SNP validation results
             total  submitted  yield  validated  validated/yield
singleton    5372   100        93     92         98.9%
<1%          4430   50         49     47         95.9%
>1%          1896   50         46     46         100%
SVM overall  11698  200        188    185        98.4%
Why does the <1% bin have the lowest validation rate?
• Validation sample selection
• Imputation artifacts
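Because each bin contains fewer than 100 validated sites, it can help to put an uncertainty range on the validation rates. The sketch below uses the Wilson score interval, which is a choice made here for illustration and is not something the slides mention; the counts are the validated and yield columns from the validation table.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1.0 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the boundaries.
    return max(0.0, centre - half), min(1.0, centre + half)


# (validated, yield) pairs copied from the validation table.
BINS = {
    "singleton": (92, 93),
    "<1%": (47, 49),
    ">1%": (46, 46),
    "SVM overall": (185, 188),
}

for name, (validated, yielded) in BINS.items():
    lo, hi = wilson_interval(validated, yielded)
    print(f"{name:12s} {validated}/{yielded} = {validated / yielded:.1%} "
          f"(approx. 95% interval {lo:.1%} to {hi:.1%})")
```

The interval for the <1% bin is wide because only 49 sites yielded a result, so the individual failures are worth inspecting one by one.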
A closer look at imputation artifacts
Chr  Pos       Site source  AC         Sample picked for validation  PCR-454 validation  Phase1 release v1  GL in log-10 scale (RR/RA/AA)       Exome calls (BCM)
20   20033172  EX_SOLID     singleton  NA19468 (SOLiD)               0/0                 0/1                ./.:-5,-0.000391054,-3.04576        0/1
20   23667835  EX_ILLUMINA  <1%        NA18510 (Illumina)            0/0                 0/1                ./.:-5,-0.00020851,-3.31876         0/0 or ./.
20   23667835  EX_ILLUMINA  <1%        NA18858 (Illumina)            0/0                 0/1                ./.:-2.72124,-0.000825952,-5        0/0 or ./.
20   25478962  EX_ILLUMINA  <1%        HG00104 (SOLiD)               0/0                 0/1                ./.:-5,0,-5                         0/0 or ./.
20   25478962  EX_ILLUMINA  <1%        HG00234 (SOLiD)               0/0                 0/1                ./.:-3.1549,-0.000304111,-5         0/0 or ./.
20   25478962  EX_ILLUMINA  <1%        HG00364 (SOLiD)               0/0                 0/1                ./.:-4.69838,-8.69777e-06,-5        0/0 or ./.
20   25478962  EX_ILLUMINA  <1%        HG00593 (SOLiD)               0/0                 0/1                ./.:-3.1938,-0.000278053,-5         0/0 or ./.
20   25478962  EX_ILLUMINA  <1%        HG01271 (SOLiD)               0/0                 0/1                ./.:-0.31142,-0.290883,-5           0/0 or ./.
20   60885811  EX_ILLUMINA  <1%        HG00134 (SOLiD)               0/0                 0/1                ./.:-0.477139,-0.477113,-0.477113   0/0 or ./.
20   60885811  EX_ILLUMINA  <1%        HG00350 (SOLiD)               0/0                 0/1                ./.:-0.123447,-0.61343,-2.41117     0/0 or ./.
20   62326235  EX_ILLUMINA  <1%        HG00128 (SOLiD)               0/0                 0/1                ./.:-4.22169,-2.6068e-05,-5         0/0 or ./.
20   62326235  EX_ILLUMINA  <1%        HG00179 (SOLiD)               0/0                 0/1                ./.:-3.22182,-0.000773747,-2.92812  0/0 or ./.
SNPs were called in one sample but incorrectly imputed in other samples
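To read the GL column, it helps to turn each log-10 triple into normalized genotype probabilities. The sketch below is a minimal illustration that assumes a flat prior over RR/RA/AA and a hypothetical 0.9 confidence cutoff; neither is taken from the slides, and real imputation uses haplotype information rather than a flat prior.

```python
from typing import List, Sequence, Tuple

GENOTYPES = ("0/0", "0/1", "1/1")  # RR, RA, AA, in the order used by the GL column


def gl_to_probs(log10_gl: Sequence[float]) -> List[float]:
    """Normalize log10 genotype likelihoods into probabilities (flat prior)."""
    top = max(log10_gl)
    scaled = [10 ** (g - top) for g in log10_gl]  # subtract max for stability
    total = sum(scaled)
    return [s / total for s in scaled]


def call_from_gl(gl_field: str, min_prob: float = 0.9) -> Tuple[str, float]:
    """Return (genotype, probability); './.' when no genotype reaches min_prob.

    `min_prob` is a hypothetical cutoff used only for this illustration.
    """
    values = [float(x) for x in gl_field.split(",")]
    if len(values) != 3:
        raise ValueError("expected three comma-separated log10 likelihoods")
    probs = gl_to_probs(values)
    best = max(range(3), key=probs.__getitem__)
    genotype = GENOTYPES[best] if probs[best] >= min_prob else "./."
    return genotype, probs[best]


# GL triples copied from the table (NA19468 and HG00134 rows).
confident = call_from_gl("-5,-0.000391054,-3.04576")
uninformative = call_from_gl("-0.477139,-0.477113,-0.477113")
print(confident, uninformative)
```

The HG00134 triple normalizes to roughly one third for each genotype, meaning the likelihoods carry almost no information for that sample at that site.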
Integrating exome genotypes with GL
Override generic GL by exome and SNP array genotypes
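One way to picture the override is sketched below. This is a hypothetical illustration, not the pipeline's implementation: the function name, the genotype encoding, and the -5 floor are assumptions (the floor is chosen to match the smallest log-10 GL value seen in the imputation-artifact table).

```python
from typing import Sequence, Tuple

GENOTYPE_INDEX = {"0/0": 0, "0/1": 1, "1/1": 2}  # RR, RA, AA
GL_FLOOR = -5.0  # assumption: same floor as the smallest GL in the table


def override_gl(
    generic_gl: Sequence[float], trusted_gt: str, floor: float = GL_FLOOR
) -> Tuple[float, ...]:
    """Replace a generic log10 GL triple with a sharp one when a trusted call exists.

    `trusted_gt` is an exome or SNP array genotype. A missing call ('./.') or
    any unrecognized string leaves the generic likelihoods untouched.
    """
    idx = GENOTYPE_INDEX.get(trusted_gt)
    if idx is None:
        return tuple(generic_gl)
    # Likelihood 1 (log10 = 0) for the trusted genotype, floor for the others.
    return tuple(0.0 if i == idx else floor for i in range(3))


# Generic GL copied from the NA18510 row of the table.
generic = (-5.0, -0.00020851, -3.31876)
print(override_gl(generic, "0/0"))   # trusted homozygous-reference call
print(override_gl(generic, "./."))   # no trusted call: unchanged
```

Using a finite floor instead of negative infinity keeps the downstream imputation step able to overrule a trusted call if the evidence against it is overwhelming; whether that is desirable is a design decision the slides do not address.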
Future work
• Use both Illumina and SOLiD exome data to assist whole genome SNP calling in the next experiment
• Integrate exome genotype calls in whole genome imputation
Acknowledgements
HGSC-BCM
• Fuli Yu
• Danny Challis
• Uday Evani
• Matthew Bainbridge
• Donna Muzny
• Jeffrey Reid
• Richard Gibbs
• Yi Wang
BlueBioU@Rice
• Research Computing group