presentation in 1000 Genomes Phase2 meeting

Phase2 SNP calling: Low-coverage and exome, Jin Yu, 2012/05/07

Transcript of presentation in 1000 Genomes Phase2 meeting

Page 1: presentation in 1000 Genomes Phase2 meeting

Phase2 SNP calling: Low-coverage and exome

Jin Yu 2012/05/07

Page 2: presentation in 1000 Genomes Phase2 meeting

Overview

• Two ideas in Phase2 SNP calling
  – Use exome off-target reads for whole genome SNP site discovery
  – Use exome genotype calls to improve overall genotype accuracy
• Preliminary results and plan

Page 3: presentation in 1000 Genomes Phase2 meeting

Review of phase1 SNP pipeline

• Low coverage (~4X) WGS BAMs: multi-sample calling, giving population SNP sites
• High coverage (~50X) WES BAMs: single-sample calling, giving individual SNP sites and genotypes
• Apply multi-center consensus strategy and merge SNP sites
• Calculate genotype likelihood on all candidate sites
• Impute genotype/haplotype

Page 4: presentation in 1000 Genomes Phase2 meeting

Two ideas in Phase2

• Low coverage (~4X) WGS BAMs: multi-sample calling, giving population SNP sites
• High coverage (~50X) WES BAMs: single-sample calling, giving individual SNP sites and genotypes
• Apply multi-center consensus strategy and merge SNP sites
• Calculate genotype likelihood on all candidate sites
• Impute genotype/haplotype

Page 5: presentation in 1000 Genomes Phase2 meeting

The first idea

• Use exome off-target reads in whole genome SNP calling
  – Exome off-target reads contribute significant coverage across the whole genome
  – Preliminary results showed higher SNP sensitivity and reasonable quality

Page 6: presentation in 1000 Genomes Phase2 meeting

Exome off-target reads are significant

[Figure: Average coverage on off-target regions. Bars per sample (45 samples, NA11832 to NA11994 in plot order); series: exome, lowpass; y-axis 0 to 20.]

[Figure: % off-target reads in exome capture sequencing. Bars per sample (same 45 samples); y-axis 0% to 100%.]

• ~50% of exome capture sequencing reads are off-target
• Off-target reads add ~1X average coverage across the whole genome
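The ~1X figure can be sanity-checked with simple arithmetic. The sketch below is illustrative only: the read count, read length, and genome size are hypothetical round numbers chosen for the example, not values taken from the slides.

```python
def added_coverage(n_reads, read_length, off_target_fraction, genome_size=3.1e9):
    """Mean genome-wide depth contributed by off-target reads.

    genome_size defaults to a rough human genome length (an assumption).
    """
    off_target_bases = n_reads * read_length * off_target_fraction
    return off_target_bases / genome_size


# Hypothetical exome run: 60 million 100 bp reads, half of them off-target.
depth = added_coverage(n_reads=60_000_000, read_length=100, off_target_fraction=0.5)
print(f"{depth:.2f}X")
```

Under these assumed inputs the off-target reads contribute a little under 1X, which is the same order as the figure quoted on the slide.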

Page 7: presentation in 1000 Genomes Phase2 meeting

Exome off-target reads are evenly distributed

• Weighted read depths calculated using EBD in 5kb sliding windows across chr20
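A windowed depth profile of this kind can be sketched as follows. This is a plain unweighted mean over sliding windows; the EBD weighting mentioned on the slide is not defined in the transcript and is not reproduced here. The 1 kb step size and the toy depth vector are assumptions for illustration.

```python
from itertools import accumulate


def sliding_window_mean_depth(depths, window=5000, step=1000):
    """Return (window_start, mean_depth) pairs over a per-base depth list.

    Uses prefix sums so each window costs O(1) after one pass over the data.
    """
    prefix = [0] + list(accumulate(depths))
    return [
        (start, (prefix[start + window] - prefix[start]) / window)
        for start in range(0, len(depths) - window + 1, step)
    ]


# Toy region: 6 kb at depth 1 followed by 6 kb at depth 3 (made-up values).
toy_depths = [1] * 6000 + [3] * 6000
windows = sliding_window_mean_depth(toy_depths)
for start, mean_depth in windows:
    print(start, mean_depth)
```

An evenly distributed read set would give a flat profile; the toy input instead shows a step so that the window averaging is visible.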

Page 8: presentation in 1000 Genomes Phase2 meeting

SNP calling experiment

• Using all 1449 phase2 lowpass BAMs and 1182 exome Illumina BAMs
• Calling model modified from SNPtools
  – Combining reads of the same sample to estimate the variance of true variant reads
  – Grouping reads of the same sequencing platform to estimate the variance of platform-specific bias

Page 9: presentation in 1000 Genomes Phase2 meeting

SNP calls comparison (chr20 off-target regions)

| Call set | #SNP | Ti/Tv | # in Phase1 | Known Ti/Tv | % Rare (MAF < 1%) | % Novel to Phase1 | Novel Ti/Tv | OMNI poly sensitivity | OMNI mono false discovery |
|---|---|---|---|---|---|---|---|---|---|
| BI phase2 baseline | 821,141 | 2.31 | 514,021 | 2.34 | 72.5% | 37.4% | 2.24 | 98.2% (50,195/51,126) | 0.9% (12/1265) |
| BCM phase2 baseline | 847,274 | 2.33 | 502,517 | 2.42 | 68.6% | 40.7% | 2.20 | 98.6% (50,406/51,126) | 1.9% (24/1265) |
| BCM Phase2 experimental | 911,602 | 2.32 | 521,189 | 2.42 | 69.7% | 42.8% | 2.19 | 98.8% (50,494/51,126) | 2.1% (27/1265) |
| Additional SNPs | 64,328 | 2.17 | 18,672 | 2.26 | 99.1% | 71.0% | 2.13 | 0.2% (88/51,126) | 0.2% (3/1265) |

• Called ~7% more SNPs on off-target regions by using exome reads
• Ti/Tv and OMNI metrics showed reasonable quality
• Additional SNPs are mostly rare in phase1 calls or novel SNPs
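The derived columns of the comparison table follow directly from the counts. The short check below recomputes them from the numbers on the slide; the dictionary layout and function name are illustrative choices, not part of the original pipeline.

```python
OMNI_POLY_SITES = 51_126  # OMNI polymorphic sites (denominator on the slide)
OMNI_MONO_SITES = 1_265   # OMNI monomorphic sites (denominator on the slide)

# Counts copied from the slide: total SNPs, SNPs already in Phase1,
# OMNI polymorphic sites recovered, OMNI monomorphic sites called.
call_sets = {
    "BI phase2 baseline": (821_141, 514_021, 50_195, 12),
    "BCM phase2 baseline": (847_274, 502_517, 50_406, 24),
    "BCM Phase2 experimental": (911_602, 521_189, 50_494, 27),
}


def summarize(snps, in_phase1, omni_poly_hits, omni_mono_hits):
    """Recompute % novel, OMNI sensitivity and OMNI false discovery (all in %)."""
    return {
        "pct_novel": 100 * (snps - in_phase1) / snps,
        "omni_sensitivity": 100 * omni_poly_hits / OMNI_POLY_SITES,
        "omni_false_discovery": 100 * omni_mono_hits / OMNI_MONO_SITES,
    }


summary = {name: summarize(*counts) for name, counts in call_sets.items()}
for name, stats in summary.items():
    print(name, {key: round(value, 1) for key, value in stats.items()})

# The "Additional SNPs" row is the experimental set minus the BCM baseline.
additional_snps = call_sets["BCM Phase2 experimental"][0] - call_sets["BCM phase2 baseline"][0]
print("additional SNPs:", additional_snps)
```

The recomputed percentages agree with the table to one decimal place, and the difference between the two BCM call sets reproduces the 64,328 additional SNPs.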

Page 10: presentation in 1000 Genomes Phase2 meeting

MAF distribution comparison (after imputation)

Both increasing sample size and adding exome reads increase the SNP discovery rate at the rare end (0.1% bin)

Page 11: presentation in 1000 Genomes Phase2 meeting

The second idea

• Using exome calls to refine genotype imputation
  – Exome calls are of high quality and independent from AF
  – Exome pipeline addressed platform/capture-specific errors

Page 12: presentation in 1000 Genomes Phase2 meeting

A snapshot of Phase1 exome SNP validation results

| Category | total | submitted | yield | validated | validated/yield |
|---|---|---|---|---|---|
| singleton | 5372 | 100 | 93 | 92 | 98.9% |
| <1% | 4430 | 50 | 49 | 47 | 95.9% |
| >1% | 1896 | 50 | 46 | 46 | 100% |
| SVM overall | 11698 | 200 | 188 | 185 | 98.4% |

Why does the <1% category have the lowest validation rate?
• Validation sample selection
• Imputation artifacts

Page 13: presentation in 1000 Genomes Phase2 meeting

A closer look at imputation artifacts

| Chr | Pos | Site source | AC | Sample picked for validation | PCR-454 validation | Phase1 release v1 | GL in log-10 scale RR/RA/AA | Exome calls (BCM) |
|---|---|---|---|---|---|---|---|---|
| 20 | 20033172 | EX_SOLID | singleton | NA19468 (SOLID) | 0/0 | 0/1 | ./.:-5,-0.000391054,-3.04576 | 0/1 |
| 20 | 23667835 | EX_ILLUMINA | <1% | NA18510 (Illumina) | 0/0 | 0/1 | ./.:-5,-0.00020851,-3.31876 | 0/0 or ./. |
| 20 | 23667835 | EX_ILLUMINA | <1% | NA18858 (Illumina) | 0/0 | 0/1 | ./.:-2.72124,-0.000825952,-5 | 0/0 or ./. |
| 20 | 25478962 | EX_ILLUMINA | <1% | HG00104 (SOLiD) | 0/0 | 0/1 | ./.:-5,0,-5 | 0/0 or ./. |
| 20 | 25478962 | EX_ILLUMINA | <1% | HG00234 (SOLiD) | 0/0 | 0/1 | ./.:-3.1549,-0.000304111,-5 | 0/0 or ./. |
| 20 | 25478962 | EX_ILLUMINA | <1% | HG00364 (SOLiD) | 0/0 | 0/1 | ./.:-4.69838,-8.69777e-06,-5 | 0/0 or ./. |
| 20 | 25478962 | EX_ILLUMINA | <1% | HG00593 (SOLiD) | 0/0 | 0/1 | ./.:-3.1938,-0.000278053,-5 | 0/0 or ./. |
| 20 | 25478962 | EX_ILLUMINA | <1% | HG01271 (SOLiD) | 0/0 | 0/1 | ./.:-0.31142,-0.290883,-5 | 0/0 or ./. |
| 20 | 60885811 | EX_ILLUMINA | <1% | HG00134 (SOLiD) | 0/0 | 0/1 | ./.:-0.477139,-0.477113,-0.477113 | 0/0 or ./. |
| 20 | 60885811 | EX_ILLUMINA | <1% | HG00350 (SOLiD) | 0/0 | 0/1 | ./.:-0.123447,-0.61343,-2.41117 | 0/0 or ./. |
| 20 | 62326235 | EX_ILLUMINA | <1% | HG00128 (SOLiD) | 0/0 | 0/1 | ./.:-4.22169,-2.6068e-05,-5 | 0/0 or ./. |
| 20 | 62326235 | EX_ILLUMINA | <1% | HG00179 (SOLiD) | 0/0 | 0/1 | ./.:-3.22182,-0.000773747,-2.92812 | 0/0 or ./. |

SNPs were called in one sample but incorrectly imputed in other samples
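To read the GL column, it can help to turn the three log10 likelihoods (RR/RA/AA) into normalized probabilities. The sketch below does this for one entry from the table, assuming a flat prior over the three genotypes; the flat prior and the helper names are assumptions for illustration, not the method used in the project.

```python
LABELS = ("0/0", "0/1", "1/1")  # RR, RA, AA in VCF genotype notation


def parse_gl(field):
    """Split 'GT:gl_RR,gl_RA,gl_AA' into the genotype string and three floats."""
    genotype, gl_text = field.split(":")
    return genotype, [float(value) for value in gl_text.split(",")]


def genotype_probabilities(log10_gl):
    """Normalize log10 likelihoods into probabilities (flat prior assumed)."""
    likelihoods = [10 ** value for value in log10_gl]
    total = sum(likelihoods)
    return [value / total for value in likelihoods]


# First row of the table (NA19468).
genotype, gl = parse_gl("./.:-5,-0.000391054,-3.04576")
probs = genotype_probabilities(gl)
best = LABELS[probs.index(max(probs))]
print(genotype, best, [round(p, 4) for p in probs])
```

For this entry nearly all of the probability mass sits on the heterozygous genotype.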

Page 14: presentation in 1000 Genomes Phase2 meeting

Integrating exome genotypes with GL

Override generic GL with exome and SNP array genotypes
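The slide does not say how the override is implemented. One possible reading, sketched here under stated assumptions, is to pin the GL vector to the trusted genotype whenever an exome or array call exists and to leave the low-coverage GLs untouched otherwise. The function name and the floor of -5 (the smallest value that appears in the GL table) are hypothetical choices.

```python
GL_FLOOR = -5.0  # assumed floor for non-called genotypes, not a documented parameter
GENOTYPE_INDEX = {"0/0": 0, "0/1": 1, "1/1": 2}  # RR, RA, AA


def override_gl(generic_gl, trusted_call, floor=GL_FLOOR):
    """Return log10 GLs pinned to a trusted genotype call.

    If the trusted call is missing ('./.') or unrecognized, the generic
    low-coverage GLs are returned unchanged.
    """
    index = GENOTYPE_INDEX.get(trusted_call)
    if index is None:
        return list(generic_gl)
    return [0.0 if i == index else floor for i in range(3)]


generic = [-5.0, -0.00020851, -3.31876]  # GL values from the table (NA18510)
print(override_gl(generic, "0/0"))  # exome call present: GLs pinned to 0/0
print(override_gl(generic, "./."))  # no exome call: GLs left as they were
```

A softer variant could add the exome evidence to the generic GLs instead of replacing them; the hard override above is simply the most literal reading of the slide.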

Page 15: presentation in 1000 Genomes Phase2 meeting

Future work

• Use both Illumina and SOLiD exome data to assist whole genome SNP calling in the next experiment
• Integrate exome genotype calls in whole genome imputation

Page 16: presentation in 1000 Genomes Phase2 meeting

Acknowledgements

HGSC-BCM
• Fuli Yu
• Danny Challis
• Uday Evani
• Matthew Bainbridge
• Donna Muzny
• Jeffrey Reid
• Richard Gibbs
• Yi Wang

BlueBioU@Rice
• Research Computing group