presentation in 1000 Genomes Phase2 meeting

Phase2 SNP calling: Low-coverage and exome

Jin Yu 2012/05/07

Overview

•  Two ideas in Phase2 SNP calling

– Use exome off-target reads for whole genome SNP site discovery

– Use exome genotype calls to improve overall genotype accuracy

•  Preliminary results and plan

Review of phase1 SNP pipeline

•  Low coverage (~4X) WGS BAMs: multi-sample calling, giving population SNP sites

•  High coverage (~50X) WES BAMs: single-sample calling, giving individual SNP sites and genotypes

•  Apply multi-center consensus strategy and merge SNP sites

•  Calculate genotype likelihood on all candidate sites

•  Impute genotype/haplotype

Two ideas in Phase2

•  Low coverage (~4X) WGS BAMs: multi-sample calling, giving population SNP sites

•  High coverage (~50X) WES BAMs: single-sample calling, giving individual SNP sites and genotypes

•  Apply multi-center consensus strategy and merge SNP sites

•  Calculate genotype likelihood on all candidate sites

•  Impute genotype/haplotype

The first idea

•  Use exome off-target reads in whole genome SNP calling

– Exome off-target reads add significant coverage across the whole genome

– Preliminary results showed higher SNP sensitivity and reasonable quality

Exome off-target reads are significant

Figure: Average coverage on off-target regions, per sample, exome vs. lowpass (y-axis 0 to 20)

Figure: % off-target reads in exome capture sequencing, per sample (y-axis 0% to 100%)

•  ~50% of exome capture sequencing reads are off-target

•  Off-target reads add ~1X average coverage across the whole genome
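The two numbers above can be illustrated with a minimal sketch. It classifies reads as on- or off-target against a list of capture intervals, then converts the off-target bases into mean genome-wide coverage. The toy intervals, the fixed read length, and the function name `off_target_stats` are assumptions made for illustration, not the pipeline behind these slides.

```python
from bisect import bisect_right

def off_target_stats(read_starts, read_len, targets, genome_len):
    """Return (off-target read fraction, mean off-target coverage).

    read_starts: 0-based start positions of fixed-length reads
    targets:     sorted, non-overlapping half-open (start, end) capture intervals
    genome_len:  length of the reference the coverage is averaged over
    """
    starts = [s for s, _ in targets]
    off_target_reads = 0
    for pos in read_starts:
        last_base = pos + read_len - 1
        # Index of the last target that starts at or before the read's last base.
        i = bisect_right(starts, last_base) - 1
        overlaps = i >= 0 and targets[i][1] > pos
        if not overlaps:
            off_target_reads += 1
    fraction = off_target_reads / len(read_starts)
    coverage = off_target_reads * read_len / genome_len
    return fraction, coverage

# Toy data (hypothetical): two capture targets on a 1 kb "genome".
targets = [(100, 200), (500, 600)]
read_starts = [0, 60, 120, 199, 200, 460, 700, 900]
fraction, coverage = off_target_stats(read_starts, 50, targets, genome_len=1000)
print(fraction, coverage)
```

A read counts as on-target here if it overlaps a target by at least one base. A real analysis might instead require a minimum overlap or pad the targets.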

Exome off-target reads are evenly distributed

• Weighted read depths calculated using EBD in 5kb sliding windows across chr20
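As a rough sketch of the windowing step only, the code below computes the plain mean depth in fixed-size sliding windows from a per-base depth vector. It does not reproduce the EBD weighting mentioned above. The step size and the function name are assumptions, since the slide gives only the 5 kb window size.

```python
from itertools import accumulate

def sliding_window_mean_depth(depth, window=5000, step=1000):
    """Return a list of (window_start, mean_depth) over a per-base depth list.

    Uses prefix sums so each window costs O(1) after one pass over the data.
    The step of 1000 bases is an assumed value.
    """
    prefix = [0] + list(accumulate(depth))
    means = []
    for start in range(0, len(depth) - window + 1, step):
        total = prefix[start + window] - prefix[start]
        means.append((start, total / window))
    return means

# Toy example (hypothetical depths): 10 bases at depth 1, then 10 at depth 3.
toy_depth = [1] * 10 + [3] * 10
windows = sliding_window_mean_depth(toy_depth, window=10, step=5)
print(windows)
```

An even distribution would show up as window means that stay close to the chromosome-wide mean.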

SNP calling experiment

•  Using all 1449 phase2 lowpass BAMs and 1182 exome Illumina BAMs

•  Calling model modified from SNPtools

– Combining reads of the same sample to estimate the variance of true variant reads

– Grouping reads of the same sequencing platform to estimate the variance of platform-specific bias

SNP calls comparison (chr20 off-target regions)

Call set                | #SNP    | Ti/Tv | # in Phase1 | Known Ti/Tv | % Rare (MAF<1%) | % Novel to Phase1 | Novel Ti/Tv | OMNI poly sensitivity | OMNI mono false discovery
BI phase2 baseline      | 821,141 | 2.31  | 514,021     | 2.34        | 72.5%           | 37.4%             | 2.24        | 98.2% (50,195/51,126) | 0.9% (12/1265)
BCM phase2 baseline     | 847,274 | 2.33  | 502,517     | 2.42        | 68.6%           | 40.7%             | 2.20        | 98.6% (50,406/51,126) | 1.9% (24/1265)
BCM Phase2 experimental | 911,602 | 2.32  | 521,189     | 2.42        | 69.7%           | 42.8%             | 2.19        | 98.8% (50,494/51,126) | 2.1% (27/1265)
Additional SNPs         | 64,328  | 2.17  | 18,672      | 2.26        | 99.1%           | 71.0%             | 2.13        | 0.2% (88/51,126)      | 0.2% (3/1265)

•  Called ~7% more SNPs on off-target regions by using exome reads

•  Ti/Tv and OMNI metrics showed reasonable quality

•  Additional SNPs are mostly rare in phase1 calls or novel SNPs
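The "Additional SNPs" row can be checked directly against the two BCM rows. This small sketch uses only counts from the table, and the dictionary keys are illustrative names. It recomputes the differences, the novel fraction, and the size of the gain.

```python
# Counts copied from the BCM rows of the comparison table.
baseline = {"snps": 847_274, "in_phase1": 502_517, "omni_poly": 50_406, "omni_mono": 24}
experimental = {"snps": 911_602, "in_phase1": 521_189, "omni_poly": 50_494, "omni_mono": 27}

# Additional SNPs = experimental minus baseline, field by field.
additional = {key: experimental[key] - baseline[key] for key in baseline}

# Share of the additional SNPs that are not in Phase1.
pct_novel = 100 * (1 - additional["in_phase1"] / additional["snps"])

# Size of the gain, as a share of the experimental set and relative to baseline.
gain_of_experimental = 100 * additional["snps"] / experimental["snps"]
gain_over_baseline = 100 * additional["snps"] / baseline["snps"]

print(additional)
print(round(pct_novel, 1), round(gain_of_experimental, 1), round(gain_over_baseline, 1))
```

The differences reproduce the table row (64,328 SNPs, 18,672 in Phase1, 88 OMNI poly, 3 OMNI mono) and the 71.0% novel fraction. The additional SNPs make up about 7.1% of the experimental call set, or about 7.6% when measured against the baseline.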

MAF distribution comparison (after imputation)

Both increasing sample size and adding exome reads increase SNP discovery rate on the rare end (0.1% bin)

The second idea

•  Using exome calls to refine genotype imputation

– Exome calls are of high quality and independent from AF

– Exome pipeline addressed platform/capture-specific errors

A snapshot of Phase1 exome SNP validation results

Category    | total | submitted | yield | validated | validated/yield
singleton   | 5372  | 100       | 93    | 92        | 98.9%
<1%         | 4430  | 50        | 49    | 47        | 95.9%
>1%         | 1896  | 50        | 46    | 46        | 100%
SVM overall | 11698 | 200       | 188   | 185       | 98.4%

Why does the <1% bin have the lowest validation rate?

•  Validation sample selection

•  Imputation artifacts

A closer look at imputation artifacts

Chr | Pos      | Site source | AC        | Sample picked for validation | PCR-454 validation | Phase1 release v1 | GL in log-10 scale (RR/RA/AA)     | Exome calls (BCM)
20  | 20033172 | EX_SOLID    | singleton | NA19468 (SOLID)              | 0/0                | 0/1               | ./.:-5,-0.000391054,-3.04576      | 0/1
20  | 23667835 | EX_ILLUMINA | <1%       | NA18510 (Illumina)           | 0/0                | 0/1               | ./.:-5,-0.00020851,-3.31876       | 0/0 or ./.
20  | 23667835 | EX_ILLUMINA | <1%       | NA18858 (Illumina)           | 0/0                | 0/1               | ./.:-2.72124,-0.000825952,-5      | 0/0 or ./.
20  | 25478962 | EX_ILLUMINA | <1%       | HG00104 (SOLiD)              | 0/0                | 0/1               | ./.:-5,0,-5                       | 0/0 or ./.
20  | 25478962 | EX_ILLUMINA | <1%       | HG00234 (SOLiD)              | 0/0                | 0/1               | ./.:-3.1549,-0.000304111,-5       | 0/0 or ./.
20  | 25478962 | EX_ILLUMINA | <1%       | HG00364 (SOLiD)              | 0/0                | 0/1               | ./.:-4.69838,-8.69777e-06,-5      | 0/0 or ./.
20  | 60885811 | EX_ILLUMINA | <1%       | HG00134 (SOLiD)              | 0/0                | 0/1               | ./.:-0.477139,-0.477113,-0.477113 | 0/0 or ./.
20  | 62326235 | EX_ILLUMINA | <1%       | HG00128 (SOLiD)              | 0/0                | 0/1               | ./.:-4.22169,-2.6068e-05,-5       | 0/0 or ./.


SNPs were called in one sample but incorrectly imputed in other samples
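To make the GL column easier to read, here is a small sketch that parses a `GT:GL` field from the table, reports the genotype with the highest log-10 likelihood, and reports how far it leads the runner-up. The helper names are hypothetical, and the RR/RA/AA to 0/0, 0/1, 1/1 mapping follows the column header.

```python
GENOTYPES = ("0/0", "0/1", "1/1")  # RR, RA, AA in the table's GL order

def parse_gl(field):
    """Split a field such as './.:-5,-0.000391054,-3.04576' into three floats."""
    _, gl_text = field.split(":")
    return tuple(float(value) for value in gl_text.split(","))

def most_likely_genotype(gls):
    """Genotype with the largest log-10 likelihood (first one wins on ties)."""
    best = max(range(len(gls)), key=lambda i: gls[i])
    return GENOTYPES[best]

def gl_margin(gls):
    """Log-10 gap between the best and second-best genotype."""
    ranked = sorted(gls, reverse=True)
    return ranked[0] - ranked[1]

na19468 = parse_gl("./.:-5,-0.000391054,-3.04576")
hg00134 = parse_gl("./.:-0.477139,-0.477113,-0.477113")

print(most_likely_genotype(na19468), gl_margin(na19468))
print(most_likely_genotype(hg00134), gl_margin(hg00134))
```

For NA19468 the likelihoods favor 0/1 by about three log-10 units. For HG00134 all three values sit near log10(1/3), so the reads carry essentially no information and the released 0/1 call must come from the imputation step rather than from that sample's data.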

Integrating exome genotypes with GL

Override generic GL by exome and SNP array genotypes
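One way to read "override" is sketched below, under stated assumptions: when a sample has a hard genotype call from exome or array data, its GL triple is replaced by one that puts all the weight on that call. The -5 floor mirrors the smallest GL value seen in the table earlier and is an assumption, as is the function name. How a conflict between exome and array calls is resolved is not stated on the slide, so the sketch takes a single call.

```python
GENOTYPES = ("0/0", "0/1", "1/1")  # RR, RA, AA
GL_FLOOR = -5.0  # assumed floor, chosen to match the lowest GL in the table

def override_gl(gls, hard_call=None):
    """Return GLs to pass on to imputation.

    gls:       (RR, RA, AA) log-10 likelihoods from low-coverage reads
    hard_call: '0/0', '0/1' or '1/1' from exome/array data, else None or './.'
    """
    if hard_call not in GENOTYPES:
        # No usable hard call: keep the generic likelihoods unchanged.
        return tuple(gls)
    return tuple(0.0 if genotype == hard_call else GL_FLOOR for genotype in GENOTYPES)

generic = (-5.0, -0.00020851, -3.31876)   # NA18510 at 20:23667835
print(override_gl(generic, "0/0"))        # exome call present
print(override_gl(generic, "./."))        # exome call missing
```

A softer variant could add a bonus to the called genotype instead of flooring the other two, so that strong read evidence can still disagree with the hard call.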

Future work

•  Use both Illumina and SOLiD exome data to assist whole genome SNP calling in next experiment

•  Integrate exome genotype calls in whole genome imputation

Acknowledgements

HGSC-BCM

•  Fuli Yu

•  Danny Challis

•  Uday Evani

•  Matthew Bainbridge

•  Donna Muzny

•  Jeffrey Reid

•  Richard Gibbs

•  Yi Wang

BlueBioU@Rice

•  Research Computing group
