Variant calling: number of individuals vs. depth of read coverage Gabor T. Marth Boston College...

Variant calling: number of individuals vs. depth of read coverage. Gabor T. Marth, Boston College Biology Department. 1000 Genomes Meeting, Cold Spring Harbor Laboratory, May 5-6, 2008


Page 1:

Variant calling: number of individuals vs. depth of read coverage

Gabor T. Marth
Boston College Biology Department

1000 Genomes Meeting
Cold Spring Harbor Laboratory
May 5-6, 2008

Page 2:

Single-base variant calling in 1000G data

1. SNP discovery (for potential follow-up genotyping)

2. Possibly using genotypes called directly from sequence for haplotype phasing (genotype imputation?)

Sample size x read coverage / individual = constant

What is the best sample size?

Not easy to answer based only on idealized theoretical considerations

Simulation studies must model many effects to be realistic
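The fixed-budget trade-off can be sketched in a few lines. This is only an illustration: the function names are hypothetical, the budget of 1600 (individuals x fold coverage) is inferred from the designs named later in the talk, and the heterozygote calculation is exactly the kind of idealized consideration the slide warns about (no sequencing error, every individual covered at exactly the nominal depth).

```python
def equal_cost_designs(total_coverage=1600, depths=(16, 8, 4, 2)):
    """List (individuals, depth) pairs whose product equals the fixed budget."""
    return [(total_coverage // depth, depth) for depth in depths]


def prob_both_alleles_seen(depth):
    """Chance that a heterozygote shows at least one read of each allele.

    Idealized assumption: exactly `depth` error-free reads, each drawn from
    either chromosome with probability 1/2.
    """
    return 1.0 - 2.0 * 0.5 ** depth


for individuals, depth in equal_cost_designs():
    print(individuals, depth, prob_both_alleles_seen(depth))
```

More individuals sample more chromosomes from the population, while deeper coverage makes each individual's genotype easier to resolve; the sketch only quantifies the second effect.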

Page 3:

Variant discovery is a complex process

[Figure: sequencing reads (aacgtCaggct / aacgtTaggct, differing at a single C/T site) drawn from fragments of the samples in a population, with per-individual genotypes G1, G2, G3. Labeled components: genotype priors, allele sampling likelihoods, base error probabilities.]

Page 4:

Bayesian variant detection math

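$$
\Pr(G_1, \dots, G_n \mid B) =
\frac{\prod_{i=1}^{n} \Pr(B_i \mid T_i)\, \Pr(T_i \mid G_i) \; \Pr(G_1, \dots, G_n)}
     {\sum_{G'_1, \dots, G'_n} \prod_{i=1}^{n} \Pr(B_i \mid T_i)\, \Pr(T_i \mid G'_i) \; \Pr(G'_1, \dots, G'_n)}
$$

Here B_i are the base calls, T_i the sampled DNA fragments, and G_i the diploid genotype of individual i; the denominator sums over all genotype combinations.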

Priors: (1) nucleotide diversity; (2) allele frequency distribution; (3) specific diploid genotype layout

Allele sampling likelihoods: binomial distribution of the number of reads from each of the two chromosomes

Base error probabilities: likelihood that the called base faithfully represents the DNA fragment, calculated from the base quality values
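As a rough illustration of how the last two components combine for one individual, the sketch below computes Pr(reads | genotype) at a biallelic site. It is not the GigaBayes implementation. The function names, the C/T alleles, the Phred-scaled qualities, and the choice to spread a miscall evenly over the three other bases are all assumptions made for the example.

```python
def phred_to_error(quality):
    """Convert a Phred-scaled base quality into an error probability."""
    return 10 ** (-quality / 10)


def genotype_likelihoods(bases, quals, ref="C", alt="T"):
    """Return Pr(reads | genotype) for 0, 1 and 2 copies of the alt allele."""
    likelihoods = []
    for alt_copies in (0, 1, 2):
        # Allele sampling: each read comes from an alt-carrying chromosome
        # with probability alt_copies / 2 (per-read form of the binomial).
        p_from_alt = alt_copies / 2
        lik = 1.0
        for base, qual in zip(bases, quals):
            err = phred_to_error(qual)
            # Base error: the called base matches the fragment with
            # probability 1 - err, else is one of the 3 other bases (assumed).
            p_given_ref = (1 - err) if base == ref else err / 3
            p_given_alt = (1 - err) if base == alt else err / 3
            lik *= (1 - p_from_alt) * p_given_ref + p_from_alt * p_given_alt
        likelihoods.append(lik)
    return likelihoods


print(genotype_likelihoods("CT", [30, 30]))
```

Multiplying these likelihoods by a genotype prior and normalizing gives a single-individual posterior; the multi-individual formula on the slide does the same over genotype combinations.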

Page 5:

SNP calling and genotyping

P(SNP) = total probability of all non-monomorphic genotype combinations

P(G_i) = marginal probability of the genotype of individual i, summed over the genotypes of all other individuals

Consequence: data from other individuals influence the genotype call of a given individual (to be illustrated with the testProb program in the GigaBayes package).
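The effect can be reproduced with a toy brute-force calculation. This is a sketch under stated assumptions, not the testProb program: the prior (monomorphic with weight 1 - theta, otherwise Hardy-Weinberg genotypes at an allele frequency drawn from a small grid), the value of theta, the function names, and the per-individual likelihood triples are all invented for illustration.

```python
import math
from itertools import product


def hwe_prior(genotype, alt_freq):
    """Hardy-Weinberg probability of 0, 1 or 2 copies of the alt allele."""
    f = alt_freq
    return ((1 - f) ** 2, 2 * f * (1 - f), f ** 2)[genotype]


def joint_posterior(liks, theta=0.001, freq_grid=(0.05, 0.25, 0.5, 0.75, 0.95)):
    """Posterior over all genotype combinations, by enumeration (small n only).

    liks[i][g] is Pr(reads of individual i | g copies of the alt allele).
    """
    # Assumed prior: alt frequency 0 with weight 1 - theta, else uniform on grid.
    weighted_freqs = [(0.0, 1 - theta)]
    weighted_freqs += [(f, theta / len(freq_grid)) for f in freq_grid]
    unnorm = {}
    for combo in product((0, 1, 2), repeat=len(liks)):
        prior = sum(w * math.prod(hwe_prior(g, f) for g in combo)
                    for f, w in weighted_freqs)
        likelihood = math.prod(lik[g] for lik, g in zip(liks, combo))
        unnorm[combo] = prior * likelihood
    total = sum(unnorm.values())
    return {combo: p / total for combo, p in unnorm.items()}


def p_snp(posterior):
    """Total probability of all non-monomorphic genotype combinations."""
    n = len(next(iter(posterior)))
    return 1.0 - posterior[(0,) * n] - posterior[(2,) * n]


def marginal(posterior, i):
    """Marginal genotype probabilities of individual i."""
    probs = [0.0, 0.0, 0.0]
    for combo, p in posterior.items():
        probs[combo[i]] += p
    return probs


weak = (0.4, 0.6, 0.01)        # ambiguous data for individual 0
hom_ref = (0.99, 0.01, 1e-6)   # neighbors with clear reference data
het = (1e-6, 0.25, 1e-6)       # neighbors with clear heterozygous data
with_ref = joint_posterior([weak, hom_ref, hom_ref])
with_het = joint_posterior([weak, het, het])
print(marginal(with_ref, 0)[1], marginal(with_het, 0)[1])
```

Individual 0 has identical data in both runs, yet its heterozygote probability is far higher when the other two individuals clearly carry the alternate allele, because they shift the posterior weight toward polymorphic genotype combinations. The enumeration grows as 3^n, so it is only usable for a handful of individuals.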

Page 6:

Variant calling in simulated data: design

Analysis by Aaron Quinlan (see poster at the Genome Meeting)

Page 7:

Estimated vs. population allele frequency

[Figure: four scatter plots of observed vs. expected minor allele frequency, both axes 0 to 0.5]

100 Inds @ 16X: corr = 0.89
200 Inds @ 8X: corr = 0.92
400 Inds @ 4X: corr = 0.93
800 Inds @ 2X: corr = 0.91

Page 8:

Allele frequency (cont’d)

[Figure: observed vs. expected minor allele frequency for 800 Inds @ 2X, corr = 0.91, with a second panel whose axes both span 0.005 to 0.05]

Page 9:

SNP discovery sensitivity

Page 10:

Genotype density

[Figure: fraction of confident genotypes (0 to 1) by SNP position (0 to 50,000)]

100 @ 16x: 0.975 +/- 0.121
200 @ 8x: 0.968 +/- 0.129
400 @ 4x: 0.924 +/- 0.151
800 @ 2x: 0.769 +/- 0.154
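One way such a per-SNP fraction could be computed is sketched below. The slides do not say what makes a genotype call "confident"; the rule used here (largest marginal genotype probability at or above a threshold of 0.99) and the function name are assumptions.

```python
def confident_fraction(marginals, threshold=0.99):
    """Fraction of samples whose best genotype reaches the threshold.

    marginals holds one (hom-ref, het, hom-alt) probability triple per sample
    at a single SNP; the 0.99 cutoff is an assumed value.
    """
    confident = sum(1 for probs in marginals if max(probs) >= threshold)
    return confident / len(marginals)


site = [(0.995, 0.005, 0.0), (0.6, 0.4, 0.0), (0.0, 0.999, 0.001), (0.5, 0.5, 0.0)]
print(confident_fraction(site))
```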

Page 11:

Genotype density

[Figure: four histograms of number of SNPs vs. fraction of samples w/ confident genotype (0 to 1), one panel per design: 100 Inds @ 16X, 200 Inds @ 8X, 400 Inds @ 4X, 800 Inds @ 2X]

Page 12:

Summary / Conclusions

Page 13:

Thanks