Variant calling: number of individuals vs. depth of read coverage

Gabor T. Marth

Boston College Biology Department

1000 Genomes Meeting

Cold Spring Harbor Laboratory

May 5-6, 2008

Single-base variant calling in 1000G data

1. SNP discovery (for potential follow-up genotyping)

2. Possibly using genotypes called from sequence directly for haplotype phasing (genotype imputation?)

Sample size x read coverage per individual = constant

What is the best sample size?

Not easy to answer based only on idealized theoretical considerations

Simulation studies must model many effects to be realistic
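The trade-off can be made concrete with a small sketch. It assumes a total budget of 1600X (the product implied by the designs evaluated later in the talk, e.g. 100 individuals at 16X) and Poisson-distributed read depth, which is a simplifying assumption not stated on the slide. The function names are illustrative only.

```python
import math


def designs(total_coverage, sample_sizes):
    """Return (sample size, coverage per individual) pairs with a fixed product."""
    return [(n, total_coverage / n) for n in sample_sizes]


def prob_uncovered(coverage):
    """Probability that one individual has no read at a site.

    Assumes read depth is Poisson with mean `coverage` (an idealization).
    """
    return math.exp(-coverage)


# Assumed budget of 1600X in total, split across different sample sizes.
for n, cov in designs(1600, [100, 200, 400, 800]):
    print(f"{n} individuals @ {cov:g}X: "
          f"P(no reads at a site) = {prob_uncovered(cov):.3g}")
```

More individuals capture more of the population's chromosomes, but each individual is then more likely to have few or no reads at a given site.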

Variant discovery is a complex process

[Figure: diagram of the sampling hierarchy, from population to samples (genotypes G1, G2, G3) to fragments to sequence reads, illustrated with copies of the two alleles aacgtCaggct and aacgtTaggct. Labels: genotype priors; allele sampling likelihoods; base error probabilities.]

Bayesian variant detection math

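$$
\Pr(G_1,\ldots,G_n \mid B) = \frac{\Pr(G_1,\ldots,G_n)\,\prod_{k=1}^{n}\sum_{T_k}\Pr(B_k \mid T_k)\,\Pr(T_k \mid G_k)}{\sum_{G'_1,\ldots,G'_n}\Pr(G'_1,\ldots,G'_n)\,\prod_{k=1}^{n}\sum_{T_k}\Pr(B_k \mid T_k)\,\Pr(T_k \mid G'_k)}
$$

where, for individual $k$ of $n$, $G_k$ is the diploid genotype, $T_k$ the alleles of the sampled fragments, $B_k$ the base calls in the reads, and $B = (B_1,\ldots,B_n)$.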

Priors: (1) nucleotide diversity; (2) allele frequency distribution; (3) specific diploid genotype layout

Allele sampling likelihoods: binomial distribution of the number of reads from each of the two chromosomes

Base error probabilities: likelihood that the called base faithfully represents the DNA fragment, calculated from the base quality values
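As a rough illustration of how the three terms combine for a single individual, the sketch below computes genotype posteriors from base calls and quality values. It is not the GigaBayes implementation. It assumes Phred-scaled qualities, errors spread evenly over the three other bases, and each read drawn from either chromosome with probability 1/2 (the per-read form of the binomial sampling term). The prior values in the usage example are placeholders.

```python
def phred_to_error(q):
    """Convert a Phred quality value to a base error probability."""
    return 10 ** (-q / 10)


def base_likelihood(base, allele, q):
    """P(called base | true allele), errors split evenly over 3 other bases."""
    e = phred_to_error(q)
    return 1 - e if base == allele else e / 3


def genotype_likelihood(reads, genotype):
    """P(reads | genotype), summing over which chromosome each read came from."""
    a1, a2 = genotype
    lik = 1.0
    for base, q in reads:
        lik *= 0.5 * base_likelihood(base, a1, q) + 0.5 * base_likelihood(base, a2, q)
    return lik


def genotype_posteriors(reads, priors):
    """Posterior over genotypes: prior times likelihood, normalized."""
    unnorm = {g: priors[g] * genotype_likelihood(reads, g) for g in priors}
    total = sum(unnorm.values())
    return {g: v / total for g, v in unnorm.items()}


# Hypothetical priors for a site with reference allele C and alternate T.
priors = {("C", "C"): 0.998, ("C", "T"): 0.0015, ("T", "T"): 0.0005}
reads = [("C", 30)] * 4 + [("T", 30)] * 4  # (base call, quality value)
print(genotype_posteriors(reads, priors))
```

With four high-quality reads for each allele, the heterozygous genotype dominates the posterior even though its prior is small.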

SNP calling and genotyping

P(SNP) = total probability of all non-monomorphic genotype combinations

P(Gi) = marginal probability

Consequence: data from other individuals influence the genotype call of a given individual (to be illustrated using the testProb program in the GigaBayes package).
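The effect of other individuals on one individual's call can be sketched by enumerating all genotype combinations for a few samples. This is not the testProb program. The joint prior is an assumption made for illustration: the alternate allele count i has prior theta / i, and chromosomes carrying the allele are assigned at random, which gives each genotype layout with h heterozygotes the weight 2^h / C(2n, i). Genotypes are coded as the number of alternate alleles (0, 1, 2), and combinations where every sample is homozygous for the same allele are treated as monomorphic.

```python
from itertools import product
from math import comb


def joint_prior(combo, theta=0.001):
    """Assumed prior for one genotype combination (alt-allele counts per sample)."""
    n_chrom = 2 * len(combo)
    alt = sum(combo)
    if alt == 0:
        # Remaining mass after all non-zero allele counts are accounted for.
        return 1.0 - theta * sum(1.0 / i for i in range(1, n_chrom + 1))
    hets = sum(1 for g in combo if g == 1)
    return (theta / alt) * (2 ** hets) / comb(n_chrom, alt)


def joint_posterior(likelihoods, theta=0.001):
    """Posterior over all genotype combinations.

    `likelihoods[k][g]` is P(data of sample k | genotype g) for g in 0, 1, 2.
    """
    unnorm = {}
    for combo in product((0, 1, 2), repeat=len(likelihoods)):
        weight = joint_prior(combo, theta)
        for lik, g in zip(likelihoods, combo):
            weight *= lik[g]
        unnorm[combo] = weight
    total = sum(unnorm.values())
    return {combo: w / total for combo, w in unnorm.items()}


def prob_snp(posterior):
    """P(SNP): total posterior of all non-monomorphic combinations."""
    n_chrom = 2 * len(next(iter(posterior)))
    return sum(p for combo, p in posterior.items() if 0 < sum(combo) < n_chrom)


def marginal(posterior, individual):
    """P(G_i): marginal genotype posterior of one sample."""
    out = [0.0, 0.0, 0.0]
    for combo, p in posterior.items():
        out[combo[individual]] += p
    return out


# Hypothetical per-sample likelihoods for genotypes 0, 1, 2.
ambiguous = (0.6, 0.4, 0.0001)
hom_ref = (1.0, 1e-6, 1e-12)
het = (1e-6, 1.0, 1e-6)

alone = marginal(joint_posterior([ambiguous, hom_ref, hom_ref]), 0)[1]
with_carriers = marginal(joint_posterior([ambiguous, het, het]), 0)[1]
print(f"P(het) with hom-ref neighbours: {alone:.3g}")
print(f"P(het) with het neighbours:     {with_carriers:.3g}")
```

The same ambiguous data yield a markedly higher heterozygote probability when other samples clearly carry the allele, because the joint prior then favors a segregating site.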

Variant calling in simulated data: design

Analysis by Aaron Quinlan (see poster at the Genome Meeting)

Estimated vs. population allele frequency

[Figure: observed vs. expected minor allele frequency, both axes 0 to 0.5, one panel per design]

100 Inds @ 16X: corr = 0.89

200 Inds @ 8X: corr = 0.92

400 Inds @ 4X: corr = 0.93

800 Inds @ 4X (as labeled; the genotype density slides list this design as 800 @ 2X): corr = 0.91
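The kind of comparison shown in these panels can be sketched with a toy simulation. It is much simpler than the analysis behind the slide: it assumes error-free reads, Poisson read depth, and a pooled read-count estimate of allele frequency rather than a Bayesian caller, so its correlations are not comparable to the values reported above. All function names and parameters are illustrative.

```python
import math
import random


def poisson(rng, lam):
    """Draw a Poisson variate with mean `lam` (Knuth's multiplication method)."""
    limit = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def observed_maf(rng, pop_freq, n_inds, coverage):
    """Pooled minor allele frequency estimate at one site from simulated reads."""
    alt_reads = total_reads = 0
    for _ in range(n_inds):
        # Two chromosomes per individual, each carrying the allele with pop_freq.
        alt_copies = (rng.random() < pop_freq) + (rng.random() < pop_freq)
        depth = poisson(rng, coverage)
        # Each read samples one of the two chromosomes at random.
        alt_reads += sum(rng.random() < alt_copies / 2 for _ in range(depth))
        total_reads += depth
    if total_reads == 0:
        return 0.0
    freq = alt_reads / total_reads
    return min(freq, 1.0 - freq)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def maf_correlation(n_inds, coverage, n_sites=100, seed=0):
    """Correlation between expected and observed MAF over simulated sites."""
    rng = random.Random(seed)
    expected = [rng.uniform(0.01, 0.5) for _ in range(n_sites)]
    observed = [observed_maf(rng, f, n_inds, coverage) for f in expected]
    return pearson(expected, observed)


for n_inds, coverage in [(100, 16), (200, 8), (400, 4), (800, 2)]:
    print(f"{n_inds} Inds @ {coverage}X: corr = {maf_correlation(n_inds, coverage):.2f}")
```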

Allele frequency (cont’d)

[Figure: observed vs. expected minor allele frequency for 800 Inds @ 4X (as labeled), corr = 0.91, with a second set of axes covering the low-frequency range 0.005 to 0.05]

SNP discovery sensitivity
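One ingredient of discovery sensitivity can be computed directly: an allele can only be found if at least one sampled chromosome carries it. The sketch below gives that upper bound for an allele at population frequency f in n diploid individuals, assuming chromosomes are sampled independently. It ignores coverage and base errors, which reduce sensitivity further.

```python
def prob_allele_sampled(pop_freq, n_inds):
    """P(at least one of the 2n sampled chromosomes carries the allele)."""
    return 1.0 - (1.0 - pop_freq) ** (2 * n_inds)


# Illustrative rare allele at 0.1% population frequency.
for n in (100, 200, 400, 800):
    print(f"{n} individuals: {prob_allele_sampled(0.001, n):.3f}")
```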

Genotype density

[Figure: fraction of confident genotypes (0 to 1) plotted against SNP position (0 to 50,000)]

100 @ 16x: 0.975 +/- 0.121

200 @ 8x: 0.968 +/- 0.129

400 @ 4x: 0.924 +/- 0.151

800 @ 2x: 0.769 +/- 0.154

Genotype density

[Figure: histograms of the number of SNPs by fraction of samples with a confident genotype (0 to 1), one panel per design: 100 Inds @ 16X, 200 Inds @ 8X, 400 Inds @ 4X, 800 Inds @ 2X]
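A minimal sketch of how such a genotype density could be tabulated from per-sample genotype posteriors follows. The slides do not state what counts as a confident genotype, so the 0.99 posterior threshold here is a hypothetical choice, as are the function names.

```python
def confident_fraction(posteriors, threshold=0.99):
    """Fraction of samples whose best genotype posterior reaches the threshold.

    `posteriors` holds one tuple of genotype posteriors per sample at a SNP.
    """
    if not posteriors:
        return 0.0
    confident = sum(1 for p in posteriors if max(p) >= threshold)
    return confident / len(posteriors)


def density_histogram(fractions, n_bins=10):
    """Count SNPs per bin of confident-genotype fraction (range 0 to 1)."""
    counts = [0] * n_bins
    for f in fractions:
        # A fraction of exactly 1.0 goes into the last bin.
        counts[min(int(f * n_bins), n_bins - 1)] += 1
    return counts


# Hypothetical posteriors for four samples at one SNP.
site = [(0.995, 0.004, 0.001), (0.6, 0.4, 0.0), (0.0, 0.999, 0.001), (0.5, 0.5, 0.0)]
print(confident_fraction(site))
```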

Summary / Conclusions

Thanks