Variant calling: number of individuals vs. depth of read coverage
Gabor T. Marth, Boston College Biology Department
1000 Genomes Meeting, Cold Spring Harbor Laboratory, May 5-6, 2008
Single-base variant calling in 1000G data
1. SNP discovery (for potential follow-up genotyping)
2. Possibly using genotypes called from sequence directly for haplotype phasing (genotype imputation?)
Sample size x read coverage / individual = constant
What is the best sample size?
Not easy to answer based only on idealized theoretical considerations
Simulation studies must model many effects to be realistic
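The constant-budget trade-off can be made concrete with a minimal sketch. The total of 1600 is an assumption, inferred from the 100 individuals @ 16X design that appears in the simulation slides; the function name `designs` is hypothetical.

```python
# Assumed sequencing budget: total fold-coverage summed over all individuals
# (100 individuals x 16X, matching one of the simulated designs).
TOTAL_COVERAGE = 1600

def designs(total_coverage, sample_sizes):
    """Return (sample size, per-individual coverage) pairs at a fixed budget."""
    return [(n, total_coverage / n) for n in sample_sizes]

for n, cov in designs(TOTAL_COVERAGE, [100, 200, 400, 800]):
    print(f"{n} individuals @ {cov:g}X")
```

Doubling the sample size halves the coverage per individual, so the question is which point on this curve gives the best variant calls.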
Variant discovery is a complex process
[Figure: alleles in the population (diploid genotypes G1, G2, G3, carrying aacgtCaggct or aacgtTaggct) are sampled as fragments, which are then sequenced as reads. Each stage has its own model component: genotype priors (population), allele sampling likelihoods (sampled fragments), base error probabilities (seq. reads).]
Bayesian variant detection math
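$$
\Pr(G_1,\ldots,G_n \mid B_1,\ldots,B_n) =
\frac{\left[\prod_{i=1}^{n} \sum_{T_i^{k}} \Pr(B_i \mid T_i^{k})\,\Pr(T_i^{k} \mid G_i)\right] \Pr(G_1,\ldots,G_n)}
{\sum_{G_1^{l_1},\ldots,G_n^{l_n}} \left[\prod_{i=1}^{n} \sum_{T_i^{k}} \Pr(B_i \mid T_i^{k})\,\Pr(T_i^{k} \mid G_i^{l_i})\right] \Pr(G_1^{l_1},\ldots,G_n^{l_n})}
$$
where, for individual i, B_i are the called bases, T_i the sampled template fragments, and G_i the diploid genotype.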
Priors: (1) Nucleotide diversity; (2) Allele frequency distribution; (3) Specific diploid genotype layout
Allele sampling likelihoods: Binomial distribution of the number of reads from each of the two chromosomes
Base error probabilities: Likelihood that the called base faithfully represents the DNA fragment, calculated from the base quality values
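A minimal single-individual sketch of how the three components combine. This is not the GigaBayes implementation: the function names, the Phred error model with errors split evenly over the three other bases, and the prior values are all assumptions chosen for illustration.

```python
def base_likelihood(base, allele, qual):
    """P(called base | true allele), from a Phred base quality value.
    Assumption: an error is equally likely to yield any of the 3 other bases."""
    err = 10 ** (-qual / 10)
    return 1 - err if base == allele else err / 3

def genotype_likelihood(reads, genotype):
    """P(reads | genotype). Each read samples either chromosome with
    probability 1/2, which yields the binomial allele sampling term."""
    lik = 1.0
    for base, qual in reads:
        lik *= sum(0.5 * base_likelihood(base, allele, qual) for allele in genotype)
    return lik

def genotype_posterior(reads, priors):
    """Posterior over genotypes: prior x likelihood, normalized."""
    joint = {g: priors[g] * genotype_likelihood(reads, g) for g in priors}
    total = sum(joint.values())
    return {g: p / total for g, p in joint.items()}

# Hypothetical genotype priors for a site with reference allele C.
priors = {"CC": 0.9985, "CT": 0.001, "TT": 0.0005}
# Reads as (called base, base quality) pairs.
reads = [("C", 30), ("T", 30), ("C", 20), ("T", 25)]
posterior = genotype_posterior(reads, priors)
print(max(posterior, key=posterior.get))
```

With two high-quality reads for each allele, the heterozygous genotype wins despite its small prior.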
SNP calling and genotyping
P(SNP) = total probability of all non-monomorphic genotype combinations
P(Gi) = marginal probability
Consequence: data from other individuals influence the genotype call of a given individual (illustration: the testProb program in the GigaBayes package).
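The effect of other individuals on a single genotype call can be sketched by brute-force enumeration of all genotype combinations for a few samples. This is an illustrative toy, not the testProb program: the joint prior (monomorphic with probability 1 - theta, otherwise Hardy-Weinberg genotypes at a frequency drawn from a small grid), the error model, and all names are assumptions.

```python
import itertools

GENOTYPES = ("CC", "CT", "TT")  # reference allele C, alternate allele T

def read_likelihood(reads, genotype):
    """P(reads | genotype) with Phred-based errors and 1/2 allele sampling."""
    lik = 1.0
    for base, qual in reads:
        err = 10 ** (-qual / 10)
        lik *= sum(0.5 * ((1 - err) if base == a else err / 3) for a in genotype)
    return lik

def joint_prior(combo, theta=0.001, freqs=(0.1, 0.3, 0.5)):
    """Hypothetical joint prior over the genotypes of all individuals."""
    mono = 1.0 if all(g == "CC" for g in combo) else 0.0
    poly = 0.0
    for f in freqs:
        hwe = {"CC": (1 - f) ** 2, "CT": 2 * f * (1 - f), "TT": f ** 2}
        p = 1.0
        for g in combo:
            p *= hwe[g]
        poly += p / len(freqs)
    return (1 - theta) * mono + theta * poly

def call_site(reads_per_individual):
    """Return P(SNP) and the marginal genotype posterior of each individual."""
    n = len(reads_per_individual)
    combos = list(itertools.product(GENOTYPES, repeat=n))
    weights = []
    for combo in combos:
        w = joint_prior(combo)
        for reads, g in zip(reads_per_individual, combo):
            w *= read_likelihood(reads, g)
        weights.append(w)
    total = sum(weights)
    post = [w / total for w in weights]
    # A combination is non-monomorphic if it contains more than one allele.
    p_snp = sum(p for c, p in zip(combos, post) if len(set("".join(c))) > 1)
    marginals = [
        {g: sum(p for c, p in zip(combos, post) if c[i] == g) for g in GENOTYPES}
        for i in range(n)
    ]
    return p_snp, marginals

weak = [("C", 30), ("T", 10)]                          # one low-quality T read
strong = [("C", 30), ("T", 30), ("C", 30), ("T", 30)]  # clear heterozygote
p_alone, m_alone = call_site([weak])
p_joint, m_joint = call_site([weak, strong, strong])
print(f"P(CT) alone: {m_alone[0]['CT']:.4f}  with others: {m_joint[0]['CT']:.4f}")
```

Under these assumptions, the same weak evidence yields a much higher heterozygote probability once other individuals clearly carry the T allele, because the joint prior then favors a polymorphic site. Enumeration costs 3^n combinations, so this approach is only practical for a handful of samples.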
Variant calling in simulated data: design
Analysis by Aaron Quinlan (see poster at the Genome Meeting)
Estimated vs. population allele frequency
[Figure: scatter plots of observed vs. expected minor allele frequency (both axes 0 to 0.5), one panel per design]
100 Inds @ 16X: corr = 0.89
200 Inds @ 8X: corr = 0.92
400 Inds @ 4X: corr = 0.93
800 Inds @ 4X: corr = 0.91
Allele frequency (cont’d)
[Figure: observed vs. expected minor allele frequency, 800 Inds @ 4X, corr = 0.91 (axes 0 to 0.5), with a second panel whose axis runs from 0.005 to 0.05]
SNP discovery sensitivity
Genotype density
[Figure: fraction of confident genotypes (0 to 1) vs. SNP position (0 to 50,000), with a second panel on an x-axis of 0 to 5 x 10^4 and a y-axis of 0 to 1600]
Fraction of confident genotypes by design:
100 @ 16x: 0.975 +/- 0.121
200 @ 8x: 0.968 +/- 0.129
400 @ 4x: 0.924 +/- 0.151
800 @ 2x: 0.769 +/- 0.154
Genotype density
[Figure: histograms of number of SNPs vs. fraction of samples w/ confident genotype (0 to 1), one panel per design: 100 Inds @ 16X, 200 Inds @ 8X, 400 Inds @ 4X, 800 Inds @ 2X]
Summary / Conclusions
Thanks