Characterizing the short tandem repeat mutation process at every locus in the genome
Melissa Gymrek, Genome Informatics 2015
@mgymrek
Genetic variation comes in many forms
SNP
ACGACTCGAGCG
ACGACACGAGCG
μSNP: 1.20 × 10^-8 /loc/gen
Short indel (1-20bp)
ACGACTCGAGCG
ACGAC-CGAGCG
μINDEL: 0.68 × 10^-9 /loc/gen
Short tandem repeat
CAGCAG---CAGCAGCA
CAGCAGCAGCAGCAGCA
μSTR: 10^-2 to 10^-5 /loc/gen
Alu retrotransposition
Struct. Var / CNV (>20bp)
[Bar chart, # de novo/gen: STR 500, Alu 0.05, SV 0.2, Indel 3, SNP 50]
Intro.
STR catalog PSMC Mutation process Conclusion
10/29/15 Melissa Gymrek, Genome Informatics 2015
eSTRs contribute to gene expression variability
[Figure: observed p-value [-log10] vs. expected p-value under the null [-log10]]
[Diagram labels: Gene, (TG) STR, Expression]
Why study the STR mutation process?
1. Identify rapidly mutating STRs
2. Understand biological processes driving mutation patterns
3. Identify STRs under selective pressure
Haasl and Payseur 2013
H0: Locus evolves under neutral model
H1: Locus is under selection
STRs and SNPs provide orthogonal molecular clocks
TIME (t)
Clock 1: SNPs
ACCCATCCTAGCTACCGACTACAACG
ACCGATCCTAGCTTCCGACTACCACG
# mismatches ~ f(μSNP, t, …)
Clock 2: STRs
ACACTCATCTG(CAG)mACACACTGA
ACACTCATCTG(CAG)nACACACTGA
(m-n)^2 ~ f(μSTR, t, …)
Use known value of μSNP to calibrate the STR molecular clock
μSNP: SNP mutation rate (/loc/gen)
μSTR: STR mutation rate (/loc/gen)
t: Time to the most recent common ancestor (TMRCA)
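As a rough illustration of the two clocks, the sketch below assumes the simplest possible forms for f, which the slide does not spell out: SNP mismatches between two haplotypes accumulate at 2·μSNP·t per site, and under a unit-step stepwise model with no length constraint the expected squared STR difference is 2·μSTR·t. The function names and the example numbers are hypothetical.

```python
MU_SNP = 1.2e-8  # SNP mutation rate per site per generation (value quoted earlier in the talk)

def tmrca_from_snps(n_mismatches, n_sites, mu_snp=MU_SNP):
    """Clock 1 (assumed form): mismatches ~ 2 * mu_snp * t * n_sites, solved for t."""
    if n_sites <= 0:
        raise ValueError("n_sites must be positive")
    return n_mismatches / (2.0 * mu_snp * n_sites)

def expected_squared_difference(mu_str, tmrca):
    """Clock 2 (assumed form): E[(m - n)^2] = 2 * mu_str * t for unit steps."""
    return 2.0 * mu_str * tmrca

# Hypothetical example: 24 mismatches over a 100 kb window.
t = tmrca_from_snps(24, 100_000)
print(f"TMRCA ~ {t:.0f} generations")
print(f"expected (m-n)^2 at mu_STR = 1e-3: {expected_squared_difference(1e-3, t):.1f}")
```

Once t is fixed by the SNP clock, the only free quantity left in the STR relation is μSTR, which is what makes the calibration possible.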
Estimating STR mutation parameters from WGS
300 high coverage SGDP whole genomes
STR calls: (CAG)m / (CAG)n at each diploid locus
SNPs -> PSMC (Li and Durbin 2011) -> TMRCA
STR calls + TMRCA -> Infer locus specific mutation params.
[Plots: (m-n)^2 vs. TMRCA; Frequency vs. Step size]
Learn model to predict mutation parameters from sequence features
We are now armed with deep WGS amenable to STR profiling
SGDP: 300 deeply sequenced, PCR-free genomes with diverse origins
Generating high quality STR genotypes
Sample 1, Sample 2, ..., Sample n
Per sample: FASTQ -> Alignment (BWA-MEM) -> BAM
All BAMs -> Allelotype (multi-sample, lobSTR) -> VCF -> Filtering
High coverage genomes provide accurate STR genotypes
Homopolymers (e.g. AAAAAA) (n=50,398)
R^2=0.92
93% concordance with capillary data
Accurately recover population structure
http://strcat.teamerlich.org/
Measuring TMRCA using PSMC
[Figure: discretized TMRCA (Li and Durbin, Nature 2011)]
Maternal chromosome: (CAG)m
Paternal chromosome: (CAG)n
Measure local TMRCA
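One way to picture "measure local TMRCA" is as an interval lookup: given a set of genomic segments, each with an inferred TMRCA, find the segment containing the STR. The sketch below is only that lookup. The segment table, its units, and the function name are hypothetical, and it does not reproduce PSMC's actual output format.

```python
from bisect import bisect_right

# Hypothetical decoded segments on one chromosome, sorted by start:
# (start, end, tmrca_in_generations), half-open intervals [start, end).
SEGMENTS = [
    (0, 50_000, 12_000.0),
    (50_000, 180_000, 3_500.0),
    (180_000, 400_000, 27_000.0),
]

def local_tmrca(position, segments=SEGMENTS):
    """Return the TMRCA of the segment containing `position`, or None if uncovered."""
    starts = [start for start, _, _ in segments]
    i = bisect_right(starts, position) - 1  # last segment starting at or before position
    if i < 0:
        return None
    _, end, tmrca = segments[i]
    return tmrca if position < end else None

# Hypothetical STR at position 60,000 on this chromosome.
print(local_tmrca(60_000))
```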
What we know about STR mutations
1. Mutate in “unit” lengths
2. Step size distribution ~Geometric
3. Length constraint biases mutation direction
4. Other important factors not modeled here
• Length-dependent mutation rate
• Motif sequence interruptions
• Large expansions behave differently (e.g. Huntington’s)
• Biased gene conversion?
• Interaction between alleles?
[Alignment examples]
CAGCAGCAGCAGCAGCAGCAGCAG
CAGCAGCAG---CAGCAGCAGCAG
CAGCAG------CAGCAGCAGCAG
CAGCAG---CAGCAGCAGCAGCAG
CAGCAGCA-CAGCAGCAGCAGCAG
[Figure labels (Sun et al. 2012): short alleles, longer, shorter, longer]
p: probability of mutating a single step
[Diagram labels: 3, 6; 4, 4; 4, 4]
Modeling STR mutation as a mean-centered random walk
Stepwise Mutation Model (SMM): mutate by +/- 1 copy of the repeat unit with probability μ
[Diagram: (CAG)MRCA diverges over time t into (CAG)m and (CAG)n; panel axes: m, n]
Observed (Sun et al. 2012)
Mean-centered random walk (Ornstein-Uhlenbeck):
[Diagram: panel axes: m, n]
μSTR: Mutation rate (per generation)
β: Length constraint (0 ≤ β ≤ 1)
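To make the difference between the unconstrained stepwise model and a mean-centered walk concrete, here is a small simulation sketch. The direction rule (probability of expanding = 0.5·(1 − β·x), clipped to [0, 1], where x is the allele length relative to the mean) is an assumption chosen for illustration, not the formula from the talk. The function names and parameter values are hypothetical.

```python
import random

def simulate_allele(generations, mu, beta, p, rng, start=0):
    """Simulate one lineage; returns allele length in repeat units relative to the mean (0)."""
    x = start
    for _ in range(generations):
        if rng.random() >= mu:
            continue  # no mutation this generation
        size = 1
        while rng.random() >= p:  # geometric step size: P(size = k) = p * (1 - p)^(k - 1)
            size += 1
        # Assumed length constraint: alleles above the mean are biased to contract.
        prob_up = min(1.0, max(0.0, 0.5 * (1.0 - beta * x)))
        x += size if rng.random() < prob_up else -size
    return x

def mean_squared_length(beta, replicates=300, seed=0):
    """Average x^2 after 2000 generations at mu = 0.05 with unit steps (p = 1)."""
    rng = random.Random(seed)
    finals = [simulate_allele(2000, 0.05, beta, 1.0, rng) for _ in range(replicates)]
    return sum(x * x for x in finals) / replicates

print("beta = 0.0:", mean_squared_length(0.0))
print("beta = 0.5:", mean_squared_length(0.5))
```

With β = 0 the spread keeps growing with time, as in the SMM. With β > 0 the walk is pulled back toward the mean and the spread saturates, which is the qualitative behavior an Ornstein-Uhlenbeck process captures.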
Estimating the step size distribution
[Diagram: allele length relative to the mean allele length, from -5 to +5]
[Figure: step size distributions; x-axis: Step size (# units), -4 to +4; y-axis: Frequency]
p: Probability that the step size is a single unit.
Tetranucleotides: p = ~0.95
Dinucleotides: p = ~0.7
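For a geometric step size distribution, p fully determines how often multi-unit steps occur: P(step = k units) = p·(1 − p)^(k − 1). The short sketch below evaluates this for the two approximate p values quoted above. The function name is hypothetical, and the sign of the step is ignored.

```python
def step_size_pmf(k, p):
    """P(|step| = k repeat units) under a geometric distribution with parameter p."""
    if k < 1:
        raise ValueError("step size must be at least one unit")
    return p * (1.0 - p) ** (k - 1)

for motif, p in (("tetranucleotide", 0.95), ("dinucleotide", 0.7)):
    probs = [step_size_pmf(k, p) for k in range(1, 5)]
    print(motif, [round(x, 4) for x in probs])
```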
Model validation using Y-STRs
Thomas Willems
Find maximum likelihood mutation parameters (1000 Genomes Project):
P(STR data | Y phylogeny, μ, β, σ)
Validation set: Ballantyne et al. (~2,000 father-son pairs)
[Scatter plot axes: Ballantyne et al. vs. lobSTR; r=0.831, N=64]
Estimating mutation parameters at autosomal loci
Individual 1: (CAG)5 / (CAG)5
Individual 2: (CAG)5 / (CAG)8
Individual k: (CAG)m / (CAG)n
[Plot: ASD (ticks 0, 4, 9, 16) vs. TMRCA]
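The per-locus idea can be sketched as a regression of each individual's squared allele difference on its local TMRCA. The example below is a deliberately simplified stand-in, not the estimator used in the talk: it assumes unit steps and no length constraint, so that E[(m − n)^2] = 2·μ·t, and fits the slope through the origin by least squares. The function names and the data are hypothetical.

```python
def squared_difference(m, n):
    """Squared difference in repeat count between the two alleles of one individual."""
    return (m - n) ** 2

def fit_str_rate(genotypes, tmrcas):
    """Least-squares slope through the origin of (m - n)^2 vs. TMRCA, halved to give mu."""
    asd = [squared_difference(m, n) for m, n in genotypes]
    numerator = sum(t * a for t, a in zip(tmrcas, asd))
    denominator = sum(t * t for t in tmrcas)
    if denominator == 0:
        raise ValueError("need at least one nonzero TMRCA")
    return numerator / denominator / 2.0

# Hypothetical genotypes (repeat counts) and local TMRCAs (generations) at one locus.
genotypes = [(5, 6), (5, 7), (5, 8)]
tmrcas = [500.0, 2000.0, 4500.0]
print(fit_str_rate(genotypes, tmrcas))
```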
Per-locus estimation of STR mutation parameters
Estimates for 120K multi-allelic STRs
STR mutation trends by motif length
Future directions: a genome-wide scan for STR selection
Features: Motif length, Recomb. rate, Total length, GC content
Linear model
Predict μ, β
Explain: 46% of variation in μ, 4.6% of variation in β
[Figure: Expected vs. Observed]
Develop genome-wide STR selection scan
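"Percent of variation explained" corresponds to the R^2 of the fitted linear model. The sketch below shows the calculation for a single feature only, as a simplification of the multi-feature model listed above. The data values are invented for illustration, and the function names are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def r_squared(xs, ys, slope, intercept):
    """Fraction of the variance in ys explained by the fitted line."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot

# Invented example: total STR length (bp) vs. log10 of the estimated mutation rate.
total_length = [20, 24, 30, 36, 44]
log10_mu = [-4.6, -4.1, -3.9, -3.2, -3.0]
slope, intercept = fit_line(total_length, log10_mu)
print(f"R^2 = {r_squared(total_length, log10_mu, slope, intercept):.2f}")
```

Loci whose observed parameters deviate strongly from such a prediction would be the natural candidates for the selection scan described above.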
Conclusion
The first genome-wide characterization of STR mutation
1. STR mutation model
2. Validation against published de novo mutation rates
3. Strong effect of local sequence features
4. Future work: improve estimation, genome-wide selection scan
An unexplored, important source of genetic variation
Acknowledgements
Yaniv Erlich, David Reich, Mark Daly, Nick Patterson, Swapan Mallick, Thomas Willems, Alon Goren