10/29/2016
Modeling Motifs – Collecting Data (Measuring and Modeling Specificity of Protein-DNA Interactions)
Computational Genomics Course, Cold Spring Harbor Labs, Oct 31, 2016
Gary D. Stormo, Department of Genetics
Outline
• Modeling specificity with a position weight matrix (PWM)
– General features
– Limitations and extensions
• How to set the weights
– General ideas, some history
– Using high-throughput experimental data
– Using in vivo location data (ChIP-chip/seq)
Terminology: Sites vs Motifs
Think restriction sites
Enzyme   {Sites}                             Motif
EcoRI    {GAATTC}                            GAATTC
HincII   {GTTAAC, GTTGAC, GTCAAC, GTCGAC}    GTYRAC
Transcription factor motifs should be quantitative: they give different scores to different sites, reflecting differences in binding affinity
Also: a site is a specific location in the genome
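The sites-versus-motif distinction for restriction enzymes can be checked by expanding a motif into its set of sites. This is a minimal sketch; the function name `expand_motif` and the reduced IUPAC table are illustrative assumptions, not part of the lecture.

```python
from itertools import product

# Subset of the standard IUPAC nucleotide codes (R = purine, Y = pyrimidine).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "N": "ACGT",
}

def expand_motif(motif):
    """Return the set of concrete sites matched by an IUPAC motif string."""
    choices = (IUPAC[ch] for ch in motif)
    return {"".join(bases) for bases in product(*choices)}

ecori = expand_motif("GAATTC")    # one site
hincii = expand_motif("GTYRAC")   # four sites
print(sorted(ecori))
print(sorted(hincii))
```

A motif of this kind is all-or-none: every listed site is treated as equivalent, which is exactly what a quantitative transcription factor motif is meant to avoid.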
Representations/Models of Protein-DNA binding
• Transcription factors don’t bind to just one sequence
• A “Consensus sequence” is usually the preferred site, but similar sequences also bind well
• Not all variants bind equally well; some positions contribute more to the specificity than others
Position Weight Matrix Model (PWM, also PSSM)
A: -8 10 -1 2 1 -8
C: -10 -9 -3 -2 -1 -12
G: -7 -9 -1 -1 -4 -9
T: 10 -6 9 0 -1 11
PWM Model: scoring a sequence
….A C T A T A A T G T …
Score = -24 (matrix aligned to CTATAA)
….A C T A T A A T G T …
Score = 43 (matrix aligned to TATAAT)
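The two scores above come from sliding the matrix along the sequence and summing one weight per position. A minimal sketch using the matrix from the slide; the helper names `score_window` and `scan` are assumptions for illustration.

```python
# PWM from the slide: one row per base, one column per motif position.
PWM = {
    "A": [-8, 10, -1, 2, 1, -8],
    "C": [-10, -9, -3, -2, -1, -12],
    "G": [-7, -9, -1, -1, -4, -9],
    "T": [10, -6, 9, 0, -1, 11],
}

def score_window(pwm, window):
    """Sum the weight of the observed base at each position of the window."""
    return sum(pwm[base][pos] for pos, base in enumerate(window))

def scan(pwm, seq):
    """Score every window of motif width along one strand of seq."""
    width = len(next(iter(pwm.values())))
    return [(seq[i:i + width], score_window(pwm, seq[i:i + width]))
            for i in range(len(seq) - width + 1)]

hits = scan(PWM, "ACTATAATGT")
for window, score in hits:
    print(window, score)
# ACTATA -15
# CTATAA -24
# TATAAT 43
# ATAATG -23
# TAATGT 26
```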
PWM Model
PWM is a generalization of consensus sequence. There is NO advantage in consensus sequences. Given a consensus sequence one can define a PWM and a threshold that will return the same sites.
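One way to see that a consensus sequence is a special case of a PWM: give weight 1 to each allowed base and 0 to the rest, and set the threshold to the motif length. The sketch below uses the HincII motif GTYRAC as the example; the 0/1 weighting and the function names are illustrative assumptions.

```python
from itertools import product

IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "R": "AG", "Y": "CT", "N": "ACGT"}

def consensus_to_pwm(consensus):
    """Weight 1 for bases the consensus allows, 0 otherwise."""
    pwm = [{b: (1 if b in IUPAC[ch] else 0) for b in "ACGT"}
           for ch in consensus]
    threshold = len(consensus)  # every position must match
    return pwm, threshold

def pwm_sites(pwm, threshold):
    """Enumerate all sequences of motif length scoring at or above threshold."""
    return {"".join(s) for s in product("ACGT", repeat=len(pwm))
            if sum(col[b] for col, b in zip(pwm, s)) >= threshold}

pwm, threshold = consensus_to_pwm("GTYRAC")
sites = pwm_sites(pwm, threshold)
print(sorted(sites))
```

Lowering the threshold by one reproduces a "consensus with one mismatch" search, so the PWM covers that case as well.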
PWM Model
$\mathrm{Score}(S_i) = W \cdot S_i$
PWM is a linear model:
• $S_i$ encodes the sequence (which base occurs at each position)
• W weights those encoded features to provide the score
• Easy to add more features if they are necessary
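The linear-model view can be made concrete by encoding a site as a 0/1 vector and taking a dot product with the flattened weights. A minimal sketch using the matrix from the earlier slides; the names `encode` and `dot` are assumptions.

```python
BASES = "ACGT"
W_ROWS = {
    "A": [-8, 10, -1, 2, 1, -8],
    "C": [-10, -9, -3, -2, -1, -12],
    "G": [-7, -9, -1, -1, -4, -9],
    "T": [10, -6, 9, 0, -1, 11],
}

# Flatten W position by position: 4 weights (A, C, G, T) per position.
W = [W_ROWS[b][pos] for pos in range(6) for b in BASES]

def encode(site):
    """One-hot encode a site: 4 indicator features per position."""
    vec = []
    for base in site:
        vec.extend(1 if base == b else 0 for b in BASES)
    return vec

def dot(w, s):
    return sum(wi * si for wi, si in zip(w, s))

score = dot(W, encode("TATAAT"))
print(score)  # 43
```

Adding features, for example 16 indicators per adjacent base pair, only lengthens both vectors; the scoring rule stays a dot product.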
Two important issues need to be addressed
• Parameter estimation: Where do the matrix elements come from? Different types of data lead to different methods of parameter estimation.
• Additivity: do the positions really contribute independently to the binding interaction? If not, how do we extend the model?
Complete binding energy list vs model.
[Figure: tables of binding energies comparing the complete list of sequence energies with the corresponding matrix model]
If the simple additive model is inadequate, one can use di-nucleotide or higher-order models. Some form of a matrix model must be correct because the binding data itself is a matrix (vector).
Alternative approach to higher-order contributions: structure parameters
Maybe the non-additivity is due to structural preferences known to be dependent on nearest neighbor bases (or longer)
May capture context effects with fewer parameters
For example, see work by Rohs and colleagues:
- Covariation between homeodomain transcription factors and the shape of their DNA binding sites. Nucleic Acids Res. 2014 42:430-41.
- TFBSshape: a motif database for DNA shape features of transcription factor binding sites. Nucleic Acids Res. 2014 42(Database issue):D148-55.
- Quantitative modeling of transcription factor binding specificities using DNA shape. Proc Natl Acad Sci U S A. 2015 112(15):4654-9.
- DNA Shape Features Improve Transcription Factor Binding Site Predictions In Vivo. Cell Syst. 2016 Sep 28;3(3):278-286.
How to Set the Matrix Elements
• Statistical treatment of known sites. Need a reasonable sample size. Some assumptions about how the sample is obtained.
- probabilistic model is easy, can be accurate if assumptions are reasonable
• Quantitative binding data: determine matrix parameters that provide the best fit.
- Has been laborious and slow experimental work, but new technologies make this much easier
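Modeling based on known sites
N(b,i): number of times base b occurs at position i in the aligned sites
F(b,i): frequency of base b at position i (PFM or PPM. Note: some papers and programs call this a PWM)
W(b,i) = log[F(b,i)/P(b)] (log-odds PWM)
I(i) = ∑ F(b,i)W(b,i), summed over bases b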
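The chain N → F → W → I can be sketched in a few lines. The aligned sites below are hypothetical, the background P(b) is assumed uniform, and log base 2 is an assumption that gives information content in bits; the slide does not fix the base of the logarithm.

```python
import math

BASES = "ACGT"

def count_matrix(sites):
    """N(b,i): count of base b at position i over aligned sites."""
    return [{b: sum(site[i] == b for site in sites) for b in BASES}
            for i in range(len(sites[0]))]

def frequency_matrix(counts, pseudocount=0.0):
    """F(b,i): counts normalized per position, with optional pseudocounts."""
    freqs = []
    for col in counts:
        total = sum(col.values()) + 4 * pseudocount
        freqs.append({b: (col[b] + pseudocount) / total for b in BASES})
    return freqs

def log_odds_matrix(freqs, background):
    """W(b,i) = log2[F(b,i)/P(b)]; -inf where F(b,i) is zero."""
    return [{b: (math.log2(col[b] / background[b]) if col[b] > 0
                 else float("-inf")) for b in BASES} for col in freqs]

def information_content(freqs, background):
    """I(i) = sum over b of F(b,i) * W(b,i), taking 0 * log 0 = 0."""
    return [sum(col[b] * math.log2(col[b] / background[b])
                for b in BASES if col[b] > 0) for col in freqs]

sites = ["TATAAT", "TATGAT", "TACAAT", "TATACT"]  # hypothetical aligned sites
background = {b: 0.25 for b in BASES}             # assumed uniform P(b)
N = count_matrix(sites)
F = frequency_matrix(N)
W = log_odds_matrix(F, background)
I = information_content(F, background)
print([round(x, 2) for x in I])
```

With so few sites, many entries of F are zero and W is -inf there; a nonzero `pseudocount` avoids that, at the cost of slightly lower information content.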
Classic Logo (from Tom Schneider): Height of column at each position is Information Content. Each base is drawn in proportion to its frequency.
Likelihood Ratio Statistics Primer
Given two probability distributions Pi and Qi
∑ Pi = ∑ Qi = 1
And some data, Di, which is the number of times each type i is observed in N total observations
The Likelihood Ratio of the data being from distribution Qi versus Pi is:
LR = ∏ (Qi/Pi)^Di
And the log-Likelihood Ratio is
LLR = ∑ Di ln (Qi/Pi)
LLR = ∑ Di ln (Qi/Pi)
Maximum likelihood distribution is Qi = Di/N
So max LLR = N ∑ Qi ln (Qi/Pi)
∑ Qi ln (Qi/Pi) ≥ 0
≡ Information Content, Relative Entropy, Kullback-Leibler Distance
Related to G-statistic and χ²
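The identities on this slide can be checked numerically. A minimal sketch with hypothetical counts and a uniform Pi; the G-statistic is twice the maximum LLR.

```python
import math

def llr(counts, q, p):
    """LLR = sum_i D_i ln(Q_i / P_i); terms with D_i = 0 contribute nothing."""
    return sum(d * math.log(qi / pi)
               for d, qi, pi in zip(counts, q, p) if d > 0)

def relative_entropy(q, p):
    """sum_i Q_i ln(Q_i / P_i), the Kullback-Leibler distance of Q from P."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

counts = [12, 2, 1, 5]             # hypothetical observations D_i
p = [0.25, 0.25, 0.25, 0.25]       # assumed reference distribution P_i
n = sum(counts)
q_ml = [d / n for d in counts]     # maximum likelihood Q_i = D_i / N

max_llr = llr(counts, q_ml, p)
g_statistic = 2 * max_llr
print(max_llr, n * relative_entropy(q_ml, p), g_statistic)
```

Any other choice of Qi gives an LLR no larger than `max_llr`, which is why the maximum likelihood estimate is the natural choice for the frequency matrix.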
Modeling from experimental data
• From single binding site experiments to high-throughput methods that allow for the determination of specificity (relative affinity) across all possible sequences at once
Quantitative Binding: affinity of TF for one sequence
Specificity: refers to the relative affinity to different sequences, ideally to all sequences
Specificity Modeling
High-throughput experimental methods to measure TF specificity
High-throughput in vitro binding site analyses
• Can give good, quantitative models of intrinsic binding specificity
• More data alone isn’t sufficient to give better models, also need good analysis methods
• Log-odds method is based on assumptions that may not be true
• Energetic models can give better descriptions
– Non-linear relationship between binding affinity and binding probability at high TF concentration
Log-odds method is equivalent to an energy model if the sites are from a Boltzmann distribution with binding probability ∝ $e^{-E}$
$F(S_i) = P(S_i)\, e^{-E_i} / Z$ (F = posterior, P = prior)
$E_i = -\ln \dfrac{F(S_i)}{P(S_i)}$ (energy, up to the additive constant $-\ln Z$)
Log-odds relationship between binding energy and frequencies
Reality is a Fermi-Dirac distribution with Boltzmann a special case at the low concentration range
Djordjevic et al, Genome Res. 2003 13:2381-90.
Additive changes in binding energy have non-independent (context dependent) effects on binding probability
Probabilities no longer factor, even though energies are additive
EG-EA=2kT
GTGGA vs ATGGA
GTGTA vs ATGTA
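The context dependence can be illustrated numerically. In this sketch the 2 kT penalty for G versus A at the first position is taken from the slide; the 3 kT penalty for T versus G at the fourth position and the two values of μ are assumptions chosen only for illustration.

```python
import math

def p_bound(energy, mu):
    """Fermi-Dirac binding probability; energy and mu in units of kT."""
    return 1.0 / (1.0 + math.exp(energy - mu))

# Additive energies in kT. G at position 1 costs 2 kT (from the slide);
# T at position 4 costs 3 kT (hypothetical).
E = {"ATGGA": 0.0, "GTGGA": 2.0, "ATGTA": 3.0, "GTGTA": 5.0}

def a_over_g(mu, a_site, g_site):
    """Ratio of binding probabilities for the A variant over the G variant."""
    return p_bound(E[a_site], mu) / p_bound(E[g_site], mu)

low_mu, high_mu = -10.0, 3.0   # low vs. high TF concentration (assumed)
for mu in (low_mu, high_mu):
    r_gg = a_over_g(mu, "ATGGA", "GTGGA")
    r_gt = a_over_g(mu, "ATGTA", "GTGTA")
    print(mu, round(r_gg, 3), round(r_gt, 3))
```

At low μ both ratios approach e² ≈ 7.39, the Boltzmann limit, so probabilities factor by position. At high μ the same 2 kT change gives a ratio near 1.3 in the strong-site context and near 4.2 in the weak-site context.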
HT-SELEX (SELEX-Seq)
$\min \sum_i \left[ N(S_i) - n(S_i) \right]^2$
$N(S_i) = \dfrac{a}{1 + e^{E_i - \mu}} + b$, $n(S_i)$ = data
$E_i = W \cdot S_i$
Parameters to fit: a, b, W, μ
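The objective being minimized can be written directly from these definitions. This is a sketch of the loss only, on a hypothetical two-position energy matrix with synthetic data; it is not the BEEML implementation, and a real fit would hand `loss` to a numerical optimizer over a, b, W and μ.

```python
import math
from itertools import product

def energy(W, site):
    """E_i = W . S_i: sum of per-position energy contributions."""
    return sum(W[pos][base] for pos, base in enumerate(site))

def predicted(site, a, b, W, mu):
    """N(S_i) = a / (1 + exp(E_i - mu)) + b."""
    return a / (1.0 + math.exp(energy(W, site) - mu)) + b

def loss(data, a, b, W, mu):
    """Sum of squared differences between predicted and observed values."""
    return sum((predicted(s, a, b, W, mu) - n) ** 2 for s, n in data.items())

# Hypothetical energy matrix (kT) for a 2-bp site and "true" parameters.
W_true = [{"A": 0.0, "C": 1.5, "G": 2.0, "T": 0.5},
          {"A": 1.0, "C": 0.0, "G": 2.5, "T": 1.2}]
a_true, b_true, mu_true = 1000.0, 5.0, 1.0

# Synthetic noise-free data n(S_i) generated from the model itself.
data = {"".join(s): predicted("".join(s), a_true, b_true, W_true, mu_true)
        for s in product("ACGT", repeat=2)}

print(loss(data, a_true, b_true, W_true, mu_true))        # 0.0
print(loss(data, a_true, b_true, W_true, mu_true + 0.5))  # > 0
```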
Fit of model to HT-SELEX data for zif268: BEEML vs BioProspector
Zhao et al, PLoS Comp Bio, 2009
Protein Binding Microarray (PBM)
Example of Plag1 using BEEML-PBM
Nat Biotechnol. 2013 31:126-34.
• Most TFs (~90%) fit well by PWMs
• BEEML-PBM among the best methods
• Some do better with di-nuc models
• A few require multiple modes of interaction
• Best models fit in vivo data as well as in vivo-derived models
Zhao and Stormo, Nature Biotechnol. 2011 29:480-483
Weirauch et al
Diverse sets: >100 TFs, ~20 TFs, ~240 TFs, >1000 TFs
Bacterial-1-Hybrid (B1H)
B1H on zif268 returns the expected model
Average Prediction Accuracy for ZFPs
http://stormo.wustl.edu/ZFModels/
HT-SELEX (SELEX-Seq)
$\dfrac{P(S_i \mid b)}{P(S_i)} \propto \dfrac{1}{1 + e^{E_i - \mu}}$
Compared to reference sequence with E = 0
$\dfrac{P(S_i \mid b) / P(S_i)}{P(S_{ref} \mid b) / P(S_{ref})} = \dfrac{1 + e^{-\mu}}{1 + e^{E_i - \mu}}$
Spec-seq (specificity by sequencing)
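$\dfrac{P(S_i \mid b)}{P(S_i \mid u)} = e^{\mu - E_i}$
Compared to reference sequence with E = 0
$\dfrac{P(S_{ref} \mid b) / P(S_{ref} \mid u)}{P(S_i \mid b) / P(S_i \mid u)} = e^{E_i} \;\rightarrow\; \ln \dfrac{P(S_{ref} \mid b) / P(S_{ref} \mid u)}{P(S_i \mid b) / P(S_i \mid u)} = E_i$
$P + S_i \leftrightarrow P \cdot S_i$
$K_A(S_i) = \dfrac{[P \cdot S_i]}{[P][S_i]}$
$K_A(S_1) : K_A(S_2) : \ldots : K_A(S_n) = \dfrac{[P \cdot S_1]}{[S_1]} : \dfrac{[P \cdot S_2]}{[S_2]} : \ldots : \dfrac{[P \cdot S_n]}{[S_n]}$
Spec-seq: Specificity by sequencing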
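In practice the ratios above are estimated from read counts in the bound and unbound fractions. A minimal sketch with hypothetical counts; the function name and the choice of reference sequence are assumptions, and energies are in units of kT.

```python
import math

def relative_energies(bound, unbound, ref):
    """E_i = ln[(b_ref/u_ref) / (b_i/u_i)], so the reference has E = 0."""
    ref_ratio = bound[ref] / unbound[ref]
    return {s: math.log(ref_ratio / (bound[s] / unbound[s])) for s in bound}

# Hypothetical read counts from the bound and unbound fractions.
bound = {"GTGGA": 1000, "ATGGA": 500, "GTGTA": 100}
unbound = {"GTGGA": 1000, "ATGGA": 1000, "GTGTA": 1000}

energies = relative_energies(bound, unbound, ref="GTGGA")
relative_affinity = {s: math.exp(-e) for s, e in energies.items()}
for s in energies:
    print(s, round(energies[s], 3), round(relative_affinity[s], 3))
```

Because only ratios of counts enter, the free protein concentration cancels and no absolute affinity measurement is needed.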
Specificity of the Lac repressor
WT operator is asymmetric
4 libraries: vary both sequence and spacing
2560 different binding sites
Highly reproducible: ~5% variance in affinity, ~0.1kT variance in energy
Zuo and Stormo, Genetics, 2014
Three-dimensional structure of the dimeric lac HP62–O1 operator complex. Kalodimos C G et al. EMBO J. 2002;21:2866-2876. ©2002 by European Molecular Biology Organization
No motif for half of all human TFs – Most are C2H2 zinc finger proteins
Laura Campitelli, Matt Weirauch
Human – all TFs (1,665): Known motif (637); No motif (809); Close ortholog/paralog has motif (219)
Human – no motif (809): C2H2 with no motif (573); Possibly not sequence-specific (143); Needs heterodimerization partner (56); Not tried/no data (37)
Human – all C2H2s (714): Known motif (97); No motif (573); Close ortholog/paralog has motif (44)
ZF specificity prediction
Use three programs: ours, one from Princeton group, one from Toronto group
The Logos look pretty different but that is largely quantitative, and there are many high IC positions of agreement. By averaging the PFMs one can obtain a consensus sequence that agrees pretty well with all three.
Reverse Consensus (30bp)
TCTTGATGATGCTGCAATATTAATAATTTA
Spec-seq randomizations:
Consensus is “good enough” to show shift in EMSA.
So we randomized five adjacent positions at a time, generating 6 libraries of 1024 sequences. Merged Logo shows overall good match with consensus and provides quantitative predictions about binding energy contributions.
Spec-seq motif matches well with motif obtained from in vivo recombination hotspots and using Affinity-seq method
Affinity-seq pulls out genomic DNA fragments in vitro and sequences them without amplification.
Affinity-seq motif
Hotspot motif
Can also be easily adapted to study CpG methylation sensitivity
ZFP57 involved in Imprinting Maintenance
Has 2 ZF clusters, one binds TGCCGC, prefers mCpG
3 libraries with random regions and methylation variants
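$K_X(x_1) : K_X(x_2) : \ldots : K_X(x_n) = \dfrac{N(x_1 \mid B_{X,-})}{N(x_1 \mid B_{-,-})} : \dfrac{N(x_2 \mid B_{X,-})}{N(x_2 \mid B_{-,-})} : \ldots : \dfrac{N(x_n \mid B_{X,-})}{N(x_n \mid B_{-,-})}$
$K_{X|Y}(x_1) : K_{X|Y}(x_2) : \ldots : K_{X|Y}(x_n) = \dfrac{N(x_1 \mid B_{X,Y})}{N(x_1 \mid B_{-,Y})} : \dfrac{N(x_2 \mid B_{X,Y})}{N(x_2 \mid B_{-,Y})} : \ldots : \dfrac{N(x_n \mid B_{X,Y})}{N(x_n \mid B_{-,Y})}$
$\omega_i = \dfrac{K_{X|Y}(S_i)}{K_X(S_i)} = \dfrac{K_{Y|X}(S_i)}{K_Y(S_i)} = \dfrac{K_{X,Y}(S_i)}{K_X(S_i)\, K_Y(S_i)}$
Spec-seq for combinatorial binding can get all of the important parameters in one experiment, including cooperativity
Stormo, Zuo, Chang, Briefings in Functional Genomics, 2015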
Specificity Modeling
Conclusions:
1. Different types of high-throughput data can be used to obtain good specificity models; good analysis methods are critical
2. PWMs are often (usually?) good approximations, but higher order models can be obtained if needed
Discovery of Binding Motifs from in vivo data
Datatypes for Motif Discovery
• Co-regulated genes
– Genetic studies (deletion, over-expression effects)
– Expression analysis (microarrays, RNA-Seq)
• Co-bound regions
– ChIP-chip/-Seq location analysis
• Phylogenetic analysis, conservation across species
– “phylogenetic footprinting”
– Can be combined with multigene analysis, even over the whole genome
Goal: Find the “most significant” pattern in common
• Can’t look at all possible alignments – too many
• In vitro analysis methods don’t work; assumptions not valid
Outline of problem
CE1CG
\TAATGTTTGTGCTGGTTTTTGTGGCATCGGGCGAGAATAGCGCGTGGTGTGAAAGACTGTTTTTTTGATCGTTTTCACAAAAATGGAAGTCCACAGTCTTGACAG\
ECOARABOP
\GACAAAAACGCGTAACAAAAGTGTCTATAATCACGGCAGAAAAGTCCACATTGATTATTTGCACGGCGTCACACTTTGCTATGCCATAGCATTTTTATCCATAAG\
ECOBGLR1
\ACAAATCCCAATAACTTAATTATTGGGATTTGTTATATATAACTTTATAAATTCCTAAAATTACACAAAGTTAATAACTGTGAGCATGGTCATATTTTTATCAAT\
ECOCRP
\CACAAAGCGAAAGCTATGCTAAAACAGTCAGGATGCTACAGTAATACATTGATGTACTGCATGTATGCAAAGGACGTCACATTACCGTGCAGTACAGTTGATAGC\
ECOCYA
\ACGGTGCTACACTTGTATGTAGCGCATCTTTCTTTACGGTCAATCAGCAAGGTGTTAAATTGATCACGTTTTAGACCATTTTTTCGTCGTGAAACTAAAAAAACC\
ECODEOP2
\AGTGAATTATTTGAACCAGATCGCATTACAGTGATGCAAACTTGTAAGTAGATTTCCTTAATTGTGATGTGTATCGAAGTGTGTTGCGGAGTAGATGTTAGAATA\
ECOGALE
\GCGCATAAAAAACGGCTAAATTCTTGTGTAAACGATTCCACTAATTTATTCCATGTCACACTTTTCGCATCTTTGTTATGCTATGGTTATTTCATACCATAAGCC\
ECOILVBPR
\GCTCCGGCGGGGTTTTTTGTTATCTGCAATTCAGTACAAAACGTGATCAACCCCTCAATTTTCCCTTTGCTGAAAAATTTTCCATTGTCTCCCCTGTAAAGCTGT\
ECOLAC
\AACGCAATTAATGTGAGTTAGCTCACTCATTAGGCACCCCAGGCTTTACACTTTATGCTTCCGGCTCGTATGTTGTGTGGAATTGTGAGCGGATAACAATTTCAC\
ECOMALBA
\ACATTACCGCCAATTCTGTAACAGAGATCACACAAAGCGACGGTGGGGCGTAGGGGCAAGGAGGATGGAAAGAGGTTGCCGTATAAAGAAACTAGAGTCCGTTTA\
ECOMALBA
\GGAGGAGGCGGGAGGATGAGAACACGGCTTCTGTGAACTAAACCGAGGTCATGTAAGGAATTTCGTGATGTTGCTTGCAAAAATCGTGGCGATTTTATGTGCGCA\
ECOMALT
\GATCAGCGTCGTTTTAGGTGAGTTGTTAATAAAGATTTGGAATTGTGACACAGTGCAAATTCAGACACATAAAAAAACGTCATCGCTTGCATTAGAAAGGTTTCT\
ECOOMPA
\GCTGACAAAAAAGATTAAACATACCTTATACAAGACTTTTTTTTCATATGCCTGACGGAGTTCACACTTGTAAGTTTTCAACTACGTTGTAGACTTTACATCGCC\
ECOTNAA
\TTTTTTAAACATTAAAATTCTTACGTAATTTATAATCTTTAAAAAAAGCATTTAATATTGCTCCCCGAACGATTGTGATTCGATTCACATTTAAACAATTTCAGA\
ECOUXU1
\CCCATGAGAGTGAAATTGTTGTGATGTGGTTAACCCAATTAGAATTCGGGATTGACATGTCTTACCAAAAGGTAGAACTTATACGCCATCTCATCCGATGCAAGC\
PBR322
\CTGGCTTAACTATGCGGCATCAGAGCAGATTGTACTGAGAGTGCACCATATGCGGTGTGAAATACCGCACAGATGCGTAAGGAGAAAATACCGCATCAGGCGCTC\
TRN9CAT
\CTGTGACGGAAGATCACTTCGCAGAATAAATAAATCCTGGTGTCCCTGTTGATACCGGGAAGCCCTGGGCCAACTTTTGGCGAAAATGAGACGTTGATCGGCACG\
TDC
\GATTTTTATACTTTAACTTGTTGATATTTAAAGGTATTTAATTGTAATAACGATACTCTGGAAAGTATTGAAAGTTAATTTGTGAGTGGTCGCACATATCCTGTT\
Example dataset: promoter region from co-regulated genes
Expectation Maximization (EM) Approach to Motif Discovery
Basic Idea:
- Given sites, estimate PWM (log-odds model)
- Given PWM, pick likely sites according to their probability
- Make initial guess, then iterate between those steps until convergence
Algorithm:
• Initial PWM from average of all possible sites
• Using current PWM estimate probability of each position being site; make new PWM from weighted average of all sites
• Iterate to convergence; usually fast, no guarantee of optimal
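The steps above can be sketched as follows. This is a simplified illustration, not a published implementation: it assumes exactly one site per sequence, a single strand, a uniform background, a fixed number of iterations instead of a convergence test, and hypothetical input sequences.

```python
import math

BASES = "ACGT"

def em_motif(seqs, width, iterations=50, pseudocount=0.5, background=0.25):
    """EM for one motif occurrence per sequence; returns (PPM, site weights)."""
    # Initial guess: every possible site in a sequence is equally likely.
    weights = [[1.0 / (len(s) - width + 1)] * (len(s) - width + 1)
               for s in seqs]
    ppm = []
    for _ in range(iterations):
        # M-step: PPM from the weighted average of all candidate sites.
        counts = [{b: pseudocount for b in BASES} for _ in range(width)]
        for seq, w in zip(seqs, weights):
            for start, weight in enumerate(w):
                for j in range(width):
                    counts[j][seq[start + j]] += weight
        ppm = [{b: col[b] / sum(col.values()) for b in BASES}
               for col in counts]
        # E-step: probability of each start position being the site,
        # proportional to its likelihood ratio against the background.
        weights = []
        for seq in seqs:
            odds = [math.prod(ppm[j][seq[start + j]] / background
                              for j in range(width))
                    for start in range(len(seq) - width + 1)]
            total = sum(odds)
            weights.append([o / total for o in odds])
    return ppm, weights

# Hypothetical sequences, each containing one copy of TATAAT.
seqs = ["GCGTATAATGCC", "TTATAATCGCGG", "CCGGCTATAATG", "ACGTTATAATCC"]
ppm, weights = em_motif(seqs, width=6)
for seq, w in zip(seqs, weights):
    best = max(range(len(w)), key=w.__getitem__)
    print(seq[best:best + 6], round(w[best], 3))
```

Because the result depends on the starting point, rerunning from different initial weights and comparing the resulting models is a common safeguard.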
Gibbs’ Sampling Approach to Motif Discovery
Same Basic Idea:
- Given sites, estimate PWM (log-odds model)
- Given PWM, pick likely sites according to their probability
- Iterate between those steps until convergence
Algorithm:
• Pick 1 site from N-1 sequences, make PWM
• Use “pseudocounts” to avoid prob. = 0
• Use current PWM to pick site from left out sequence by sampling from probability distribution; update PWM
• Iterate to convergence; run multiple times, compare results; still no guarantee of optimal but avoids local optima often obtained with EM
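A compact sketch of the sampling loop described above. It is an illustration under simplifying assumptions (one site per sequence, single strand, uniform background, fixed number of sweeps, hypothetical sequences) and omits refinements found in published samplers.

```python
import random

BASES = "ACGT"

def build_ppm(seqs, starts, width, skip, pseudocount=0.5):
    """PPM from the current site in every sequence except the one left out."""
    counts = [{b: pseudocount for b in BASES} for _ in range(width)]
    for idx, (seq, start) in enumerate(zip(seqs, starts)):
        if idx == skip:
            continue
        for j in range(width):
            counts[j][seq[start + j]] += 1
    return [{b: col[b] / sum(col.values()) for b in BASES} for col in counts]

def gibbs_motif(seqs, width, sweeps=200, background=0.25, seed=0):
    """Return one sampled site start per sequence after a fixed number of sweeps."""
    rng = random.Random(seed)
    starts = [rng.randrange(len(s) - width + 1) for s in seqs]
    for _ in range(sweeps):
        for left_out, seq in enumerate(seqs):
            ppm = build_ppm(seqs, starts, width, skip=left_out)
            # Likelihood ratio of each candidate site in the left-out sequence.
            odds = []
            for start in range(len(seq) - width + 1):
                ratio = 1.0
                for j in range(width):
                    ratio *= ppm[j][seq[start + j]] / background
                odds.append(ratio)
            # Sample the new site in proportion to its likelihood ratio.
            starts[left_out] = rng.choices(range(len(odds)), weights=odds)[0]
    return starts

# Hypothetical sequences, each containing one copy of TATAAT.
seqs = ["GCGTATAATGCC", "TTATAATCGCGG", "CCGGCTATAATG", "ACGTTATAATCC"]
starts = gibbs_motif(seqs, width=6)
print([s[i:i + 6] for s, i in zip(seqs, starts)])
```

Running with several values of `seed` and comparing the sampled sites corresponds to the "run multiple times, compare results" step.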
From Lawrence et al. (1993) Science 262:208-14.
Motif discovery from co-regulated genes
Single species Multiple species
Example – Leu3
Alignment of profiles
YGL125W
A . . 0 0 0 0 0 0 0 0 0 0 0 0 4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 . .
C . . 0 1 1 2 4 2 4 4 4 0 0 0 0 0 4 0 0 0 4 4 0 0 4 0 0 0 0 . .
G . . 1 0 0 0 0 0 0 0 0 4 4 0 0 4 0 4 4 4 0 0 3 0 0 4 0 0 0 . .
T . . 3 3 3 2 0 2 0 0 0 0 0 4 0 0 0 0 0 0 0 0 1 4 0 0 4 4 4 . .

YOR108W
A . . 0 0 4 2 0 0 0 0 0 0 0 0 4 4 0 0 0 1 0 0 0 3 1 0 0 3 4 . .
C . . 0 2 0 1 0 0 0 4 4 0 0 0 0 0 4 0 0 1 4 1 0 1 0 0 1 0 0 . .
G . . 0 0 0 0 4 4 0 0 0 4 4 0 0 0 0 4 4 0 0 2 0 0 3 1 3 1 0 . .
T . . 4 2 0 1 0 0 4 0 0 0 0 4 0 0 0 0 0 2 0 1 4 0 0 3 0 0 0 . .

YMR108W
A . . 0 2 1 1 0 1 0 0 0 0 0 0 4 0 0 0 0 0 0 0 1 2 3 0 0 1 1 . .
C . . 3 0 1 1 4 0 0 4 4 0 0 0 0 4 4 0 0 4 0 0 0 0 1 1 0 3 2 . .
G . . 1 1 2 0 0 0 4 0 0 4 4 0 0 0 0 4 4 0 0 0 3 2 0 0 1 0 1 . .
T . . 0 1 0 2 0 3 0 0 0 0 0 4 0 0 0 0 0 0 4 4 0 0 0 3 3 0 0 . .
YGL125W
S. cerevisiae    GAAAAAATAACAGCGACTTTTCTCCCGGTAGCGGGCCGTCGTTTAGTCATTCTATCCCTC
S. mikatae       AAAACATAACAGCGAATTTTCCTCCCGGTAGCGGGCCTTCGTTTAGTCATTCTCTCTCTT
S. bayanus       AAAAAATAACAGCGACTTTTCCCCCCGGTAGCGGGCCGTCGTTTAGTCATTCTCTCTCCC
S. kudriavzevii  GAAAAAAAACAACGGCGGCCTCCCCCGGTAGCGGGCCGTCGTTTAGTCATTCTCTCTCTC
***** **** ** *** * *************************************

YOR108W
S. cerevisiae    GCCATCATGGTCCGGTAACGGTCGTAGTGAATGACTCATATTTTTCCATCTCTTT
S. mikatae       GCCATCAAGGTCCGGTAACGGTCGTAGTGAATGACTCACATTTTCTTCGTTATTC
S. bayanus       ACCATTACGGTCCGGTAACGGACTTAGTGAATGATTCATCTTTTCTTCTTTTTTC
S. kudriavzevii  GTCGTTAAGGTCCGGTAACGGCCCTCAGCGAATGATTCATAATTTCATTTTTTTC
***** * ************* * ********** *** **** *** ***

YMR108W
S. cerevisiae    AACGCCTAGCCGCCGGAGCCTGCCGGTACCGGCTTGGCTTCAGTTGCTGATCTCGG
S. mikatae       CACAATGACACATACCTAACAGCCGGTACCGGCTTGAATGCCGCCGTTGGCTTCGG
S. bayanus       ATCTTCTAGTCACCGCAGTCTGCCGGTACCGGCTTGAATTCCGCCGTTGATCCTGG
S. kudriavzevii  CACATCTCTAGTCCGCGCTCTGCCGGTACCGGCTTAGACTAGCCACGAATCTCGGC
** *** * **** ***************** **** ** * ** **
Alignment of conserved regions
Wang and Stormo, Bioinformatics 2003
Even whole genome search for conserved, multi-copy elements (eg. PhyloNet)
Wang and Stormo, PNAS 2005