Promoter Recognition in silico - cs.helsinki.fi · Recognition – in silico ... of two steps...
Promoter and other TFBS Recognition – in silico
(mostly sequence based approach)
by
Udyant Kumar
LCE, HUT, ESPOO
Central Dogma
• Gene expression consists of two steps
Transcription:
DNA → mRNA
Events: DNA packing, DNA methylation, chromosome puffs, promoter and enhancer regions
Translation:
mRNA → Protein
Events: RNA processing, lifetime of mRNA, masked messengers, polypeptide cleaving, metabolic regulation
Transcription factors:
1) Promoters: A promoter is the region of DNA immediately upstream of the transcription start site, to which multiple transcription factors bind at specific sequence boxes to promote initiation of transcription.
2) Enhancers: Other DNA sequences, called enhancers, are often required for promoter activation. An enhancer must be linked to the gene (in cis) but can be found at any position with respect to the promoter.
Methods for predicting Transcription factor binding sites:
1) Sequence-based method: This relies on sequence information obtained from known binding sequences. Usually, consensus sequence patterns or weight matrices are used to scan the database.
2) ΔG-based method: This is based on experimental measurements of binding between protein and DNA. The binding-affinity data for systematic single-base mutations of a consensus binding site can be used to derive matrices similar to the weight matrices of the sequence-based method. It requires laborious experiments.
3) Structure-based method: This is based on the analysis of structural databases of protein-DNA complexes. Empirical potential functions for the specific interactions between bases and amino acids can be derived from the statistical analysis.
4) Ab initio method: This method does not rely on any experimental data; it uses computer simulations to derive contact potentials between bases and amino acids.
How transcription factors work:
a) For proteins to bind DNA specifically they must be basic and generally have protuberances that can fit into the major groove of the double helix to recognise the internal base sequence.
b) Special protein structures are utilized for DNA binding. Examples: helix-turn-helix; zinc (Zn2+) fingers. Also, the dimerization of transcription factors by leucine zipper and helix-loop-helix motifs generates DNA-binding structures.
c) Once a transcription factor binds to the promoter DNA sequence, a separate domain of the protein, termed the activation domain, may interact directly with Pol II or with another factor (adaptor), which in turn may interact with Pol II.
Experimental methods to find Transcription factors:
Nitrocellulose binding assay
Electrophoretic mobility shift assay (EMSA)
Enzyme-linked immunosorbent assay (ELISA)
DNA footprinting
DNA-protein crosslinking (DPC)
Reporter constructs
Chromatin immunoprecipitation (ChIP)
Systematic evolution of ligands by exponential enrichment (SELEX)
Phage display
X-ray crystallography
NMR spectroscopy
In-silico methods, why?
- Experimental methods are often time-consuming
- In-silico methods provide an extension to present in-vitro methods
Goal:
1) To predict potential binding sites of a known transcription factor
2) To discover a sequence motif as well as its putative sites in a collection of long intergenic sequences
Sequence-based methods: These fall into two categories.
1) Profile driven (alignment based): based on pairwise comparisons between input sequences to form common patterns corresponding to well-conserved functional binding sites. Approaches: local alignment (BLAST, PipMaker, BBA and DBA); global alignment (ClustalW, VISTA); and phylogenetic footprinting.
2) Pattern driven (consensus based): a set of real transcription factor binding sites is used to build a characteristic representation or profile.
Prominent algorithms are:
position weight matrix (PWM), Gibbs sampling, Expectation-Maximization (EM), multiple discriminant analysis (MDA), artificial neural network (ANN), fixed-order Markov model, hidden Markov model
PWM
• Definition: For a feature of length m using an alphabet of n characters, a PWM is an n by m matrix in which each element contains the frequency (score) at which a given member of the alphabet is observed at a given position in an aligned set of sequences containing the feature
• Three uses of PWM
– Describe a sequence feature
– Calculate probability of occurrence of feature in a random sequence
– Calculate degree of match between a new sequence and a feature
Block Diagram for Building a PSSM
[Figure: a set of aligned sequence features and the expected frequencies of each sequence element are the inputs to the PSSM builder, which outputs the PSSM.]
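A minimal sketch of what such a PSSM builder might look like in Python. The function name `build_pssm`, the toy aligned sites, the pseudocount and the log2-odds scoring are assumptions made for illustration; the slides do not specify them.

```python
import math

def build_pssm(aligned_sites, background, pseudocount=1.0):
    """Return (freqs, pssm) for equal-length aligned sites.

    freqs[i][b] is the smoothed frequency of base b at position i;
    pssm[i][b] is log2(freqs[i][b] / background[b]).
    """
    alphabet = sorted(background)
    length = len(aligned_sites[0])
    if any(len(site) != length for site in aligned_sites):
        raise ValueError("all aligned sites must have the same length")
    freqs, pssm = [], []
    for pos in range(length):
        column = [site[pos] for site in aligned_sites]
        total = len(column) + pseudocount * len(alphabet)
        f = {b: (column.count(b) + pseudocount) / total for b in alphabet}
        freqs.append(f)
        pssm.append({b: math.log2(f[b] / background[b]) for b in alphabet})
    return freqs, pssm

# Made-up aligned sites and a uniform expected (background) frequency.
sites = ["TATAAA", "TATATA", "TATAAT", "TTTAAA"]
uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
freqs, pssm = build_pssm(sites, uniform)
```

The pseudocount keeps unseen bases from receiving a score of minus infinity; with real data its value would need tuning.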
PWM:
1) For a given consensus sequence a weight matrix is computed
2) It is computed by measuring the frequency of each base at each position in a training set
3) Matrix entries can be considered as probabilities
4) Under the weight matrix model, the probability that a sequence X = (x1, x2, .., xk) matches a site S is:
$$P(X = S) = \prod_{i=1}^{k} p_{i,x_i}$$
where p_{i,x_i} is the matrix entry for base x_i at position i.
If we introduce a measure of the form:
$$\mathrm{LLR}(X) = \log \frac{P(X = S)}{P(X = N)}$$
where N denotes a non-site, then the more the LLR (log-likelihood ratio) exceeds 0, the better the chances that this sequence is a functional signal.
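The product rule and the log-likelihood ratio can be sketched in Python as follows. The 4-position matrix, the uniform background standing in for the non-site model N, and all function names are illustrative assumptions, not values from the slides.

```python
import math

# Hypothetical site model: PWM[i][b] is the probability of base b at position i.
PWM = [
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
]
# Assumed non-site model N: independent, uniformly distributed bases.
BACKGROUND = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

def site_probability(window):
    """P(X = S): product of the matrix entries along the window."""
    p = 1.0
    for i, base in enumerate(window):
        p *= PWM[i][base]
    return p

def background_probability(window):
    """P(X = N) under the assumed background model."""
    p = 1.0
    for base in window:
        p *= BACKGROUND[base]
    return p

def llr(window):
    """Log-likelihood ratio log(P(X = S) / P(X = N))."""
    return math.log(site_probability(window) / background_probability(window))

def scan(sequence, threshold=0.0):
    """Return (start, LLR) for every window scoring above the threshold."""
    k = len(PWM)
    hits = []
    for start in range(len(sequence) - k + 1):
        score = llr(sequence[start:start + k])
        if score > threshold:
            hits.append((start, score))
    return hits

hits = scan("GGTATAGG", threshold=1.0)
```

Raising the threshold above 0 trades sensitivity for fewer false positives, which is the same trade-off the map-construction strategies on the next slide have to make.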
Strategies for BSTF map construction. Two strategies for constructing maps of binding sites rely on a matrix search for experimentally defined binding sites for transcription factors (BSTF). The first strategy (refined map path) is used to verify the exact location and size of the experimental sites. The second strategy (consistent map path) takes into account both the presence of the experimentally verified sites and the matrix score of the matches found (above a threshold value).
Distribution of sites shown for the even-skipped stripe 2 region of Drosophila melanogaster
Most of the experimentally verified binding sites shown are shared between the two maps (hits, shown in red). Two known Bicoid sites (false negatives, in blue) are missing from the consistent map because of their low positional weight matrix score. In vitro binding assays support the suggestion of low affinity for these two Bicoid sites (Wilson et al. 1996). High-scoring matches (false positives) to Bicoid, Krüppel, and Giant (TFs acting mainly on enhancers) are shown in green.
Markov chains
• If we can predict all of the properties of a sequence knowing only the conditional dinucleotide probabilities, then that sequence is an example of a Markov chain
• A Markov chain is defined as a sequence of states in which each state depends only on the previous state
Formalism for Markov chains
• M=(Q,π,P) is a Markov chain, where
• Q = vector (1,..,n) is the list of states
– Q(1)=A, Q(2)=C, Q(3)=G, Q(4)=T for DNA
• π = vector (p1,..,pn) is the initial probability of each state
– π(i) = p_{Q(i)} (e.g., π(1) = p_A for DNA)
• P = n x n matrix where the entry in row i and column j is the probability of observing state j if the previous state is i, and the sum of entries in each row is 1 (≡ dinucleotide probabilities)
– P(i,j) = p*_{Q(i)Q(j)} (e.g., P(1,2) = p*_{AC} for DNA)
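As a short worked consequence of this formalism (a standard property of first-order Markov chains, written here with the slide's symbols; the sequence length L is a symbol introduced for this example), the probability of a sequence x_1 … x_L under M is the initial probability of its first state times the transition probabilities along it:

```latex
P(x_1 x_2 \dots x_L \mid M) = \pi(x_1) \prod_{t=2}^{L} P(x_{t-1}, x_t)
\qquad\text{e.g.}\qquad
P(\mathrm{ACG} \mid M) = p_A \, p^{*}_{AC} \, p^{*}_{CG}
```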
Generating Markov chains
• Given Q,π,P (and a random number generator), we can generate sequences that are members of the Markov chain M
• If π,P are derived from a single sequence, the family of sequences generated by M will include that sequence as well as many others
• If π,P are derived from a sampled set of sequences, the family of sequences generated by M will be the population from which that set has been sampled
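A minimal Python sketch of deriving π and P from sequences and then generating new members of the chain. The function names, the pseudocount, and the choice to estimate π from overall base composition are assumptions made for illustration.

```python
import random

STATES = ["A", "C", "G", "T"]  # Q for DNA

def estimate_chain(sequences, pseudocount=1.0):
    """Estimate pi (base composition) and P (row-normalised dinucleotide counts)."""
    base_counts = {s: pseudocount for s in STATES}
    pair_counts = {a: {b: pseudocount for b in STATES} for a in STATES}
    for seq in sequences:
        for base in seq:
            base_counts[base] += 1
        for prev, cur in zip(seq, seq[1:]):
            pair_counts[prev][cur] += 1
    total = sum(base_counts.values())
    pi = {s: base_counts[s] / total for s in STATES}
    P = {}
    for a in STATES:
        row_total = sum(pair_counts[a].values())
        P[a] = {b: pair_counts[a][b] / row_total for b in STATES}
    return pi, P

def generate(pi, P, length, rng):
    """Draw one sequence of the given length from the chain (Q, pi, P)."""
    state = rng.choices(STATES, weights=[pi[s] for s in STATES])[0]
    out = [state]
    for _ in range(length - 1):
        state = rng.choices(STATES, weights=[P[state][s] for s in STATES])[0]
        out.append(state)
    return "".join(out)

pi, P = estimate_chain(["ACGTACGTAC"])
sample = generate(pi, P, 20, random.Random(0))
```

Because of the pseudocount every transition keeps a non-zero probability, so the generated family is larger than the training sequences themselves.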
Hidden Markov models
• “Hidden” connotes that the sequence is generated by two or more states that have different probability matrices
• π_i = state at position i in a path
• a_{kl} = P(π_i = l | π_{i-1} = k)
– probability of going from one state to another
– “transition probability”
• e_k(b) = P(x_i = b | π_i = k)
– probability of emitting a b when in state k
– “emission probability”
Goal:
The goal of using an HMM is often to determine (estimate) the sequence of underlying states that likely gave rise to an observed sequence
Algorithms:
• The Viterbi algorithm is a form of dynamic programming that finds the optimal (most probable) path through a hidden Markov model
• The Baum-Welch algorithm finds the transition and emission probabilities for a hidden Markov model, given some training examples and a structure for the model
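A minimal log-space sketch of the Viterbi algorithm in Python. The two-state model below (a background state "B" and a GC-rich state "P") and all of its probabilities are made up for illustration.

```python
import math

def viterbi(observed, states, start_p, trans_p, emit_p):
    """Return the most probable state path for `observed` (log-space DP)."""
    # v[k]: best log-probability of any path that ends in state k.
    v = {k: math.log(start_p[k]) + math.log(emit_p[k][observed[0]]) for k in states}
    backpointers = []
    for symbol in observed[1:]:
        new_v, pointers = {}, {}
        for l in states:
            # Best predecessor k for state l, using transition a_kl.
            best_prev = max(states, key=lambda k: v[k] + math.log(trans_p[k][l]))
            new_v[l] = (v[best_prev] + math.log(trans_p[best_prev][l])
                        + math.log(emit_p[l][symbol]))
            pointers[l] = best_prev
        backpointers.append(pointers)
        v = new_v
    # Trace back from the best final state.
    path = [max(states, key=lambda k: v[k])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path

# Hypothetical model: "B" emits bases uniformly, "P" prefers G and C.
STATES = ("B", "P")
START = {"B": 0.5, "P": 0.5}
TRANS = {"B": {"B": 0.9, "P": 0.1}, "P": {"B": 0.1, "P": 0.9}}
EMIT = {
    "B": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    "P": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
}

path = "".join(viterbi("AAAAAAGGGGGGGG", STATES, START, TRANS, EMIT))
```

Working with log-probabilities avoids numerical underflow on long sequences; the recursion is otherwise the textbook one.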
Example Profile HMM for five aligned tripeptides
From ``Profile hidden Markov models'' Sean R. Eddy, Bioinformatics 14(9):755-63, 1998.
Linear Discriminant Methods
Many functional signals are very short
=> Exploit related characteristics
1. We build a sequence characteristics vector (x1, …, xp)
2. We define
$$z = \sum_{i=1}^{p} a_i x_i$$
and if z > c then the sequence corresponds to a site
3. We use a training set to define {a_i} and c:
$$a = s^{-1}(m_1 - m_2), \qquad c = a\,(m_1 + m_2)/2$$
4. The training set of site sequences defines the mean vector m1 and the set of non-site sequences the mean vector m2; s is the pooled covariance matrix of the characteristics
1. Choose a set of p characteristics
– Score of the weight matrix
– Distance to a predicted site
– Base composition in distant sequence
2. Test the characteristics with the Mahalanobis distance:
$$D^2 = (m_1 - m_2)\, s^{-1} (m_1 - m_2)$$
3. Choose the set of q characteristics that maximizes D^2
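The linear discriminant and the Mahalanobis criterion can be sketched together in pure Python. The helper names, the toy feature vectors (a weight-matrix score, a composition value and an uninformative third feature) and the use of a pooled covariance for s are assumptions for illustration.

```python
from itertools import combinations

def mean_vector(rows):
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(len(rows[0]))]

def pooled_covariance(class1, class2):
    """Pooled within-class covariance matrix s of the two training sets."""
    p = len(class1[0])
    s = [[0.0] * p for _ in range(p)]
    for rows in (class1, class2):
        m = mean_vector(rows)
        for r in rows:
            for i in range(p):
                for j in range(p):
                    s[i][j] += (r[i] - m[i]) * (r[j] - m[j])
    dof = len(class1) + len(class2) - 2
    return [[value / dof for value in row] for row in s]

def solve(matrix, rhs):
    """Solve matrix * x = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(matrix[i]) + [rhs[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def train_lda(sites, non_sites):
    """Return (a, c, D2) with a = s^-1 (m1 - m2) and c = a (m1 + m2) / 2."""
    m1, m2 = mean_vector(sites), mean_vector(non_sites)
    diff = [u - v for u, v in zip(m1, m2)]
    a = solve(pooled_covariance(sites, non_sites), diff)
    c = sum(ai * (u + v) for ai, u, v in zip(a, m1, m2)) / 2
    d2 = sum(di * ai for di, ai in zip(diff, a))  # Mahalanobis distance D^2
    return a, c, d2

def is_site(x, a, c):
    """Classify x as a site when z = sum(a_i * x_i) exceeds c."""
    return sum(ai * xi for ai, xi in zip(a, x)) > c

def project(rows, idx):
    return [[r[i] for i in idx] for r in rows]

def best_subset(sites, non_sites, q):
    """Pick the q characteristics (by index) that maximise D^2."""
    p = len(sites[0])
    return max(combinations(range(p), q),
               key=lambda idx: train_lda(project(sites, idx),
                                         project(non_sites, idx))[2])

# Made-up training vectors: (matrix score, composition, uninformative feature).
sites = [(5.0, 0.6, 1.0), (6.0, 0.7, 2.0), (5.5, 0.5, 1.5), (6.5, 0.8, 0.5)]
non_sites = [(1.0, 0.4, 2.0), (2.0, 0.5, 0.5), (1.5, 0.3, 1.0), (2.5, 0.6, 1.5)]
chosen = best_subset(sites, non_sites, 1)
```

On a real training set D^2 should be evaluated on held-out data, since adding characteristics never lowers it on the data used to fit s.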
Artificial Neural Networks
• Use positive and negative data.
• Can find relations between different positions.
• Iterative training (without the need for prior knowledge of the structure of the solution)
Tuning parameters
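[Figure: simple feedforward ANN trained by the Bayesian regularisation method. The inputs s_IE, s_I and s_E are weighted by w_i and summed, net = Σ s_i * w_i; the output tanh(net) is compared with a tuned threshold.]
$$\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$$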
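The forward pass of a single tanh unit with a decision threshold can be sketched as below. The function names and the example numbers are hypothetical; training (for instance by Bayesian regularisation) is not shown.

```python
import math

def neuron_output(inputs, weights):
    """Compute tanh(net) with net = sum(s_i * w_i)."""
    net = sum(s * w for s, w in zip(inputs, weights))
    return math.tanh(net)

def predict(inputs, weights, threshold):
    """Report a positive prediction when the unit's output exceeds the threshold."""
    return neuron_output(inputs, weights) > threshold

# Hypothetical input signals and weights.
output = neuron_output([1.0, 2.0], [0.5, 0.25])
decision = predict([1.0, 2.0], [0.5, 0.25], threshold=0.5)
```

Moving the threshold up or down is one way to trade sensitivity against positive predictive value without retraining the weights.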
Name | Techniques used | Features used
SIGNALSCAN | PWM | TATA box, CAAT box, GC box, TSS, TFBS
MATRIXSEARCH | PWM | TATA, CAAT, GC, TFBS, TSS
MatInd/MatInspector | PWM | TATA, CAAT, GC, TFBS, TSS
ConsInspector | Alignment based | TFBS of unlimited length
TFSearch | PWM | TFBS, TSS
TRANSFAC | PWM |
PromoterInspector | PWM | Promoter
PromoterScan | PWM | TATA box, TFBS
TSSG/TSSW | LDA | TATA box, TFBS, hexamer frequency, TSS
CpGProD | LDA | CpG island, AT/GC content
CorePromoter/FirstEF | QDA | CpG island
CpG Promoter | QDA | CpG island, TSS
SAMPLER | Gibbs sampling | TFBS, TSS
AlignACE | Gibbs sampling | TFBS, TSS
MEME | EM | TSS, TFBS
Promoter2 | NN | TATA box, Inr, CAAT box, GC box
DGSF | NN | CpG island, TSS, DPF
DPF | NN | Promoter, Exon, Intron, TSS
McPromoter | NN & interpolated Markov models | TATA box, CAAT box, GC box, nucleosome position
NNPP | Time-delay NN | TATA box, Inr
Eponine | SVM | TATA box, GC box, TSS
Audic/Claverie approach | HMM | Pol II promoters
Criteria used to evaluate performance quality:
1) Sensitivity: sensitivity= TP/(TP+FN)
2) Positive predictive value: ppv=TP/(TP+FP)
3) Pearson correlation coefficient:
cc = (TP×TN − FP×FN) / √((TP+FP)(TP+FN)(TN+FP)(TN+FN))
TP= True Positive
FP= False Positive
TN= True Negative
FN= False Negative
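These three criteria are straightforward to compute from the four counts. The sketch below uses the standard TP×TN − FP×FN numerator for the correlation coefficient; the function names and the example counts are made up for illustration.

```python
import math

def sensitivity(tp, fn):
    """Fraction of true sites that were predicted: TP / (TP + FN)."""
    return tp / (tp + fn)

def ppv(tp, fp):
    """Fraction of predictions that are true sites: TP / (TP + FP)."""
    return tp / (tp + fp)

def correlation(tp, fp, tn, fn):
    """Correlation coefficient between predictions and annotations."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # convention chosen here for degenerate confusion tables
    return (tp * tn - fp * fn) / denom

# Hypothetical counts from one evaluation run.
se = sensitivity(tp=40, fn=5)
pp = ppv(tp=40, fp=10)
cc = correlation(tp=40, fp=10, tn=45, fn=5)
```

Unlike sensitivity and PPV, the correlation coefficient uses all four counts, so it penalises tools that reach high sensitivity only by over-predicting.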
Performance of the most efficient tools for TFBS recognition
Reference: Bajic et al., "Promoter prediction analysis on the whole human genome", Nature Biotechnology 22, Nov. 2004
Tools | Sensitivity (%) | PPV (%) | CC (%)
CpGProD | 37-48 | 51-70 | 49-51
DGSF | 61-65 | 62-64 | 63-64
DPF | 53-80 | 15-32 | 34-45
FirstEF | 79-81 | 35-40 | 53-56
Eponine | ≈40 | ≈67 | ≈52
NNPP2.2 | 69-93 | 2.0-4.5 | 15-17
Promoter 2.0 | 44-57 | ≈4.5 | ≈14
McPromoter2.0 | 26-57 | 70-87 | -
TSSG/TSSW | ≈29/≈42 | ≈72/≈59 | -
Audic | ≈24 | ≈82 | -
Limitations of currently used algorithms:
1) PWM: The matrix is generated from nucleotide frequencies, so a weak dataset can give a weak matrix. Each position is treated as independent of the others.
2) Gibbs sampling and EM: the search for initial parameters is not systematic.
3) NN: the individual relationships between input variables and output variables are not developed by engineering judgement. Minimizing overfitting also takes a lot of computational effort.
4) Discriminant analysis: the covariance matrix may have undesirable properties.
5) Phylogenetic footprinting: doesn't work for very short or very long sequences.
6) Fixed-order Markov model: works fine for short and medium sequences, but for very long sequences it is difficult to build a higher-order Markov chain.
7) Hidden Markov model: it is difficult to implement additional hidden states.
Problems with sequence based methods:
1) They work well for CpG island associated transcription factors, but offer no solution for non-CpG island associated TFs.
2) They give no idea of the physicochemical forces involved in DNA-protein interactions.
3) They give no clue about DNA bendability during the interaction between DNA and TFs.
4) They give no idea of the effect that binding of a first TF has on the position of another proximal TF, which may lead to multiple transcription in the genome.
Structure based methods:
Two categories of structure-based approaches:
1) Statistical potentials
2) Potentials obtained from molecular mechanics
Statistical potentials are derived from systematic analysis of structures of protein-DNA complexes. Pairwise potentials are extracted from the distributions of atoms around the DNA bases of known protein-DNA complexes, which reflect the statistical occurrence of specific interactions. The free-energy interaction map between pairs of bases and amino acids is used to simulate the different bonds, e.g. hydrogen bonds, disulphide bonds, C-C bonds etc. By calculating physical forces it is possible to calculate the 3D folding of DNA-protein complexes, and this can be used to evaluate DNA cooperativity, which helps in understanding multiple transcription of DNA.
Advantages:
1) Easy to find transcription factors associated with both CpG and non-CpG islands
2) Easy to predict DNA bendability
3) Easy to predict co-occurrence of TFs on DNA, where one TF affects the occurrence of another proximal TF
Disadvantages of structure-based predictions:
1) Require high computing power
2) Must account for whole-system interactions
Conclusion:
1) Sequence-based methods are abundant, but each is weak in one or more areas and needs improvement.
2) Structure-based methods need high computing power.
3) It is better to use a good number of highly efficient sequence-based methods to predict TFs.
4) Structure-based methods can be used as a final evaluation of predicted binding sites. A combination of sequence-based and structure-based methods can be used to predict putative binding sites.