Promoter Recognition in silico - cs.helsinki.fi · Recognition – in silico ... of two steps...

Promoter and other TFBS Recognition – in silico

(mostly sequence based approach)

by

Udyant Kumar

LCE, HUT, ESPOO


Central Dogma

• Gene expression consists of two steps:

Transcription

DNA → mRNA

Events: DNA packing, DNA methylation, chromosome puffs, promoter and enhancer regions

Translation

mRNA → Protein

Events: RNA processing, lifetime of mRNA, masked messengers, polypeptide cleaving, metabolic regulation



Transcription factors:

1) Promoters: Defined as the region of DNA immediately upstream of the transcription start site, to which multiple transcription factors bind at specific sequence boxes to promote initiation of transcription.

2) Enhancers: Other DNA sequences, called enhancers, are often required for promoter activation. An enhancer must be linked to the gene (in cis) but can be found at any position with respect to the promoter.



Methods for predicting Transcription factor binding sites:

1) Sequence-based method: It relies on sequence information obtained from known binding sequences. Usually, consensus sequence patterns or weight matrices are used to scan the database.

2) ΔG-based method: This is based on experimental measurements of binding between protein and DNA. The binding-affinity data for systematic single-base mutations of a consensus binding site can be used to derive matrices similar to the weight matrices of the sequence-based method. It requires laborious experiments.

3) Structure-based method: This is based on the analysis of structural databases of protein-DNA complexes. Empirical potential functions for the specific interactions between bases and amino acids can be derived from the statistical analysis.

4) Ab-initio method: This method does not rely on any experimental data; it uses computer simulations to derive contact potentials between bases and amino acids.



How transcription factors work:

a) For proteins to bind DNA specifically they must be basic, and they generally have protuberances that fit into the major groove of the double helix to recognise the internal base sequence.

b) Special protein structures are utilized for DNA binding. Examples: helix-turn-helix; zinc (Zn2+) fingers. Dimerization of transcription factors through leucine zipper and helix-loop-helix motifs also generates DNA-binding structures.

c) Once a transcription factor binds to the promoter DNA sequence, a separate domain of the protein, termed the activation domain, may interact directly with Pol II or with another factor (adaptor), which in turn may interact with Pol II.



Experimental methods to find Transcription factors:

Nitrocellulose binding assay;

Electrophoretic mobility shift assay (EMSA)

Enzyme-linked immunosorbent assay (ELISA)

DNA footprinting

DNA-protein crosslinking (DPC)

Reporter constructs

Chromatin immunoprecipitation (ChIP)

Systematic evolution of ligands by exponential enrichment (SELEX)

Phage display

X-ray crystallography

NMR spectroscopy



In-silico methods, why?

- Experimental methods are often time-consuming

- In-silico methods provide an extension to present in-vitro methods

Goal:

1) To predict potential binding sites of a known transcription factor

2) To discover a sequence motif, as well as its putative sites, in a collection of long intergenic sequences



Sequence-based methods: These fall into two categories.

1) Profile driven (alignment based): based on pairwise comparisons between input sequences to form common patterns corresponding to well-conserved functional binding sites. Approaches: local alignment (BLAST, PipMaker, BBA and DBA); global alignment (ClustalW, VISTA); and phylogenetic footprinting

2) Pattern driven (consensus based): a set of real transcription factor binding sites is used to build a characteristic representation, or profile, of the site.

Prominent algorithms are:

position weight matrix (PWM), Gibbs sampling, expectation-maximization (EM), multiple discriminant analysis (MDA), artificial neural network (ANN), fixed-order Markov model, hidden Markov model



PWM

• Definition: For a feature of length m using an alphabet of n characters, a PWM is an n by m matrix in which each element contains the frequency (score) at which a given member of the alphabet is observed at a given position in an aligned set of sequences containing the feature

• Three uses of PWM

– Describe a sequence feature

– Calculate probability of occurrence of feature in a random sequence

– Calculate degree of match between a new sequence and a feature
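
The definition above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the aligned sites in `toy_sites` are invented for the example (not real binding sites), and the pseudocount of 1 per base is an assumed smoothing choice that the slides do not specify.

```python
from collections import Counter

ALPHABET = "ACGT"  # n = 4 characters for DNA


def build_frequency_matrix(sites, pseudocount=1.0):
    """Return {base: [frequency at each position]} for equal-length aligned sites."""
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all aligned sites must have the same length")
    matrix = {base: [] for base in ALPHABET}
    # Each column is normalised by the site count plus the pseudocounts added.
    total = len(sites) + pseudocount * len(ALPHABET)
    for pos in range(length):
        counts = Counter(site[pos] for site in sites)
        for base in ALPHABET:
            matrix[base].append((counts[base] + pseudocount) / total)
    return matrix


# Hypothetical aligned sites, for illustration only.
toy_sites = ["TATAAA", "TATAAT", "TATATA", "TACAAA"]
pwm = build_frequency_matrix(toy_sites)
```

Each column of `pwm` sums to 1, so the entries can be read as per-position base probabilities.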



Block Diagram for Building a PSSM

[Diagram: a set of aligned sequence features and the expected frequencies of each sequence element are the inputs to the PSSM builder, which outputs the PSSM.]



PWM:

1) For a given consensus sequence a weight matrix is computed

2) It is computed by measuring the frequency of each base at each position in a training set

3) Matrix entries can be considered as probabilities

4) Under the weight matrix model, the probability that a sequence X = (x1, x2, .., xk) matches a site is:

$P(X = S) = \prod_{i=1}^{k} p_i(x_i)$

where $p_i(x_i)$ is the matrix entry for base $x_i$ at position $i$.

If we introduce a measure of the form:

$LLR(X) = \log \frac{P(X = S)}{P(X = N)}$

where P(X = N) is the probability of the same sequence under the non-site model, then the more LLR (log likelihood ratio) exceeds 0, the better the chances that this sequence is a functional signal
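
As a rough sketch of how the log likelihood ratio can be used to scan a sequence, the code below scores every window against a small matrix. The matrix `toy_pwm`, the uniform background (standing in for the non-site model N) and the threshold of 0 are all assumptions made for the example, not values from the slides.

```python
import math


def log_likelihood_ratio(window, pwm, background):
    """LLR = sum over positions of log(p_i(x_i) / background(x_i))."""
    return sum(math.log(pwm[base][i] / background[base])
               for i, base in enumerate(window))


def scan(sequence, pwm, background, threshold=0.0):
    """Return (start, score) for every window whose LLR exceeds the threshold."""
    k = len(next(iter(pwm.values())))  # motif length
    hits = []
    for start in range(len(sequence) - k + 1):
        score = log_likelihood_ratio(sequence[start:start + k], pwm, background)
        if score > threshold:
            hits.append((start, score))
    return hits


# Hypothetical 3-position matrix favouring the word "ATA".
toy_pwm = {
    "A": [0.7, 0.1, 0.7],
    "C": [0.1, 0.1, 0.1],
    "G": [0.1, 0.1, 0.1],
    "T": [0.1, 0.7, 0.1],
}
background = {base: 0.25 for base in "ACGT"}  # assumed uniform non-site model
hits = scan("GGATAGG", toy_pwm, background)
```

Only the window starting at index 2 ("ATA") scores above 0 here; every other window contains at least two disfavoured bases.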



Strategies for BSTF map construction. Two strategies for constructing maps of binding sites rely on a matrix search for experimentally defined binding sites for transcription factors (BSTF). The first strategy (refined map path) is used to verify the exact location and size of the experimental sites. The second strategy (consistent map path) takes into account both the presence of the experimentally verified sites and the matrix score of the matches found (those above a threshold value).



Distribution of sites shown for the even-skipped stripe 2 region of Drosophila melanogaster

Most of the experimentally verified binding sites shown are shared between the two maps (hits, shown in red). Two known Bicoid sites (false negatives, in blue) are missing from the consistent map because of their low position weight matrix score. In vitro binding assays support the suggestion of low affinity for these two Bicoid sites (Wilson et al. 1996). High-scoring matches (false positives) to Bicoid, Krüppel, and Giant (TFs, mainly at enhancers) are shown in green



Markov chains

• If we can predict all of the properties of a sequence knowing only the conditional dinucleotide probabilities, then that sequence is an example of a Markov chain

• A Markov chain is defined as a sequence of states in which each state depends only on the previous state



Formalism for Markov chains

• M=(Q,π,P) is a Markov chain, where

• Q = vector (1,..,n) is the list of states

– Q(1)=A, Q(2)=C, Q(3)=G, Q(4)=T for DNA

• π = vector (p1,..,pn) is the initial probability of each state

– π(i)=pQ(i) (e.g., π(1)=pA for DNA)

• P = n x n matrix where the entry in row i and column j is the probability of observing state j if the previous state is i, and the sum of entries in each row is 1 (≡ conditional dinucleotide probabilities)

– P(i,j)=p*Q(i)Q(j) (e.g., P(1,2)=p*AC for DNA)
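
A minimal sketch of how π and P might be estimated from training sequences follows. The function name `estimate_markov_chain` and the pseudocount of 1 are assumptions for illustration; π is taken here as the overall base composition, in line with π(1)=pA above.

```python
STATES = "ACGT"  # Q(1)=A, Q(2)=C, Q(3)=G, Q(4)=T


def estimate_markov_chain(sequences, pseudocount=1.0):
    """Estimate (pi, P) of a first-order Markov chain from DNA sequences."""
    n = len(STATES)
    index = {state: i for i, state in enumerate(STATES)}
    base_counts = [pseudocount] * n
    pair_counts = [[pseudocount] * n for _ in range(n)]
    for seq in sequences:
        for base in seq:
            base_counts[index[base]] += 1
        # Count each dinucleotide: previous base = row, current base = column.
        for prev, curr in zip(seq, seq[1:]):
            pair_counts[index[prev]][index[curr]] += 1
    total = sum(base_counts)
    pi = [count / total for count in base_counts]
    # Normalise each row so that it sums to 1 (conditional probabilities).
    P = [[count / sum(row) for count in row] for row in pair_counts]
    return pi, P


pi, P = estimate_markov_chain(["ACACAC"])  # toy input, for illustration only
```

With the toy input, the row for A puts most of its mass on C, reflecting the three AC dinucleotides in the sequence.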



Generating Markov chains

• Given Q,π,P (and a random number generator), we can generate sequences that are members of the Markov chain M

• If π,P are derived from a single sequence, the family of sequences generated by M will include that sequence as well as many others

• If π,P are derived from a sampled set of sequences, the family of sequences generated by M will be the population from which that set has been sampled
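
Generating a member of the chain can be sketched as below, assuming π is given as a list of four probabilities and P as a 4 x 4 list of rows in the order A, C, G, T. The function name and the deterministic toy matrix are illustrative choices only.

```python
import random


def generate_sequence(pi, P, length, states="ACGT", rng=None):
    """Sample one sequence of the given length from the Markov chain (Q, pi, P)."""
    rng = rng or random.Random()
    # The first state is drawn from the initial distribution pi.
    seq = rng.choices(states, weights=pi, k=1)
    while len(seq) < length:
        # Each later state depends only on the previous one (row of P).
        row = P[states.index(seq[-1])]
        seq.extend(rng.choices(states, weights=row, k=1))
    return "".join(seq)


# Toy chain in which A is always followed by C and C by A (G and T likewise).
toy_pi = [1.0, 0.0, 0.0, 0.0]
toy_P = [
    [0.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, 1.0, 0.0],
]
example = generate_sequence(toy_pi, toy_P, 6)
```

Because the toy chain is deterministic, `example` is always "ACACAC"; with a matrix estimated from real sequences each call would return a different member of the family.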



Hidden Markov models

• “Hidden” connotes that the sequence is generated by two or more states that have different probability matrices

• $\pi_i$ = state at position i in a path

• $a_{kl} = P(\pi_i = l \mid \pi_{i-1} = k)$

– probability of going from one state to another

– “transition probability”

• $e_k(b) = P(x_i = b \mid \pi_i = k)$

– probability of emitting a b when in state k

– “emission probability”



Goal:

The goal of using an HMM is often to determine (estimate) the sequence of underlying states that likely gave rise to an observed sequence

Algorithms:

• The Viterbi algorithm is a form of dynamic programming that finds the optimal (most probable) path through a hidden Markov model

• Baum-Welch algorithm finds the transition and emission probabilities for a hidden Markov model given some training examples and a structure for the model
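
The Viterbi recursion can be sketched as follows, working in log space to avoid underflow. The two-state model (an AT-rich state "L" and a GC-rich state "H") and all of its probabilities are invented for the example; they are not taken from the slides or from any trained model.

```python
import math


def viterbi(observed, states, start_p, trans_p, emit_p):
    """Return the most probable state path for the observed sequence."""
    # V[i][k]: log probability of the best path that ends in state k at position i.
    V = [{k: math.log(start_p[k]) + math.log(emit_p[k][observed[0]]) for k in states}]
    back = [{}]
    for i in range(1, len(observed)):
        V.append({})
        back.append({})
        for l in states:
            # Best previous state k for arriving in state l (uses a_kl).
            best_k = max(states, key=lambda k: V[i - 1][k] + math.log(trans_p[k][l]))
            V[i][l] = (V[i - 1][best_k] + math.log(trans_p[best_k][l])
                       + math.log(emit_p[l][observed[i]]))  # uses e_l(x_i)
            back[i][l] = best_k
    # Trace back from the best final state.
    path = [max(states, key=lambda k: V[-1][k])]
    for i in range(len(observed) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return path[::-1]


# Hypothetical model: "L" emits mostly A/T, "H" emits mostly C/G.
states = ("L", "H")
start_p = {"L": 0.5, "H": 0.5}
trans_p = {"L": {"L": 0.9, "H": 0.1}, "H": {"L": 0.1, "H": 0.9}}
emit_p = {
    "L": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
    "H": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
}
path = "".join(viterbi("AATTGGCC", states, start_p, trans_p, emit_p))
```

For this input the decoded path switches from "L" to "H" exactly where the sequence changes from AT-rich to GC-rich. Baum-Welch training of the probabilities is not shown.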



Example Profile HMM for five aligned tripeptides

From ``Profile hidden Markov models'' Sean R. Eddy, Bioinformatics 14(9):755-63, 1998.



Linear Discriminant Methods

Many functional signals are very short

=> Exploit related characteristics

1. We build a sequence characteristics vector (x1, …, xp)

2. We define $z = \sum_{i=1}^{p} a_i x_i$, and if z > c then the sequence corresponds to a site

3. We use a training set to define {ai} and c

4. The training set of site sequences defines a mean vector m1 and the set of non-site sequences a mean vector m2; with s the pooled covariance matrix of the characteristics:

$a = s^{-1}(m_1 - m_2), \qquad c = a(m_1 + m_2)/2$
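
The training step can be sketched as below for the special case of two characteristics, which keeps the matrix inverse explicit. The restriction to two features, the helper names, and the toy training vectors are all assumptions for illustration; s is taken to be the pooled covariance matrix.

```python
def mean_vector(rows):
    """Column-wise mean of a list of feature vectors."""
    return [sum(r[j] for r in rows) / len(rows) for j in range(len(rows[0]))]


def pooled_covariance(group1, group2):
    """Pooled covariance matrix s of two groups of feature vectors."""
    p = len(group1[0])
    s = [[0.0] * p for _ in range(p)]
    for rows in (group1, group2):
        m = mean_vector(rows)
        for r in rows:
            for i in range(p):
                for j in range(p):
                    s[i][j] += (r[i] - m[i]) * (r[j] - m[j])
    dof = len(group1) + len(group2) - 2
    return [[v / dof for v in row] for row in s]


def invert_2x2(s):
    det = s[0][0] * s[1][1] - s[0][1] * s[1][0]
    return [[s[1][1] / det, -s[0][1] / det], [-s[1][0] / det, s[0][0] / det]]


def train_ldf(sites, non_sites):
    """Return (a, c) with a = s^-1 (m1 - m2) and c = a (m1 + m2) / 2."""
    m1, m2 = mean_vector(sites), mean_vector(non_sites)
    s_inv = invert_2x2(pooled_covariance(sites, non_sites))
    diff = [m1[0] - m2[0], m1[1] - m2[1]]
    a = [s_inv[i][0] * diff[0] + s_inv[i][1] * diff[1] for i in range(2)]
    c = sum(a[i] * (m1[i] + m2[i]) for i in range(2)) / 2
    return a, c


def is_site(x, a, c):
    """Classify x as a site when z = sum(a_i * x_i) exceeds c."""
    return sum(ai * xi for ai, xi in zip(a, x)) > c


# Hypothetical training vectors: (weight matrix score, base composition).
sites = [(5.0, 1.0), (6.0, 2.0), (5.5, 1.0), (6.5, 2.0)]
non_sites = [(1.0, 1.5), (2.0, 1.0), (1.5, 2.0), (2.5, 1.5)]
a, c = train_ldf(sites, non_sites)
```

With this choice of c the decision boundary passes midway between the two group means, so the mean site vector is always classified as a site and the mean non-site vector is not.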



1. Choose a set of p characteristics

– Score of the weight matrix

– Distance to a predicted site

– Base composition in distant sequence

2. Test the characteristics with the Mahalanobis distance:

$D^2 = (m_1 - m_2)\, s^{-1} (m_1 - m_2)$

3. Choose the set of q characteristics that maximizes $D^2$
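
One simple way to test individual characteristics is to compute D² for each one on its own, where the formula reduces to the squared difference of means divided by the pooled variance. The sketch below does only that single-feature ranking; it is a simplification of the subset search described above, and the feature names and toy values are hypothetical.

```python
def univariate_d2(site_values, non_site_values):
    """Mahalanobis D^2 for one characteristic: (m1 - m2)^2 / pooled variance."""
    n1, n2 = len(site_values), len(non_site_values)
    m1 = sum(site_values) / n1
    m2 = sum(non_site_values) / n2
    ss = (sum((v - m1) ** 2 for v in site_values)
          + sum((v - m2) ** 2 for v in non_site_values))
    pooled_var = ss / (n1 + n2 - 2)
    return (m1 - m2) ** 2 / pooled_var


def rank_characteristics(sites, non_sites, names):
    """Return (name, D^2) pairs sorted from most to least discriminating."""
    scores = {}
    for j, name in enumerate(names):
        scores[name] = univariate_d2([r[j] for r in sites],
                                     [r[j] for r in non_sites])
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical feature vectors: (weight matrix score, base composition).
sites = [(5.0, 1.0), (6.0, 2.0), (5.5, 1.0), (6.5, 2.0)]
non_sites = [(1.0, 1.5), (2.0, 1.0), (1.5, 2.0), (2.5, 1.5)]
ranking = rank_characteristics(sites, non_sites, ["pwm_score", "base_composition"])
```

In the toy data the weight matrix score separates the two groups well, while the base composition has identical means in both groups and therefore a D² of 0.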



Artificial Neural Networks

• Use positive and negative data.

• Can find relations between different positions.

• Iterative training (without the need of prior knowledge for the structure of the solution)



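Tuning parameters

[Figure: simple feedforward ANN trained by the Bayesian regularisation method. Signals s_i are multiplied by weights w_i and summed, net = Σ s_i * w_i; the unit outputs tanh(net), which is compared with a tuned threshold. Labels in the figure: sIE, sI, sE.]

$\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$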
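
The forward pass of a single unit of this kind can be sketched as below. Only the weighted sum, the tanh activation and the threshold comparison are shown; the Bayesian regularisation training is not, and the signal values, weights and threshold are arbitrary numbers chosen for the example.

```python
import math


def neuron_output(signals, weights):
    """Compute tanh(net) with net = sum of s_i * w_i."""
    net = sum(s * w for s, w in zip(signals, weights))
    return math.tanh(net)


def classify(signals, weights, threshold=0.0):
    """Report a positive prediction when the unit output exceeds the threshold."""
    return neuron_output(signals, weights) > threshold


# Arbitrary illustrative values, not trained parameters.
balanced = neuron_output([1.0, -1.0], [0.5, 0.5])   # net = 0
positive = classify([1.0, 1.0], [0.5, 0.5], threshold=0.5)  # net = 1
```

Raising or lowering the threshold trades sensitivity against false positives, which is presumably what the tuned threshold in the figure refers to.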



Name                      Technique used                    Features used

SIGNALSCAN                PWM                               TATA box, CAAT box, GC box, TSS, TFBS
MATRIXSEARCH              PWM                               TATA, CAAT, GC, TFBS, TSS
MatInd/MatInspector       PWM                               TATA, CAAT, GC, TFBS, TSS
ConsInspector             Alignment based                   TFBS of unlimited length
TFSearch                  PWM                               TFBS, TSS
TRANSFAC                  PWM                               -
PromoterInspector         PWM                               Promoter
PromoterScan              PWM                               TATA box, TFBS
TSSG/TSSW                 LDA                               TATA box, TFBS, hexamer frequency, TSS
CpGProD                   LDA                               CpG island, AT/GC content
CorePromoter/FirstEF      QDA                               CpG island
CpG Promoter              QDA                               CpG island, TSS
SAMPLER                   Gibbs sampling                    TFBS, TSS
AlignACE                  Gibbs sampling                    TFBS, TSS
MEME                      EM                                TSS, TFBS
Promoter2                 NN                                TATA box, Inr, CAAT box, GC box
DGSF                      NN                                CpG island, TSS, DPF
DPF                       NN                                Promoter, Exon, Intron, TSS
McPromoter                NN & interpolated Markov models   TATA box, CAAT box, GC box, nucleosome position
NNPP                      Time Delay NN                     TATA box, Inr
Eponine                   SVM                               TATA box, GC box, TSS
Audic/Claverie approach   HMM                               Pol II promoters



Criteria used to evaluate performance quality:

1) Sensitivity: sensitivity= TP/(TP+FN)

2) Positive predictive value: ppv=TP/(TP+FP)

3) Pearson correlation coefficient:

cc = (TP×TN − FP×FN) / √((TP+FP)(TP+FN)(TN+FP)(TN+FN))

TP= True Positive

FP= False Positive

TN= True Negative

FN= False Negative
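
The three criteria can be computed directly from the four counts, as in the sketch below. The function names and the example counts are made up for illustration, and returning 0 when the denominator of the correlation coefficient is zero is an assumed convention.

```python
import math


def sensitivity(tp, fn):
    return tp / (tp + fn)


def positive_predictive_value(tp, fp):
    return tp / (tp + fp)


def correlation_coefficient(tp, fp, tn, fn):
    """Pearson correlation between predictions and annotations."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # assumed convention for the undefined case
    return (tp * tn - fp * fn) / denominator


# Made-up counts: 40 TP, 10 FP, 40 TN, 10 FN.
se = sensitivity(40, 10)
ppv = positive_predictive_value(40, 10)
cc = correlation_coefficient(40, 10, 40, 10)
```

For these counts the sensitivity and PPV are both 0.8 and the correlation coefficient is 0.6.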



Performance of the most efficient tools for TFBS recognition

Reference: Bajic et al., Promoter prediction analysis on the whole human genome, Nature Biotechnology, 22, Nov. 2004

Tools            Sensitivity (%)   PPV (%)     CC (%)

CpGProD          37-48             51-70       49-51
DGSF             61-65             62-64       63-64
DPF              53-80             15-32       34-45
FirstEF          79-81             35-40       53-56
Eponine          ≈40               ≈67         ≈52
NNPP2.2          69-93             2.0-4.5     15-17
Promoter 2.0     44-57             ≈4.5        ≈14
McPromoter2.0    26-57             70-87       -
TSSG/TSSW        ≈29/≈42           ≈72/≈59     -
Audic            ≈24               ≈82         -



Limitations of currently used algorithms:

1) PWM: The matrix is built from observed nucleotide frequencies, so a small or unrepresentative dataset gives a weak matrix. Each position is also assumed to be independent of the others.

2) Gibbs sampling and EM: the search over initial parameters is not systematic.

3) NN: the individual relationships between input variables and output variables are not developed by engineering judgement. Minimizing overfitting also takes a lot of computational effort.

4) Discriminant analysis: the covariance matrix may have undesirable properties.

5) Phylogenetic footprinting: does not work for very short or very long sequences.

6) Fixed-order Markov model: works fine for short and medium sequences, but for very long sequences it is difficult to build a higher-order Markov chain.

7) Hidden Markov model: it is difficult to implement additional hidden states.



Problems with sequence-based methods:

1) They work well for transcription factors associated with CpG islands, but offer no solution for TFs that are not associated with CpG islands.

2) They give no idea of the physicochemical forces involved in DNA-protein interactions.

3) They give no clue about DNA bendability during the interaction between DNA and TFs.

4) They give no idea of the effect that binding of one TF has on the position of another proximal TF, which may lead to multiple transcription in the genome.



Structure-based methods:

Two categories of structure-based approaches:

1) Statistical potentials

2) Potentials obtained from molecular mechanics

Statistical potentials are derived from systematic analysis of the structures of protein-DNA complexes. Pairwise potentials are extracted from the distributions of atoms around the DNA bases of known protein-DNA complexes, which reflect the statistical occurrence of specific interactions. The free-energy interaction map between pairs of bases and amino acids is used to simulate the different bonds, e.g. hydrogen bonds, disulphide bonds, C-C bonds etc. By calculating the physical forces it is possible to calculate the 3D folding of DNA-protein complexes, which can be used to evaluate DNA cooperativity and so to understand multiple transcription of DNA.



Advantages:

1) Easy to find transcription factors associated with both CpG and non-CpG islands

2) Easy to predict DNA bendability

3) Easy to predict co-occurrence of TFs on DNA, where one TF affects the occurrence of another proximal TF

Disadvantages of structure-based predictions:

1) High computing power is required

2) Whole-system interactions must be accounted for



Conclusion:

1) Sequence-based methods are abundant, but each is weak in one or more areas and needs improvement.

2) Structure-based methods need high computing power.

3) It is better to use a good number of highly efficient sequence-based methods to predict TFs.

4) Structure-based methods can be used as a final evaluation of predicted binding sites. A combination of sequence-based and structure-based methods can be used to predict putative binding sites.


