Computational Approaches to Promoter Analysis

Part II

Shifra Ben-Dor, June 2010

First define your question

•  Are you looking for the transcription start site?

•  Are you interested in transcription factor binding sites?

•  Are you interested in proximal or distal signals?

•  Locate TSS

•  Identify transcription factor binding sites

•  Characterize regulatory properties

TSS identification

•  There are programs that attempt to identify TSS

•  These are based on different algorithms

•  The best is to use whatever transcriptional information is available

•  Expert's Pick: UCSC Genome Browser

Fickett JW, Hatzigeorgiou AG. Genome Res 1997 Sep;7(9):861-78

Evaluation of several programs' ability to predict the TSS. If a program reports no TSS, the 3’ end of the predicted promoter region is used. A prediction counts as correct if it lies within 200 bp 5’ or 100 bp 3’ of the actual TSS.
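The correctness criterion above can be sketched in a few lines. This is only an illustration: the function name, the coordinate convention (positions increase along the plus strand) and the strand handling are assumptions, not part of the cited evaluation.

```python
def is_correct_prediction(predicted, actual, strand="+", upstream=200, downstream=100):
    """Return True if a predicted TSS falls inside the tolerance window.

    The window spans `upstream` bp 5' and `downstream` bp 3' of the actual TSS.
    Coordinates are assumed to increase along the plus strand.
    """
    offset = predicted - actual
    if strand == "-":
        # On the minus strand, 5' lies at higher coordinates, so flip the sign.
        offset = -offset
    return -upstream <= offset <= downstream


print(is_correct_prediction(850, 1000))               # 150 bp 5' of the TSS
print(is_correct_prediction(1150, 1000))              # 150 bp 3' of the TSS
print(is_correct_prediction(1150, 1000, strand="-"))  # 150 bp 5' on the minus strand
```

The asymmetric window means the same absolute distance can be accepted on the 5’ side and rejected on the 3’ side.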

Both studies found that the absolute measures of correctness of all programs were quite low, although Hu et al. found that the algorithms which they tested were capable of predicting at least one binding site accurately more than 90% of the time.

MICROBIOLOGY AND MOLECULAR BIOLOGY REVIEWS, Sept. 2009, p. 481–509 Vol. 73, No. 3

Transcription factor binding sites:

•  Sequence based methods

•  Word based methods

•  Phylogenetic footprinting

Sequence based methods

•  Look for consensus sequences that are known to be in promoters

– Promoter elements such as TATA, CCAAT

– Transcription factor binding sites

•  Compare to existing promoters

•  Both require databases, either of promoters, promoter elements or binding sites

Core promoter conserved sequences

•  TATA box: TATAA(AA)

•  Inr box, mammals: Py Py (C) A(+1) N T/A Py Py

   Inr box, Drosophila: T C A(+1) G/T T C/T

•  DPE: (A at +24), A/G at +28, G, A/T, C/T, G/A/T

•  MTE: CSARCSSAACGS

•  BRE: G/C, G/C, G/A, CGCC

•  CpG islands

Where do we get the sequences from?

•  EMSA

•  Fingerprinting

•  In-vitro selection or SELEX (Systematic Evolution of Ligands by EXponential enrichment)

•  High-throughput binding to oligos

•  Computational analysis of suspected co-regulated genes

In-vitro Selection: SELEX

•  Start from a large library of dsDNA oligos (20-30 bp)

•  Allow protein to bind

•  Select sequences that do bind (e.g. column)

•  Release them from binding

•  Amplify the sequences of the binders

•  Repeat 5-15 times

•  Align final sequences

Types of sequence

•  Specific sequence

•  Consensus sequence

– Majority

– IUPAC (degenerate)

•  Position Weight Matrices (PWM)

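Aligned sites:

ACGTCG
TGGTAG
ATGTAG
ATGTAG

Majority consensus: ATGTAG

IUPAC consensus: WBGTMG

Count matrix (rows: nucleotide, columns: position):

    1  2  3  4  5  6
A   3  0  0  0  3  0
C   0  1  0  0  1  0
G   0  1  4  0  0  4
T   1  2  0  4  0  0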
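The three representations of the four aligned sites above can be derived mechanically. The sketch below is illustrative only; the function names are made up here, and ties in the majority consensus are broken by whichever base was seen first, which is an arbitrary choice.

```python
from collections import Counter

# Standard IUPAC degenerate nucleotide codes, keyed by the set of bases they cover.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C", frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("GT"): "K", frozenset("AC"): "M", frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V", frozenset("ACGT"): "N",
}


def count_matrix(sites):
    """Count each nucleotide per alignment column (one Counter per position)."""
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")
    return [Counter(column) for column in zip(*sites)]


def majority_consensus(columns):
    """Most frequent base at each position."""
    return "".join(column.most_common(1)[0][0] for column in columns)


def iupac_consensus(columns):
    """IUPAC code covering every base observed at each position."""
    return "".join(IUPAC[frozenset(column)] for column in columns)


sites = ["ACGTCG", "TGGTAG", "ATGTAG", "ATGTAG"]
columns = count_matrix(sites)
print(majority_consensus(columns))  # ATGTAG
print(iupac_consensus(columns))     # WBGTMG
```

The IUPAC consensus keeps every base ever observed at a position, so it matches far more sequences than the majority consensus does.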

Position Weight Matrix (PWM)

•  Assigns a weight to each possible nucleotide at each position of a binding site

•  The sum of the weights is the site score

•  Basis: different positions in the site contribute independently to binding

•  Can give a score to matches
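A minimal sketch of PWM scoring, using the same four example sites. The weights here are log-odds against a uniform background with a pseudocount of 1; both choices are assumptions made for illustration, and real tools use their own weighting schemes.

```python
import math


def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build log2-odds weights per position from aligned sites.

    Assumes a uniform background and a fixed pseudocount (illustrative choices).
    """
    n = len(sites)
    pwm = []
    for column in zip(*sites):
        weights = {}
        for base in "ACGT":
            freq = (column.count(base) + pseudocount) / (n + 4 * pseudocount)
            weights[base] = math.log2(freq / background)
        pwm.append(weights)
    return pwm


def score_site(pwm, site):
    """Site score = sum of the per-position weights (positions are independent)."""
    return sum(weights[base] for weights, base in zip(pwm, site))


def scan(pwm, sequence, threshold):
    """Return (position, score) for every window scoring at or above threshold."""
    width = len(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        score = score_site(pwm, sequence[start:start + width])
        if score >= threshold:
            hits.append((start, score))
    return hits


pwm = build_pwm(["ACGTCG", "TGGTAG", "ATGTAG", "ATGTAG"])
best = score_site(pwm, "ATGTAG")
print(scan(pwm, "TTTATGTAGTTT", best - 1e-9))
```

Unlike a consensus string, the matrix ranks imperfect matches, so the cutoff becomes a tunable parameter.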

Databases

•  EPD (Eukaryotic Promoter Database)

•  Transfac

•  Matbase

•  JASPAR

•  TRED

•  TFD (Transcription Factor Database)

•  TRRD (Transcription Regulatory Region Database)

•  dbTSS (Database of Transcription Start Sites)

Plant Specific Databases

•  Place

•  PlantCare

•  PlantPromDB

Problems with sequence-based searches

•  Not all binding sites defined

•  Same site might bind different factors in different cell types, stages of development... and vice versa

•  Binding sites are very degenerate

•  Binding sites are very short, can be found at random in many places in the genome

Other Problems...

•  We have to remember that the fact that a binding site exists does NOT mean that it is bound.

•  The sites are short and degenerate, and so appear many times at random in the genome (Inr every 512 bp and TATA every 120 bp)
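To see where a figure like "every 512 bp" can come from, here is a back-of-the-envelope sketch. It assumes equal, independent base frequencies and a single strand, and it reads the mammalian Inr consensus as YYCANWYY (with the optional C included); those are assumptions made here, and the TATA figure is not reproduced by this simple model.

```python
from fractions import Fraction

# Number of bases each IUPAC symbol can stand for.
IUPAC_SIZES = {
    "A": 1, "C": 1, "G": 1, "T": 1,
    "R": 2, "Y": 2, "S": 2, "W": 2, "K": 2, "M": 2,
    "B": 3, "D": 3, "H": 3, "V": 3, "N": 4,
}


def match_probability(consensus):
    """Chance that a random position matches, assuming uniform independent bases."""
    probability = Fraction(1)
    for symbol in consensus:
        probability *= Fraction(IUPAC_SIZES[symbol], 4)
    return probability


def expected_spacing(consensus):
    """Average number of bp between random matches on one strand."""
    return 1 / match_probability(consensus)


print(expected_spacing("YYCANWYY"))  # 512
print(expected_spacing("TATAAA"))    # 4096
```

Scanning both strands roughly halves the spacing for a non-palindromic pattern, and real genomes are not uniform in base composition, so these numbers are order-of-magnitude guides only.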

Other Problems...

•  It’s also important to remember that many transcription factor binding sites are active regardless of the orientation (plus strand or minus strand)

•  Most algorithms are built to deal with single stranded DNA (one direction), while many promoter elements are bidirectional
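Covering both orientations is cheap: search for the motif and for its reverse complement. A minimal sketch with exact string matching (the function names are illustrative):

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(sequence):
    """Reverse complement of an uppercase DNA string."""
    return sequence.translate(COMPLEMENT)[::-1]


def find_both_strands(sequence, motif):
    """Return (position, strand) for exact matches in either orientation.

    Positions are 0-based on the given (plus) strand. A palindromic motif is
    reported once, as a plus-strand hit.
    """
    reverse = reverse_complement(motif)
    hits = []
    for start in range(len(sequence) - len(motif) + 1):
        window = sequence[start:start + len(motif)]
        if window == motif:
            hits.append((start, "+"))
        elif window == reverse:
            hits.append((start, "-"))
    return hits


print(find_both_strands("GGCCAATGGATTGGCC", "CCAAT"))  # [(2, '+'), (9, '-')]
```

The minus-strand CCAAT box shows up as ATTGG on the sequence as written, which a one-direction search would miss.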

Problems

•  Most programs predict 1 promoter per 1000 bp

•  Genes are only 1 per thousands of bp, so many of these predictions must be false positives

•  Programs based on transcription factor density may also find enhancers

Location, Location, Location

The most critical factor in promoter searches is to define the correct upstream region

Location, Location, Location

Stop thinking protein-centric!

•  mRNA does NOT start with ATG!!!! Translation starts with ATG.

•  In more and more genes, we find that the first exon is short and non-coding. The first intron has a tendency to be long (longer than most prediction programs can handle).

•  Make sure you have the complete 5’ end of your gene.

This was the state of the database as of 22.02.04, when we ran the search

But then we zoomed out…

This is what the database looked like one year later:

Three of the sequences were added 03.03.04

Word based methods

•  Try to define differences between promoter and non-promoter regions based on the difference in nucleic acid composition.

•  Some compare between promoter and all other, some define promoter, coding, and other non-coding (intron and UTR).

•  Promoter elements of course, can be found in introns as well...
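A toy version of the word-based idea: count k-mers in a promoter set and in a background set, turn the counts into log-odds, and score a window by summing the log-odds of its words. The sequences, k=3 and the pseudocount are invented for illustration; published programs use longer words and much larger training sets.

```python
import math
from collections import Counter


def kmer_counts(sequences, k):
    """Count all overlapping k-mers in a list of sequences."""
    counts = Counter()
    for sequence in sequences:
        for i in range(len(sequence) - k + 1):
            counts[sequence[i:i + k]] += 1
    return counts


def log_odds_table(promoters, background, k, pseudocount=1.0):
    """log2(frequency in promoters / frequency in background) for each word."""
    p_counts = kmer_counts(promoters, k)
    b_counts = kmer_counts(background, k)
    words = set(p_counts) | set(b_counts)
    p_total = sum(p_counts.values()) + pseudocount * len(words)
    b_total = sum(b_counts.values()) + pseudocount * len(words)
    return {
        word: math.log2(
            ((p_counts[word] + pseudocount) / p_total)
            / ((b_counts[word] + pseudocount) / b_total)
        )
        for word in words
    }


def score_window(window, table, k):
    """Sum of word log-odds; words never seen in training contribute 0."""
    return sum(table.get(window[i:i + k], 0.0) for i in range(len(window) - k + 1))


# Invented toy training data, for illustration only.
promoters = ["TATAAAGGCGCG", "GCGCTATAAAGC"]
background = ["ATGGCTGCTGAA", "CTGGAAGATGCT"]
table = log_odds_table(promoters, background, k=3)
print(score_window("TATAAA", table, 3) > score_window("GCTGAA", table, 3))  # True
```

A positive score means the window's composition looks more like the promoter set than the background; it says nothing about which factor, if any, binds there.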

Phylogenetic Footprinting

•  Terminology borrowed from DNA footprinting, a common method to determine binding sites.

•  Comparison of two or more species to find conserved transcription factor binding sites.

•  Ortholog vs paralog

Phylogenetic Footprinting

•  Based on the theory that promoter elements, like genes, should be conserved between species

•  Alignment of two or more genome sequences (both local and global) to define areas of conserved sequence

•  Further analysis has to be performed to find the promoter related aspects of the comparison

Choosing Organisms

•  Not too close - otherwise too much is conserved

•  Not too far - otherwise not enough is conserved

•  Recent work has shown that between mouse and human there are major differences in functional regulatory sites

What steps are needed?

•  Choosing the sequences to align

•  Alignment of the sequences

•  Finding the binding sites

•  The second and third steps can be done in either order, depending on the algorithm

Sequence Alignment

•  Sequence alignment can be local or global.

•  Some programs require pre-alignment done by the user. Others include the alignment as part of the package.

•  Some programs require masking.

Post-chip

•  Based on the assumption that genes that have similar expression profiles have similar regulation profiles (co-regulation)

•  This assumption is not always true (for example, positional effects)

Post-Chip

•  Two major types of search:

– Search for known binding sites

– Search for new binding sites

•  Some programs do one, the other or both

•  Some programs use the expression profiles, others don’t

There are different types of programs...

•  Packages that do everything

•  Programs that look for known transcription factor binding sites in given percentages of sequences

•  Programs that look for overrepresented binding sites

•  Programs for motif discovery

The biological thinking has to be done by YOU

Databases

EPD (Eukaryotic Promoter Database)

•  Experimentally mapped TSS

•  Surrounding region

•  Curated

TRED

•  Known promoters + predicted promoters

•  Human, mouse, rat

•  Curated

TFD

•  First Public Database

•  Original database last updated in 1993

•  New version available: ooTFD

•  Lacks interface to use the files (original version worked with GCG, new one can be used with SQL query language)

TRRD

•  Hierarchical database with many levels of regulation, from pathways to individual binding sites

•  For most applications, commercial

TransFac

•  Up-to-date database of transcription factor binding sites, consensus sequences, and weight matrices

•  Has programs for search and analysis on site

•  More and more of the site is commercial

MatBase

•  Up-to-date database of transcription factor binding weight matrices

•  Has programs for search and analysis on site

•  Commercial

JASPAR

•  Transcription Factor binding sites, modeled as matrices

•  Open source!

•  Experimental evidence (for the core matrices), includes SELEX

•  Also has phylogenetically extracted elements

•  Curated, but small

dbTSS

•  Transcription start sites taken from sequence (experimental!) data (TSS-seq in various cell lines and conditions)

•  Sources of information: TSS-Seq, mRNA from 5’ enriched libraries, ESTs, or CAGE tags

•  Best for human and mouse, also have info for some other species

Promoter Prediction Programs

NNPP

•  Neural Network Promoter Prediction

•  Trained on TATA and Inr, allowing variable lengths between them

•  Output: Predicted TSS

SignalScan

•  Specific, consensus and matrix searches

•  Based on TFD and TransFac

•  Web, PC and Unix versions available

Audic/Claverie

•  Markov Models based on comparison of EPD sequences to sequences flanking them

•  For given window Bayesian choice is made whether it is promoter or non-promoter

Dragon Promoter Finder

•  Uses ANN (artificial neural networks)

•  First step: Takes a sliding window and determines whether it is CG rich or poor

•  Second step: Puts through three filters (promoter, exon, intron) using pentamer PSSM, gives a score.

•  Third step: ANN

•  Output: prediction of TSS, directional
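The first step described for Dragon Promoter Finder, labelling sliding windows as CG rich or poor, can be illustrated as follows. The window size, step and 50% threshold are placeholders chosen here; the slide does not give the program's actual parameters, and this sketch covers only that first step.

```python
def gc_fraction(sequence):
    """Fraction of G and C bases in a sequence."""
    sequence = sequence.upper()
    return (sequence.count("G") + sequence.count("C")) / len(sequence)


def classify_windows(sequence, window=100, step=50, threshold=0.5):
    """Label each sliding window 'rich' or 'poor' by its G+C fraction.

    window, step and threshold are illustrative values, not those of any
    published program.
    """
    labels = []
    for start in range(0, len(sequence) - window + 1, step):
        fraction = gc_fraction(sequence[start:start + window])
        labels.append((start, "rich" if fraction >= threshold else "poor"))
    return labels


print(classify_windows("GCGCGCGCATATATAT", window=8, step=8))
# [(0, 'rich'), (8, 'poor')]
```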

Promoter2.0

•  Neural Network-Genetic Algorithm

•  Based on conserved sequences and conserved distances between them

•  Discriminates between promoter and non-promoter sequences

•  Output: Predicted TSS

PromFind

•  Intended for finding promoters when gene location is approximately known, strand known

•  Based on differences in hexamer frequencies between promoter, protein coding regions, and noncoding regions downstream of the first exon

•  Finds all possible sites, takes site with highest discrimination between promoter and non-coding as the promoter

•  Output: Only one prediction

PromoterScan

•  Looks at both TATA box weight matrix and density of transcription factor binding sites

•  Compares to promoter recognition profile derived from comparison of promoter to non-promoter Primate sequences (takes the sites from TFD, checks frequency in EPD)

•  Output: Either TSS or 250 bp window representing the core promoter

PromoterScan

•  Latest version includes option to do further analyses:

– Compare to EPD to find similar promoter

– Provide a list of binding sites common to the predicted promoter and promoters in EPD

TSSW/TSSG/TSSP

•  Linear discriminant function based on:

– TATA box score

– Triplet preferences around TSS

– Hexamer frequencies in consecutive upstream 100-bp regions

– Transcription Factor binding sites

•  W - TFD, G - TransFac, P - Plant

•  Output: promoter predictions, list of transcriptional elements

CorePromoter

•  Quadratic discriminant analysis

– Pentamers in 30 bp windows and 45 bp windows in a 240 bp region

•  Output: Predicted TSS

Tess

•  String Search, Filtered String Search and Weight matrices

•  Based on Transfac

•  Allows mismatches

•  Various cutoffs available

•  Filtering is important (by organism, cell type…)

•  For matrix searches, 3 classes (vertebrate, non-vertebrate, and fungi)

TFsearch

•  Weight matrices

Match/Patch

•  What's left of what was once available with Transfac

•  Uses weight matrices

•  Has some tissue specific profiles

Genomatix

•  MatInspector

– Core and matrix cutoffs

– Organism classes

– Uses MatBase matrices

•  FastM - builds a module

– Number of elements

– Transcription factor binding sites

– Distance between them

PromoterInspector

•  Focuses on the genetic context of promoters, not their exact location

•  Utilizes an unsupervised learning approach

•  Compares word (IUPAC groups) frequencies between promoters, exons, introns and 3’UTR

•  Uses sliding windows which are classified as above; a region is reported if a number of consecutive windows have the same classification

•  Always predicts on both strands - context!

PromFD

•  Residue composition

•  Find exact promoter (doesn’t specify strands)

•  Algorithm:

– Words of different lengths (5-10 bp) over-represented in promoter vs non-promoter

– Search against weight matrices, again comparing frequency in promoter vs non-promoter

– Results input into PromFD database

– Input sequences are searched against this database

ConPro (Consensus Promoter)

•  Align mRNA or EST with genomic sequence

•  Use Genscan to predict missing 5’ region (70kb)

•  Region upstream of 5’ end chosen (1.5 kb)

•  Run TSSG, TSSW, Proscan, PromFD, NNPP

•  Results compared to create consensus
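The slide does not say how ConPro merges the individual predictions, so the following is a purely hypothetical illustration of one way to build a consensus: group predicted TSS positions that lie close together and keep only groups supported by several programs. The tolerance, the support threshold, the program labels and the positions are all invented.

```python
def consensus_tss(predictions, tolerance=50, min_programs=2):
    """Cluster predicted TSS positions and keep clusters backed by several programs.

    predictions: dict mapping program name -> list of predicted positions.
    Positions are chained into a cluster while consecutive positions are at most
    `tolerance` bp apart (single-linkage). Hypothetical scheme, not ConPro's.
    """
    tagged = sorted(
        (position, program)
        for program, positions in predictions.items()
        for position in positions
    )
    clusters = []
    for position, program in tagged:
        if clusters and position - clusters[-1][-1][0] <= tolerance:
            clusters[-1].append((position, program))
        else:
            clusters.append([(position, program)])

    consensus = []
    for cluster in clusters:
        programs = sorted({program for _, program in cluster})
        if len(programs) >= min_programs:
            positions = [position for position, _ in cluster]
            consensus.append((round(sum(positions) / len(positions)), programs))
    return consensus


# Invented example predictions.
example = {"TSSG": [1000, 3000], "NNPP": [1020], "PromFD": [1040, 5000]}
print(consensus_tss(example))  # [(1020, ['NNPP', 'PromFD', 'TSSG'])]
```

Because of the chaining, a long run of closely spaced predictions can merge into one wide cluster, which may or may not be desirable.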

ConPro

GeneSpring

•  Can ask the program to take x number of bases upstream and look for a particular binding site

Grail

•  Can’t separate the promoter prediction module, part of gene prediction

•  Needs coding region (without, can’t find anything)

•  Scores for TATA (must have), GC, CAAT, cap site, translation start

•  Uses matrices and Neural network

Phylogenetic footprinting programs

Local Alignment Programs

•  Blast-z (Pipmaker, zpicture)

•  Blat

•  Dialign

Global Alignment Programs

•  AVID (mVista)

– Seed: exact match

•  Lagan

– Seed: doesn’t require exact match

Consite

•  Uses DPB global alignment

•  Local scanning for conserved segments

•  Scan conserved segments for conserved binding sites (Ann-spec, Jaspar)

•  Can also input existing alignment, or profile of choice

Conreal

•  Starts by searching for transcription factor binding sites

•  Pairwise comparison of hits and flanking regions

•  Sorts list of hits by homology

•  Anchors hits on sequence, throwing out overlaps to existing hits (assumes order)

PromH

•  Based on TSSW

•  Added to the linear discriminant function: measure of conservation in several points around predicted TSS

•  Better for TATA containing promoters than TATA-less

Trafac

•  Blast-z (pipmaker)

•  Match or MatInspector

•  Can search database of existing alignments

•  If you register, you can enter your own sequence (both phylogenetic and coregulated)

Footprinter

•  Uses dynamic programming

•  Requires both sequences and phylogenetic tree relating them

•  Identifies motifs that mutate at a slower rate than the sequence surrounding them (finds the motifs)

•  Works best with large groups of sequence

rVISTA

•  Based on phylogenetic footprinting

•  Input: a global multiple alignment of two or more sequences (mVISTA or mAVID)

•  Output: a viewer where the user can visualize the predicted transcription factor binding sites on the background of sequence conservation

rVISTA

1)  Potential transcription factor binding sites are predicted based on TRANSFAC for both human and mouse sequences independently. Only the hits where core positions of the human and mouse potential binding sites correspond are called aligned hits. A qualifying aligned hit is allowed a maximum core shift of 6 basepairs (bp), and only one gap of any length inside it.

rVISTA

2) Human-mouse sequence conservation of a DNA region spanning a transcription factor binding site is assessed using a strategy that identifies the maximal percent identity for the DNA fragment surrounding the core of a binding site by allowing a dynamic shift. Only predicted binding sites located in the sequence fragments conserved at the level of over 80% over 24 bp window were selected. These are conserved hits.
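The conservation filter in step 2 (over 80% identity across a 24 bp window) can be approximated with a plain sliding window over a pairwise alignment. This sketch is a simplification: it does not implement the dynamic shift around the binding-site core that the text describes, and the function name and gap handling are assumptions.

```python
def conserved_windows(aligned_a, aligned_b, window=24, min_identity=0.8):
    """Return start columns of alignment windows with identity above min_identity.

    Both inputs are rows of the same pairwise alignment ('-' marks a gap).
    A column counts as identical only if both bases match and neither is a gap.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    matches = [int(a == b and a != "-") for a, b in zip(aligned_a, aligned_b)]
    hits = []
    running = sum(matches[:window])
    for start in range(len(matches) - window + 1):
        if start > 0:
            # Slide the window by one column: add the new column, drop the old one.
            running += matches[start + window - 1] - matches[start - 1]
        if running / window > min_identity:
            hits.append(start)
    return hits


human = "ACGT" * 6 + "A" * 12
mouse = "ACGT" * 6 + "C" * 12
print(conserved_windows(human, mouse))  # [0, 1, 2, 3, 4]
```

A predicted binding site would then be kept only if it falls inside one of the reported windows.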

Promoters in groups of genes

WebMOTIFs

•  Runs multiple de novo motif discovery programs

– Puts all the results together

•  Runs Bayesian motif discovery program

– Has models for transcription factor families

PROMO (multi-search site)

•  Based on Transfac Matrices

•  Can be limited by species

•  Derives the Matrices from scratch

•  Can find hits in all or part of a search set

OTFBS

•  Looks for over represented binding sites in a group of sequences

•  Uses Transfac matrices

•  Uses MatInspector to find candidate binding sites

Toucan

•  Suite of programs:

•  Automated upstream retrieval

– Ensembl based

– Can also use user sequence

•  Find known transcription factor binding sites

– Runs Motif Scanner (Markov Models)

Toucan

•  Identify putative regulatory regions

– Runs AVID/VISTA

•  Find new sites

– Runs MotifSampler (Gibbs based)

•  Find overrepresented sites

– binomial distribution model

•  Finds Modules (using ModuleSearcher)

– Two methods: A*, Genetic Algorithm
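A binomial model for over-representation asks: if a site turns up in a background sequence with probability p, how surprising is it to see it in at least k of n co-regulated promoters? The sketch below computes that tail probability; how Toucan estimates p or corrects for multiple testing is not stated on the slide, so those details are left out.

```python
from math import comb


def binomial_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p).

    k: number of sequences in the set that contain the site
    n: total number of sequences in the set
    p: background probability that a sequence contains the site (assumed known)
    """
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Site found in 8 of 10 promoters, expected in 10% of background sequences.
print(binomial_tail(8, 10, 0.1))
```

A small tail probability suggests the site is enriched in the set, subject to the usual caveat that many sites are tested at once.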

Amadeus

•  Does de novo binding site finding

•  Can work on precomputed groups, or can work on raw microarray data

•  Can compare to known binding sites

Prima

•  Part of Expander

•  Based on Transfac Matrices

•  Looks for over represented Matrices in a group of coregulated sequences

•  Compares to a background model

Reduce

•  For Yeast and Drosophila (maybe good for others as well)

•  The expression level of a gene is modeled as a sum of contributions of all binding sites in the promoter region

•  Identifies motifs, and does regression analysis to infer the activity.
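The additive model can be illustrated for the simplest case of a single motif: regress each gene's log expression on the number of motif copies in its promoter, and read the slope as the motif's activity. This is a plain least-squares sketch with invented numbers, not the program's actual procedure, which handles many motifs at once.

```python
def fit_motif_activity(motif_counts, log_expression):
    """Least-squares fit of log_expression = baseline + activity * motif_count.

    Returns (baseline, activity). Single-motif simplification of the additive model.
    """
    n = len(motif_counts)
    mean_x = sum(motif_counts) / n
    mean_y = sum(log_expression) / n
    sxx = sum((x - mean_x) ** 2 for x in motif_counts)
    if sxx == 0:
        raise ValueError("motif counts must vary across genes")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(motif_counts, log_expression))
    activity = sxy / sxx
    baseline = mean_y - activity * mean_x
    return baseline, activity


# Invented data: four genes with 0-3 copies of the motif in their promoters.
baseline, activity = fit_motif_activity([0, 1, 2, 3], [0.1, 0.6, 1.1, 1.6])
print(round(baseline, 3), round(activity, 3))  # 0.1 0.5
```

A positive slope suggests the motif is associated with higher expression in this condition, a negative one with lower expression.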

R-Motif

•  Utilizes Expression Coherence

•  Motif Characterization

•  Motif Refinement (extension and mutation)

•  De Novo Motif finding

•  Currently only with Yeast Data

Frameworker

•  Part of Genomatix

•  Uses MatBase matrices

•  Looks for binding sites in similar order in a given percentage of your dataset