Computational Approaches to Promoter Analysis
Part II
Shifra Ben-Dor, June 2010
First define your question
• Are you looking for the transcription start site?
• Are you interested in transcription factor binding sites?
• Are you interested in proximal or distal signals?
• Locate TSS
• Identify transcription factor binding sites
• Characterize regulatory properties
TSS identification
• There are programs that attempt to identify TSS
• These are based on different algorithms
• The best is to use whatever transcriptional information is available
• Experts Pick: UCSC Genome Browser
Fickett JW, Hatzigeorgiou AG. Genome Res 1997 Sep;7(9):861-78
Evaluation of the ability of several programs to predict the TSS. If a program reports no TSS, the 3’ end of the predicted promoter region is used. A prediction counts as correct if it lies within 200 bp 5’ or 100 bp 3’ of the actual TSS.
Both studies found that the absolute measures of correctness of all programs were quite low, although Hu et al. found that the algorithms which they tested were capable of predicting at least one binding site accurately more than 90% of the time.
MICROBIOLOGY AND MOLECULAR BIOLOGY REVIEWS, Sept. 2009, p. 481–509 Vol. 73, No. 3
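The correctness criterion used in the Fickett and Hatzigeorgiou evaluation (within 200 bp 5’ or 100 bp 3’ of the actual TSS) can be written as a small check. This is a minimal sketch: the function name, the strand handling and the inclusive boundaries are assumptions made for illustration, not details taken from the paper.

```python
def is_correct_prediction(predicted, actual_tss, strand="+",
                          upstream=200, downstream=100):
    """Return True if a predicted TSS falls inside the tolerance window.

    Positions are genomic coordinates. The window is asymmetric:
    `upstream` bp on the 5' side and `downstream` bp on the 3' side.
    """
    # Signed offset in the direction of transcription:
    # negative means 5' (upstream) of the real TSS.
    if strand == "+":
        offset = predicted - actual_tss
    else:
        offset = actual_tss - predicted
    return -upstream <= offset <= downstream


print(is_correct_prediction(850, 1000))               # 150 bp upstream
print(is_correct_prediction(1150, 1000))              # 150 bp downstream
print(is_correct_prediction(1150, 1000, strand="-"))  # upstream on minus strand
```

On the minus strand a higher coordinate is upstream, which is why the same position can pass on one strand and fail on the other.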
Transcription factor binding sites:
• Sequence based methods
• Word based methods
• Phylogenetic footprinting
Sequence based methods
• Look for consensus sequences that are known to be in promoters
– Promoter elements such as TATA, CCAAT
– Transcription Factor binding sites
• Compare to existing promoters
• Both require databases, either of promoters, promoter elements or binding sites
Core promoter conserved sequences
• TATA box: TATAA(AA)
• Inr Box
– Mammals: Py Py (C) A+1 N T/A Py Py
– Drosophila: T C A+1 G/T T C/T
• DPE: (A+24) A/G+28 G, A/T, C/T, G/A/T
• MTE: CSARCSSAACGS
• BRE: G/C, G/C, G/A, CGCC
• CpG Islands
Where do we get the sequences from?
• EMSA
• Fingerprinting
• In-vitro selection or SELEX (Systematic Evolution of Ligands by EXponential enrichment)
• High-throughput binding to oligos
• Computational analysis of suspected co-regulated genes
In-vitro Selection: SELEX
• Start from a large library of dsDNA oligos (20-30 bp)
• Allow protein to bind
• Select sequences that do bind (e.g. column)
• Release them from binding
• Amplify the sequences of the binders
• Repeat 5-15 times
• Align final sequences
Types of sequence
• Specific sequence
• Consensus sequence
– Majority
– IUPAC (degenerate)
• Position Weight Matrices (PWM)
Example binding sites: ACGTCG, TGGTAG, ATGTAG, ATGTAG
Majority consensus: ATGTAG
IUPAC consensus: WBGTMG
Count matrix:
     1  2  3  4  5  6
A    3  0  0  0  3  0
C    0  1  0  0  1  0
G    0  1  4  0  0  4
T    1  2  0  4  0  0
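The three representations in the example can be derived mechanically from the four aligned sites. The sketch below is illustrative only (the function names are made up here) and uses the standard IUPAC nucleotide codes.

```python
from collections import Counter

SITES = ["ACGTCG", "TGGTAG", "ATGTAG", "ATGTAG"]

# IUPAC codes keyed by the sorted set of bases each code stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "AG": "R", "CT": "Y", "CG": "S", "AT": "W", "GT": "K", "AC": "M",
    "CGT": "B", "AGT": "D", "ACT": "H", "ACG": "V", "ACGT": "N",
}


def count_matrix(sites):
    """Count each base at each aligned position."""
    columns = [Counter(col) for col in zip(*sites)]
    return {base: [c[base] for c in columns] for base in "ACGT"}


def majority_consensus(sites):
    """Most frequent base per position (ties are not handled specially)."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*sites))


def iupac_consensus(sites):
    """Degenerate consensus covering every base seen at each position."""
    return "".join(IUPAC["".join(sorted(set(col)))] for col in zip(*sites))


print(majority_consensus(SITES))  # ATGTAG
print(iupac_consensus(SITES))     # WBGTMG
for base, row in count_matrix(SITES).items():
    print(base, row)
```

The majority consensus discards the variation at positions 1, 2 and 5, the IUPAC consensus records which bases occur but not how often, and only the matrix keeps the frequencies.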
Position Weight Matrix (PWM)
• Assigns a weight to each possible nucleotide at each position of a binding site
• The sum of the weights is the site score
• Basis: different positions in the site contribute independently to binding
• Can give a score to matches
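One common way to turn counts into weights is a log-odds score against a background base frequency. The sketch below reuses the count matrix of the four example sites (ACGTCG, TGGTAG, ATGTAG, ATGTAG). The pseudocount of 1, the uniform background of 0.25 and the threshold are assumptions chosen for illustration; real tools differ in how they weight and threshold.

```python
import math

# Count matrix of the four example sites.
COUNTS = {
    "A": [3, 0, 0, 0, 3, 0],
    "C": [0, 1, 0, 0, 1, 0],
    "G": [0, 1, 4, 0, 0, 4],
    "T": [1, 2, 0, 4, 0, 0],
}


def log_odds_matrix(counts, pseudocount=1.0, background=0.25):
    """Convert counts to log2(frequency / background) weights."""
    n_sites = sum(counts[b][0] for b in "ACGT")
    total = n_sites + 4 * pseudocount
    return {
        b: [math.log2((c + pseudocount) / total / background) for c in counts[b]]
        for b in "ACGT"
    }


def score_site(pwm, site):
    """Site score = sum of the per-position weights (independence assumption)."""
    return sum(pwm[base][i] for i, base in enumerate(site))


def scan(pwm, sequence, threshold):
    """Return (start, score) for every window scoring at least `threshold`."""
    width = len(pwm["A"])
    hits = []
    for start in range(len(sequence) - width + 1):
        score = score_site(pwm, sequence[start:start + width])
        if score >= threshold:
            hits.append((start, score))
    return hits


pwm = log_odds_matrix(COUNTS)
print(round(score_site(pwm, "ATGTAG"), 2))
print(scan(pwm, "GGATGTAGCC", threshold=6.0))
```

Unlike a consensus match, the score is graded, so a site with one unusual base still scores above an unrelated sequence.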
Databases
• EPD (Eukaryotic Promoter Database)
• TransFac
• MatBase
• JASPAR
• TRED
• TFD (Transcription Factor Database)
• TRRD (Transcription Regulatory Region Database)
• dbTSS (Database of Transcription Start Sites)
Plant Specific Databases
• PLACE
• PlantCARE
• PlantProm DB
Problems with sequence-based searches
• Not all binding sites defined
• Same site might bind different factors in different cell types, stages of development… and vice versa
• Binding sites are very degenerate
• Binding sites are very short, can be found at random in many places in the genome
Other Problems…
• We have to remember that the fact that a binding site exists does NOT mean that it is bound.
• The sites are short and degenerate, and so appear many times at random in the genome (Inr every 512 bp and TATA every 120 bp)
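The order of magnitude of such figures can be checked with a back-of-the-envelope calculation: under a uniform, independent base model, a degenerate consensus matches a given position with probability equal to the product of (allowed bases / 4) over its positions. The sketch below assumes the mammalian Inr is written in IUPAC as YYCANWYY, with the optional C treated as required. The TATA figure depends on how loose a match is allowed and on base composition, so it is not reproduced here.

```python
from fractions import Fraction

# Number of bases each IUPAC code allows.
IUPAC_SIZE = {
    "A": 1, "C": 1, "G": 1, "T": 1,
    "R": 2, "Y": 2, "S": 2, "W": 2, "K": 2, "M": 2,
    "B": 3, "D": 3, "H": 3, "V": 3, "N": 4,
}


def expected_spacing(consensus):
    """Expected bp between random matches on one strand (uniform base model)."""
    probability = Fraction(1)
    for code in consensus:
        probability *= Fraction(IUPAC_SIZE[code], 4)
    return 1 / probability


print(expected_spacing("YYCANWYY"))  # 512
print(expected_spacing("TATAA"))     # 1024
```

Real genomes are not uniform in base composition, so this is only a rough guide to how often a short motif turns up by chance.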
Other Problems…
• It’s also important to remember that many transcription factor binding sites are active regardless of the orientation (plus strand or minus strand)
• Most algorithms are built to deal with single stranded DNA (one direction), while many promoter elements are bidirectional
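Handling orientation usually means searching for the reverse complement of the motif as well as the motif itself. A minimal exact-match sketch (the function names and the toy sequence are made up for illustration):

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the sequence of the opposite strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def find_both_strands(sequence, motif):
    """Return (start, strand) for exact matches on either strand.

    Start positions are 0-based coordinates on the given (plus) strand.
    Palindromic motifs are reported once, as "+".
    """
    rc = reverse_complement(motif)
    hits = []
    for start in range(len(sequence) - len(motif) + 1):
        window = sequence[start:start + len(motif)]
        if window == motif:
            hits.append((start, "+"))
        elif window == rc:
            hits.append((start, "-"))
    return hits


print(find_both_strands("GGCCAATGGATTGGCC", "CCAAT"))
# [(2, '+'), (9, '-')]
```

A search that only looks for CCAAT on the given strand would miss the ATTGG occurrence, which is the same box in the opposite orientation.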
Problems
• Most programs predict about one promoter per 1000 bp
• Genes are found only about once in thousands of bp
• Programs based on transcription factor density may also find enhancers
Location, Location, Location
The most critical factor in promoter searches is to define the correct upstream region
Location, Location, Location
Stop thinking protein-centric!
• mRNA does NOT start with ATG!!!! Translation starts with ATG.
• In more and more genes, we find that the first exon is short and non-coding. The first intron has a tendency to be long (longer than most prediction programs can handle).
• Make sure you have the complete 5’ end of your gene.
This was the state of the database as of 22.02.04, when we ran the search
But then we zoomed out…
This is what the database looked like one year later:
Three of the sequences were added 03.03.04
Word based methods
• Try to define differences between promoter and non-promoter regions based on the difference in nucleic acid composition.
• Some compare between promoter and all other, some define promoter, coding, and other non-coding (intron and UTR).
• Promoter elements, of course, can be found in introns as well…
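The word-based idea can be sketched as a table of log ratios of word (k-mer) frequencies in promoter versus non-promoter training sequences, which is then summed over a query window. Everything specific below (the toy training sequences, k = 2, the pseudocount, scoring unseen words as 0) is an assumption for illustration. Published programs use longer words such as pentamers or hexamers and real training sets.

```python
import math
from collections import Counter


def kmer_counts(sequences, k):
    """Count all overlapping words of length k in a set of sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts, sum(counts.values())


def log_ratio_table(promoters, background, k, pseudocount=1.0):
    """log(frequency in promoters / frequency in background) for each word."""
    p_counts, p_total = kmer_counts(promoters, k)
    b_counts, b_total = kmer_counts(background, k)
    n_words = 4 ** k
    table = {}
    for word in set(p_counts) | set(b_counts):
        p = (p_counts[word] + pseudocount) / (p_total + pseudocount * n_words)
        b = (b_counts[word] + pseudocount) / (b_total + pseudocount * n_words)
        table[word] = math.log(p / b)
    return table


def score_window(table, window, k):
    """Sum of word log ratios; words never seen in training count as 0."""
    return sum(table.get(window[i:i + k], 0.0)
               for i in range(len(window) - k + 1))


# Toy training data, invented for this example.
promoters = ["GCGCGCGCGC", "CGCGCGGCGC"]
background = ["ATATATATAT", "TTATAATATA"]
table = log_ratio_table(promoters, background, k=2)
print(score_window(table, "GCGCGC", 2) > 0)  # promoter-like composition
print(score_window(table, "ATATAT", 2) < 0)  # background-like composition
```

A positive score means the window's composition looks more like the promoter set than the background set; it says nothing about which specific elements are present.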
Phylogenetic Footprinting
• Terminology borrowed from DNA footprinting, a common method to determine binding sites.
• Comparison of two or more species to find conserved transcription factor binding sites.
• Ortholog vs paralog
Phylogenetic Footprinting
• Based on the theory that promoter elements, like genes, should be conserved between species
• Alignment of two or more genome sequences (both local and global) to define areas of conserved sequence
• Further analysis has to be performed to find the promoter related aspects of the comparison
Choosing Organisms
• Not too close - otherwise too much is conserved
• Not too far - otherwise not enough is conserved
• Recent work has shown that between mouse and human there are major differences in functional regulatory sites
What steps are needed?
• Choosing the sequences to align
• Alignment of the sequences
• Finding the binding sites
• The second and third steps can be done in either order, depending on the algorithm
Sequence Alignment
• Sequence alignment can be local or global.
• Some programs require pre-alignment done by the user. Others include the alignment as part of the package.
• Some programs require masking.
Post-chip
• Based on the assumption that genes that have similar expression profiles have similar regulation profiles (co-regulation)
• This assumption is not always true (for example, positional effects)
Post-Chip
• Two major types of search:
– Search for known binding sites
– Search for new binding sites
• Some programs do one, the other or both
• Some programs use the expression profiles, others don’t
There are different types of programs…
• Packages that do everything
• Programs that look for known transcription factor binding sites in given percentages of sequences
• Programs that look for overrepresented binding sites
• Programs for motif discovery
The biological thinking has to be done by YOU
Databases
EPD (Eukaryotic Promoter Database)
• Experimentally mapped TSS
• Surrounding region
• Curated
TRED
• Known promoters + predicted promoters
• Human, mouse, rat
• Curated
TFD
• First Public Database
• Original database last updated in 1993
• New version available: ooTFD
• Lacks interface to use the files (original version worked with GCG, new one can be used with SQL query language)
TRRD
• Hierarchical database with many levels of regulation, from pathways to individual binding sites
• For most applications, commercial
TransFac
• Up-to-date database of transcription factor binding sites, consensus sequences, and weight matrices
• Has programs for search and analysis on site
• More and more of the site is commercial
MatBase
• Up-to-date database of transcription factor binding weight matrices
• Has programs for search and analysis on site
• Commercial
JASPAR
• Transcription Factor binding sites, modeled as matrices
• Open source!
• Experimental evidence (for the core matrices), includes SELEX
• Also has phylogenetically extracted elements
• Curated, but small
dbTSS
• Transcription start sites taken from sequence (experimental!) data (TSS-seq in various cell lines and conditions)
• Sources of information: TSS-Seq, mRNA from 5’ enriched libraries, ESTs, or CAGE tags
• Best for human and mouse, also have info for some other species
Promoter Prediction Programs
NNPP
• Neural Network Promoter Prediction
• Trained on TATA and Inr, allowing variable lengths between them
• Output: Predicted TSS
SignalScan
• Specific, consensus and matrix searches
• Based on TFD and TransFac
• Web, PC and Unix versions available
Audic/Claverie
• Markov Models based on comparison of EPD sequences to sequences flanking them
• For a given window, a Bayesian choice is made whether it is promoter or non-promoter
Dragon Promoter Finder
• Uses ANN (artificial neural networks)
• First step: Takes a sliding window and determines whether it is CG rich or poor
• Second step: Puts through three filters (promoter, exon, intron) using pentamer PSSM, gives a score.
• Third step: ANN
• Output: prediction of TSS, directional
Promoter2.0
• Neural Network-Genetic Algorithm
• Based on conserved sequences and conserved distances between them
• Discriminates between promoter and non-promoter sequences
• Output: Predicted TSS
PromFind
• Intended for finding promoters when gene location is approximately known, strand known
• Based on differences in hexamer frequencies between promoter, protein coding regions, and noncoding regions downstream of the first exon
• Finds all possible sites, takes site with highest discrimination between promoter and non-coding as the promoter
• Output: Only one prediction
PromoterScan
• Looks at both TATA box weight matrix and density of transcription factor binding sites
• Compares to promoter recognition profile derived from comparison of promoter to non-promoter Primate sequences (takes the sites from TFD, checks frequency in EPD)
• Output: Either TSS or 250 bp window representing the core promoter
PromoterScan
• Latest version includes option to do further analyses:
– Compare to EPD to find similar promoter
– Provide a list of binding sites common to the predicted promoter and promoters in EPD
TSSW/TSSG/TSSP
• Linear discriminant function based on:
– TATA box score
– Triplet preferences around TSS
– Hexamer frequencies in consecutive upstream 100-bp regions
– Transcription Factor binding sites
• W - TFD, G - TransFac, P - Plant
• Output: promoter predictions, list of transcriptional elements
CorePromoter
• Quadratic discriminant analysis
– Pentamers in 30 bp windows and 45 bp windows in a 240 bp region
• Output: Predicted TSS
Tess
• String Search, Filtered String Search and Weight matrices
• Based on Transfac
• Allows mismatches
• Various cutoffs available
• Filtering is important (by organism, cell type…)
• For matrix searches, 3 classes (vertebrate, non-vertebrate, and fungi)
TFsearch
• Weight matrices
Match/Patch
• What’s left of what was once available with Transfac
• Uses weight matrices
• Has some tissue specific profiles
Genomatix
• MatInspector
– Core and matrix cutoffs
– Organism classes
– Uses MatBase matrices
• FastM - builds a module
– Number of elements
– Transcription factor binding sites
– Distance between them
PromoterInspector
• Focuses on the genetic context of promoters, not their exact location
• Utilizes an unsupervised learning approach
• Compares word (IUPAC groups) frequencies between promoters, exons, introns and 3’UTR
• Uses sliding windows which are classified as above; a region is reported if a number of consecutive windows have the same classification
• Always predicts on both strands - context!
PromFD
• Residue composition
• Find exact promoter (doesn’t specify strands)
• Algorithm:
– Words of different lengths (5-10 bp) over-represented in promoter vs non-promoter
– Search against weight matrices, again comparing frequency in promoter vs non-promoter
– Results input into PromFD database
– Input sequences are searched against this database
ConPro (Consensus Promoter)
• Align mRNA or EST with genomic sequence
• Use Genscan to predict missing 5’ region (70kb)
• Region upstream of 5’ end chosen (1.5 kb)
• Run TSSG, TSSW, Proscan, PromFD, NNPP
• Results compared to create consensus
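How ConPro builds its consensus is not spelled out here, so the sketch below only illustrates the general idea of combining several predictors: pool the predicted TSS positions, group those that lie close together, and keep groups supported by more than one program. The grouping rule, the 100 bp tolerance, the minimum support of 2 and the example positions are all assumptions.

```python
def consensus_predictions(predictions, tolerance=100, min_support=2):
    """Group nearby predictions and keep groups backed by several programs.

    predictions: dict mapping program name -> list of predicted TSS positions.
    Returns a list of (first_position, last_position, programs) tuples.
    """
    tagged = sorted((pos, prog)
                    for prog, positions in predictions.items()
                    for pos in positions)

    # Single-linkage grouping: start a new cluster when the gap to the
    # previous prediction exceeds the tolerance.
    clusters, current = [], []
    for pos, prog in tagged:
        if current and pos - current[-1][0] > tolerance:
            clusters.append(current)
            current = []
        current.append((pos, prog))
    if current:
        clusters.append(current)

    result = []
    for cluster in clusters:
        programs = sorted({prog for _, prog in cluster})
        if len(programs) >= min_support:
            positions = [pos for pos, _ in cluster]
            result.append((min(positions), max(positions), programs))
    return result


# Hypothetical predictions, invented for this example.
example = {
    "TSSG": [1200, 5000],
    "TSSW": [1230],
    "NNPP": [1290, 9000],
    "PromFD": [3000],
}
print(consensus_predictions(example))
# [(1200, 1290, ['NNPP', 'TSSG', 'TSSW'])]
```

Isolated predictions made by a single program are dropped, which trades some sensitivity for fewer false positives.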
GeneSpring
• Can ask the program to take x number of bases upstream and look for a particular binding site
Grail
• Can’t separate the promoter prediction module, part of gene prediction
• Needs coding region (without, can’t find anything)
• Scores for TATA (must have), GC, CAAT, cap site, translation start
• Uses matrices and Neural network
Phylogenetic footprinting programs
Local Alignment Programs
• Blast-z (Pipmaker, zpicture)
• Blat
• Dialign
Global Alignment Programs
• AVID (mVista)
– Seed: exact match
• Lagan
– Seed: doesn’t require exact match
Consite
• Uses DPB global alignment
• Local scanning for conserved segments
• Scan conserved segments for conserved binding sites (Ann-spec, Jaspar)
• Can also input existing alignment, or profile of choice
Conreal
• Starts by searching for transcription factor binding sites
• Pairwise comparison of hits and flanking regions
• Sorts list of hits by homology
• Anchors hits on sequence, throwing out overlaps to existing hits (assumes order)
PromH
• Based on TSSW
• Added to the linear discriminant function: measure of conservation in several points around predicted TSS
• Better for TATA containing promoters than TATA-less
Trafac
• Blast-z (pipmaker)
• Match or MatInspector
• Can search database of existing alignments
• If you register, you can enter your own sequence (both phylogenetic and coregulated)
Footprinter
• Uses dynamic programming
• Requires both sequences and phylogenetic tree relating them
• Identifies motifs that mutate at a slower rate than the sequence surrounding them (finds the motifs)
• Works best with large groups of sequence
rVISTA
• Based on phylogenetic footprinting
• Input: a global multiple alignment of two or more sequences (mVISTA or mAVID)
• Output: a viewer where the user can visualize the predicted transcription factor binding sites on the background of sequence conservation
rVISTA
1) Potential transcription factor binding sites are predicted based on TRANSFAC for both human and mouse sequences independently. Only the hits where core positions of the human and mouse potential binding sites correspond are called aligned hits. A qualifying aligned hit is allowed a maximum core shift of 6 basepairs (bp), and only one gap of any length inside it.
rVISTA
2) Human-mouse sequence conservation of a DNA region spanning a transcription factor binding site is assessed using a strategy that identifies the maximal percent identity for the DNA fragment surrounding the core of a binding site by allowing a dynamic shift. Only predicted binding sites located in sequence fragments conserved at a level of over 80% over a 24 bp window are selected. These are conserved hits.
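The conservation filter can be illustrated with a plain sliding window over a pairwise alignment. This is a simplified sketch, not the rVISTA implementation: it scans every window rather than shifting around a binding site core, and counting gap columns as mismatches is an assumption. The 24 bp window and the "over 80%" cutoff come from the description above.

```python
def conserved_windows(aligned_a, aligned_b, window=24, min_identity=0.8):
    """Return (start, identity) for alignment windows above the identity cutoff.

    Both inputs are rows of the same pairwise alignment, with "-" for gaps.
    Start positions are 0-based alignment columns.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")

    # A column counts as a match only if both rows carry the same base.
    matches = [a == b and a != "-" for a, b in zip(aligned_a, aligned_b)]

    hits = []
    for start in range(len(matches) - window + 1):
        identity = sum(matches[start:start + window]) / window
        if identity > min_identity:  # "over 80%", so strictly greater
            hits.append((start, identity))
    return hits


# Toy alignment: 24 identical columns followed by 24 mismatched columns.
human = "ACGT" * 6 + "A" * 24
mouse = "ACGT" * 6 + "C" * 24
print([start for start, _ in conserved_windows(human, mouse)])
# [0, 1, 2, 3, 4]
```

A predicted binding site would then be kept only if it falls inside one of the reported windows.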
Promoters in groups of genes
WebMOTIFs
• Runs multiple de novo motif discovery programs
– Puts all the results together
• Runs Bayesian motif discovery program
– Has models for transcription factor families
PROMO (multi-search site)
• Based on Transfac Matrices
• Can be limited by species
• Derives the Matrices from scratch
• Can find hits in all or part of a search set
OTFBS
• Looks for over represented binding sites in a group of sequences
• Uses Transfac matrices
• Uses MatInspector to find candidate binding sites
Toucan
• Suite of programs:
• Automated upstream retrieval
– Ensembl based
– Can also use user sequence
• Find known transcription factor binding sites
– Runs Motif Scanner (Markov Models)
Toucan
• Identify putative regulatory regions
– Runs AVID/VISTA
• Find new sites
– Runs MotifSampler (Gibbs based)
• Find overrepresented sites
– binomial distribution model
• Finds Modules (using ModuleSearcher)
– Two methods: A*, Genetic Algorithm
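A binomial model for over-representation asks: if a site occurs in a fraction p of background promoters, how likely is it to see it in at least k of n co-regulated promoters by chance? The sketch below shows only that general calculation. How Toucan estimates the background frequency and corrects for multiple testing is not covered, and the numbers in the example are invented.

```python
from math import comb


def binomial_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))


# Hypothetical case: the site is found in 8 of 20 co-regulated promoters,
# while it occurs in about 10% of background promoters.
p_value = binomial_tail(8, 20, 0.10)
print(f"{p_value:.2e}")
```

A small tail probability suggests the site is enriched in the group, but many matrices are usually tested at once, so the raw value overstates the significance.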
Amadeus
• Does de novo binding site finding
• Can work on precomputed groups, or can work on raw microarray data
• Can compare to known binding sites
Prima
• Part of Expander
• Based on Transfac Matrices
• Looks for over represented Matrices in a group of coregulated sequences
• Compares to a background model
Reduce
• For Yeast and Drosophila (maybe good for others as well)
• The expression level of a gene is modeled as a sum of contributions of all binding sites in the promoter region
• Identifies motifs, and does regression analysis to infer the activity.
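The additive model can be illustrated for a single motif: expression is fitted as a baseline plus a per-copy contribution, and the fitted slope plays the role of the motif's activity. This is a deliberately reduced sketch using ordinary least squares on one motif with invented data. It is not the Reduce implementation, which handles many motifs and selects them from the data.

```python
def fit_motif_activity(motif_counts, expression):
    """Least-squares fit of expression = baseline + activity * motif_count.

    motif_counts: number of motif occurrences in each gene's promoter.
    expression:   expression level (for example, a log ratio) for each gene.
    Returns (baseline, activity).
    """
    n = len(motif_counts)
    mean_x = sum(motif_counts) / n
    mean_y = sum(expression) / n
    sxx = sum((x - mean_x) ** 2 for x in motif_counts)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(motif_counts, expression))
    activity = sxy / sxx
    baseline = mean_y - activity * mean_x
    return baseline, activity


# Invented data: each extra copy of the motif adds about 1 unit of expression.
counts = [0, 1, 2, 3]
levels = [0.5, 1.5, 2.5, 3.5]
print(fit_motif_activity(counts, levels))
```

A positive activity suggests an activating motif under the conditions measured, a negative one a repressing motif. With several motifs the same idea becomes a multiple regression.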
R-Motif
• Utilizes Expression Coherence
• Motif Characterization
• Motif Refinement (extension and mutation)
• De Novo Motif finding
• Currently only with Yeast Data
Frameworker
• Part of Genomatix
• Uses MatBase matrices
• Looks for binding sites in similar order in a given percentage of your dataset