Outline
Transcription start sites (TSS) annotation goals
Promoter architecture in D. melanogaster
Find the initial transcribed exon
Annotate putative transcription start sitesRNA PolII ChIP-SeqSearch for core promoter motifs
Additional TSS annotation resources
Structure of a typical mRNA
Pesole G. et al. Untranslated regions of mRNAs. Genome Biology. 2002: 3(3) reviews0004.1-reviews0004.10.
Muller F element, heterochromatic, and euchromatic genes show similar expression
levels
Riddle NC, et al. Enrichment of HP1a on Drosophila chromosome 4 genes creates an alternate chromatin structure critical for regulation in this heterochromatic domain. PLoS Genet. 2012 Sep;8(9):e1002954.
Annotate transcription start sites (TSS)
Research goal:Identify motifs that enable Muller F element genes to function within a heterochromatic environment
Annotation goals:Define search regions enriched in regulatory motifs
Define precise location of TSS if possible
Define search regions where TSS could be found
Document the evidence used to support the TSS annotations
Estimated evolutionary distances with respect to D. melanogaster
D. melanogasterD. simulansD. sechelliaD. yakubaD. erectaD. ficusphilaD. eugracilisD. biarmipesD. takahashiiD. elegansD. rhopaloaD. kikkawai
D. ananassaeD. bipectinata
D. pseudoobscuraD. persimilisD. willistoni
D. mojavensisD. virilisD. grimshawi
New species sequenced by modENCODE
Species Substitutions per neutral
site
D. ficusphila 0.80
D. eugracilis 0.76
D. biarmipes 0.70
D. takahashii 0.65
D. elegans 0.72
D. rhopaloa 0.66
D. kikkawai 0.89
D. bipectinata 0.99Data from table 1 of the modENCODE comparative genomics whitepaper
GEP annotation projects
TSS annotation resourcesWalkthroughs:
Annotation of Transcription Start Sites in Drosophila
Reference:TSS Annotation Workflow
GEP Annotation Report:TSS Report Form (sample report for onecut)
Additional evidence tracks:RNA PolII ChIP-Seq (D. biarmipes only)D. mel TranscriptsConservation (Multiz alignments)
TSS annotation workflow1. Identify the ortholog
2. Note the gene structure in D. melanogaster
3. Annotate the coding exons
4. Classify the type of core promoter in D. melanogaster
5. Annotate the initial transcribed exon
6. Identify any core promoter motifs
7. Define TSS positions or TSS search regions
Evidence for TSS annotations(in general order of importance)
1. Experimental dataRNA-SeqRNA Pol II ChIP-Seq (D. biarmipes only)
2. ConservationType of TSS (peaked versus broad) in D. melanogasterSequence similarity to initial exon in D. melanogasterSequence similarity to other Drosophila species (Multiz)
3. Core promoter motifsInr, TATA box, etc.
Challenges with TSS annotations
Fewer constraints on untranslated regions (UTRs)
UTRs evolve more quickly than coding regionsOpen reading frames, compatible phases of donor and acceptor sites do not apply to UTRs
Low percent identity (~50-70%) between D. biarmipes contigs and D. melanogaster UTRs
Most gene finders do not predict UTRs
Lack of experimental dataCannot use RNA-Seq data to precisely define the TSS
RNA-Seq biases introduced by library construction
cDNA fragmentationStrong bias at the 3’ end
RNA fragmentationMore uniform coverage
Miss the 5’ and 3’ ends of the transcript
Wang Z, et al. RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet. 2009 Jan;10(1):57-63.
Gene Span
RN
A-S
eq
Read
Cou
nt
5’ 3’
Techniques for finding TSS
Identify the 5’ cap at the beginning of the mRNA
Cap Analysis of Gene Expression (CAGE)RNA Ligase Mediated Rapid Amplification of cDNA Ends (RLM-RACE)Cap-trapped Expressed Sequence Tags (5’ ESTs)
More information on these techniques:Takahashi H, et al. CAGE (cap analysis of gene expression): a protocol for the detection of promoter and transcriptional networks. Methods Mol Biol. 2012 786:181-200.
Sandelin A, et al. Mammalian RNA polymerase II core promoters: insights from genome-wide studies. Nat Rev Genet. 2007 Jun;8(6):424-36.
modENCODE TSS annotations
TSS predictions based on CAGE, RLM-RACE, and 5’ EST data
Results lifted from Release 5 to the Release 6 assembly
Two sets of modENCODE TSS predictionsTSS (Celniker)
Most recent dataset produced by modENCODE
TSS (Embryonic)Older dataset available from FlyBase GBrowse
Use TSS (Celniker) dataset as the primary evidence
Revised TSS analysis for Release 6 not yet available
Hoskins RA, et al. Genome-wide analysis of promoter architecture in Drosophila melanogaster. Genome Res. 2011 Feb;21(2):182-92
FlyBase release 6 assembly
First change in the D. melanogaster assembly since 2005Substantial changes in the heterochromatic regions
modENCODE RNA-Seq data have been re-analyzed relative to the release 6 assembly
Most of the modENCODE datasets are still in release 5
Gnomon predictions for eight Drosophila species
Based on RNA-Seq data from either the same or closely-related species
D. simulans, D. yakuba, D. erecta, D. ananassae, D. pseudoobscura, D. willistoni, D. virilis, and D. mojavensis
Predictions include untranslated regions and multiple isoforms
Records not yet available through the NCBI RefSeq database
RNA Polymerase II core promoter
Juven-Gershon T and Kadonaga JT. Regulation of gene expression via the core promoter and the basal transcriptional machinery. Dev Biol. 2010 Mar 15;339(2):225-9.
Initiator motif (Inr) contains the TSS
TFIID binds to the TATA box and Inr to initiate the assembly of the pre-initiation complex (PIC)
Peaked versus broad promoters
Kadonaga JT. Perspectives on the RNA polymerase II core promoter. Wiley Interdiscip Rev Dev Biol. 2012 Jan-Feb;1(1):40-51.
Peaked promoter(Single strong TSS)
Broad promoter(Multiple weak TSS)
50-300 bp
Promoter architecture in Drosophila
Classify core promoter based on the Shape Index (SI)
Determined by distribution of CAGE and 5’ RLM-RACE reads
Shape index is a continuum
Most promoters in D. melanogaster contain multiple TSS
Median width = 162 bp
~70% of vertebrate genes have broad promoters
Hoskins RA, et al. Genome-wide analysis of promoter architecture in Drosophila melanogaster. Genome Res. 2011 Feb;21(2):182-92.
Genes with peaked promoters show stronger spatial and tissue specificity
46% of genes with broad promoters are expressed in all stages of embryonic development
19% of genes with peaked promoters are expressed in all stages
Hoskins RA, et al. Genome-wide analysis of promoter architecture in Drosophila melanogaster. Genome Res. 2011 Feb;21(2):182-92.
Peaked and broad promoters are enriched in different core promoter motifs
Rach EA, et al. Motif composition, conservation and condition-specificity of single and alternative transcription start sites in the Drosophila genome. Genome Biol. 2009;10(7):R73.
Using modENCODE data to classify the type of core promoter in D.
melanogaster Only a subset of the modENCODE data are available through FlyBase
Evidence tracks on the D. melanogaster GEP UCSC Genome Browser [July 2014 (BDGP R6) assembly]:
FlyBase gene annotations (release 6.06)modENCODE TSS (Celniker) annotationsDNase I hypersensitive sites (DHS)9-state and 16-state chromatin modelsRNA Pol II and CBP ChIP-SeqTranscription factor binding site (TFBS) HOT spots
9-state chromatin model
Kharchenko PV, et al. Comprehensive analysis of the chromatin landscape in Drosophila melanogaster. Nature. 2011 Mar 24;471(7339):480-5.
DNaseI Hypersensitive Sites (DHS) correspond to accessible regions
Aasland R and Stewart AF. Analysis of DNaseI hypersensitive sites in chromatin by cleavage in permeabilized cells. Methods Mol Biol. 1999;119:355-62.
Ho JW, et al. Comparative analysis of metazoan chromatin organization. Nature. 2014 Aug 28;512(7515):449-52.
Classifying the D. melanogaster core promoter for each unique TSS
PeakedSingle annotated TSS with a single DHS position
IntermediateSingle annotated TSS with multiple DHS positions within a 300bp window
BroadMultiple annotated TSS with multiple DHS positions within a 300bp window
Insufficient evidenceNo TSS annotations or DHS positions
Additional DHS data from different stages of embryonic development
Thomas S, et al. Dynamic reprogramming of chromatin accessibility during Drosophila embryo development. Genome Biol. 2011;12(5):R43.
DHS data produced by the BDTNP project
Evidence tracks:Detected DHS Positions (Embryos)
DHS Read Density (Embryos)
Determine the gene structure in D. melanogaster
FlyBase: GBrowse UTR CDS
Gene Record Finder: Transcript Details
Identify the initial transcribed exon using NCBI blastn
Retrieve the sequences of the initial exons from the Transcript Details tab of the Gene Record Finder
Use placement of the flanking exons to reduce the size of the search region if possible
Increase sensitivity of nucleotide searchesChange Program Selection to blastnChange Word size to 7Change Match/Mismatch Scores to +1, -1Change Gap Costs to Existence: 2, Extension: 1
Optimize alignment parameters based on expected levels of conservation
Derive alignment scores using information theory
Relative entropy of target and background frequencies
Match +2, Mismatch -3 optimized for 90% identity
Match +1, Mismatch -1 optimized for 75% identity
States DJ, et al. Improved Sensitivity of Nucleic Acid Database Searches Using Application-Specific Scoring Matrices. Methods 3:66-70.
RNA PolII ChIP-Seq tracks (D. biarmipes only)
Show regions that are enriched in RNA Polymerase II compared to input DNA
Gene Models
RNA PolII Peaks
RNA PolII Enrichment
RNA-Seq
Use core promoter motifs to support TSS annotationsSome sequence motifs are enriched in the region (~300 bp) surrounding the TSS
Some motifs (e.g., Inr, TATA) are well-characterized
Other motifs are identified based on computational analysis
Presence of core promoter motifs can be used to support the TSS annotations
Inr motif (TCAKTY) overlaps with the TSS (-2 to +4)
Absence of core promoter motifs is a negative result
Most D. melanogaster TSS do not contain the Inr motif
Use UCSC Genome Browser Short Match to find Drosophila core promoter motifs
Ohler U, et al. Computational analysis of core promoters in the Drosophila genome. Genome Biol. 2002; 3(12):RESEARCH0087.
TATA box
Initiator (Inr)
Available under “Projects” “Annotation Resources” “Core Promoter Motifs” on the GEP web site: http://gander.wustl.edu/~wilson/core_promoter_motifs.html
Core Promoter Motifs tracks
Show core promoter motif matches for each contig
Separated by strand
Visualize matches to different core promoter motifs
Use UCSC Table Browser (or other means) to export the list of motif matches within the search region
Using RNA PolII ChIP-Seq and RNA-Seq data to define the TSS search region
TSS search region
D. mel Transcripts
RNA PolII Peaks
RNA PolII Enrichment
RNA-Seq
TSS annotation for Rad23
TSS position: 28,936Conservation with D. melanogaster
blastn of initial exon and the D. mel Transcripts track
Location of the Inr motif
TSS search region: 28,716-28,936Enrichment of RNA PolII upstream of the TSS positionRNA-Seq read coverage upstream of the TSS positionSearch region defined by the extent of the RNA PolII peak
TSS annotation section in the GEP Annotation Report Form
TSS report formClassify the type of core promoterEvidence that supports or refutes the TSS annotationDistribution of core promoter motifs in the project and in D. melanogaster
Sample TSS report for onecut on the GEP web site
Under the “Beyond Annotation” section
D. biarmipes Muller F element TSS annotation projects
Reconcile coding region annotations submitted by GEP students
Reconciled annotations might be incorrectBased on older FlyBase releasePossible misannotations
See the “Revised gene models report form” section of the Transcription Start Sites Project Report
Additional TSS annotation resources
The D. melanogaster gene annotations are the primary source of evidence
Resources that might be useful if the D. melanogaster evidence is ambiguous
Whole genome alignments of 14 Drosophila speciesPhastCons and PhyloP conservation scores
Genome browsers for nine Drosophila speciesRNA Pol II ChIP-Seq (D. biarmipes only)
RNA-Seq coverage, TopHat junctions, assembled transcripts
Augustus and N-SCAN gene predictions
Conservation tracks on the D. melanogaster GEP UCSC Genome Browser
Whole genome alignments of 14 Drosophila speciesDrosophila Chain/Net composite track
Generate multiple sequence (Multiz) alignments from these pairwise alignments
Identify conserved regions from Multiz alignmentsPhastCons: identify conserved elementsPhyloP: measure level of selection at each nucleotide
Multiz alignment of 27 insect species available on the official UCSC Genome Browser
Aug. 2014 (BDGP Release 6 + ISO1 MT/dm6) assembly
Use the conservation tracks to identify regions under selection
PhyloP scores:
Under negative selection
Fast-evolving
Use RNA PolII tracks on the D. biarmipes genome browser to identify putative TSS
April 2013 (BCM-HGSC/Dbia_2.0) assembly
Search for orthologous regions in D. elegans
Use RNA-Seq data to predict untranslated regions and putative TSS
TSS predictions available for 9 Drosophila species
N-SCAN+PASA-EST, Augustus, TransDecoder
D. mel Proteins
N-SCANAugustus
TransDecoder
RNA-Seq
TSS annotation summary
Most of the D. melanogaster core promoters have multiple TSS
Classify the type of promoter (peaked/intermediate/broad) based on the transcriptome evidence from D. melanogasterDefine search regions that might contain TSS
Use multiple lines of evidence to infer the TSS region
Identify initial exonRNA-Seq coverage
blastn (change search parameters)
RNA PolII ChIP-Seq peaksDistribution of core promoter motifs (e.g., Inr)Maintain conservation compared to D. melanogaster
Top Related