Post on 11-Jan-2016
Alon “Alonzo” Sade & Harel “Hipoptam” Shein
Advisor: Prof. Michal Linial (AKA “M”)
29.7.08
Estimating amount of functional sequences in the genome
The ENCODE pilot project
◦ Research on ncRNAs
◦ Research on Alt. Splicing
Fun in the sun…
Understanding how our genome encodes information
How that information underpins differences between individuals and species
Currently estimated number of protein-coding genes:
◦ Human: ∼20,000-25,000
◦ Sea urchin: ∼23,000
◦ Nematode worm: ∼19,000
◦ Tetrahymena thermophila: ∼27,000 (because we don’t have yeast)
We are complex, so where is the information?
Protein-coding sequences account for <1.5% of the human genome
What is the function of the remainder?
Alternative splicing?
Do non-protein-coding sequences contain large amounts of regulatory information?
Recent discoveries suggest that the vast majority of the mammalian genome is transcribed
◦ We’ll get back to that…
Non-coding RNA
An RNA that is not translated into a protein
Many members in this family
It was assumed that leftover RNA was “junk”
2001 – Mattick claims: “more than 97% of RNA is ncRNA!”
Ancient Repeats (ARs)
A CNE that was inserted into the early mammalian lineage
Primarily transposon derived
Has since become dormant
Most are thought to be neutrally evolving
Required for replication & structural integrity of the chromosome
Encode functional products
Required for regulation or processing
◦ Includes sequences that may act as spacers
At least 70% of the mammalian genome is transcribed
Many of these transcripts are ncRNAs that show cell-specific or developmental regulation
Functionality?
Noise, or by-products of late evolution? But all of this may also indicate functionality!
Recent evidence implicates ncRNAs in control of:
◦ Chromatin structure
◦ Epigenetic memory
◦ Transcription
◦ Translation
◦ Splicing (possibly)
Most are evolving quickly but can maintain highly conserved regions within them
∼5% of small segments in mouse & human are under selection (may range between 3% and 8%)
Doesn’t include sequences that have diverged for reasons other than evolution
At the time we thought only ∼1.2% was protein-coding
Conservation is relative
Taken to be the substitution rate, measured under the assumption “functional evolving ⇔ neutral rate”
Requires an estimate of the “neutral rate of evolution”
The estimate comes from classes of sequence expected to be evolving free of constraint
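One way to picture "conservation is relative" is the minimal sketch below. It is an illustration under stated assumptions, not the method used in the studies discussed: it uses the raw fraction of mismatched aligned positions as the substitution rate (no correction for multiple hits), and the function names `substitution_rate` and `relative_conservation` are hypothetical.

```python
def substitution_rate(seq_a: str, seq_b: str) -> float:
    """Fraction of aligned, ungapped positions that differ between two sequences.

    Assumption: a raw mismatch fraction stands in for the substitution rate.
    """
    pairs = [(a, b) for a, b in zip(seq_a.upper(), seq_b.upper())
             if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no ungapped aligned positions")
    return sum(a != b for a, b in pairs) / len(pairs)


def relative_conservation(window_rate: float, neutral_rate: float) -> float:
    """Conservation relative to the neutral rate.

    0 means the window evolves at the neutral rate; 1 means no substitutions.
    The neutral rate would come from a reference class assumed to be
    unconstrained (for example ancient repeats).
    """
    if neutral_rate <= 0:
        raise ValueError("neutral rate must be positive")
    return 1.0 - window_rate / neutral_rate


# Toy example: a window with 5% substitutions against a 20% neutral rate.
print(relative_conservation(0.05, 0.20))
```

If the chosen neutral class is itself partly constrained, the neutral rate is underestimated and every window looks less conserved than it really is.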
Yes, everything is relative
[Diagram labels: 5%, conservation, natural evolution]
Classes chosen have included:
1. Mainly ARs
2. Lineage-specific nonexonic sequences
3. Synonymous sites in codons
[Diagram labels: 5%, conservation, natural evolution, 3 characteristics]
Estimates based on ARs may be biased:
◦ The annotated and aligned ARs may comprise mainly a slowly evolving subset
◦ ARs are under purifying selection
Lineage-specific nonexonic sequences and synonymous sites have also been found to be biased
The 5% study
Conservation
Neutral rate
Classes chosen have included:
◦ Lineage-specific nonexonic sequences
◦ Synonymous sites in codons
◦ ARs
None of which is unbiased
Conclusions:
Functional RNAs illustrate:
◦ Low conservation ↮ loss of functionality
◦ Many functional transcripts have more relaxed structure-function constraints
Many functional elements are unconstrained: biologically active, but providing no specific benefit to the organism
CFTR - cystic fibrosis transmembrane conductance regulator
Figure 1. Conservation in the ENCODE CFTR locus
Amount and function of the transcriptional output
Conservation Functionality estimates
The fraction of the genome under purifying selection may have been underestimated
May reach 11.8%
The ENCyclopedia Of DNA Elements
Gene
From Wikipedia, the free encyclopedia
“A gene is a locatable region of genomic sequence, corresponding to a unit of inheritance, which is associated with regulatory regions, transcribed regions and/or other functional sequence regions”
A public research consortium
Launched by the US National Human Genome Research Institute in September 2003
Goal: identify all functional elements in the human genome sequence
Top-down research
Project Phases:
◦ Pilot Phase
◦ Technological Phase
◦ Production Phase
• Goal: Evaluate a variety of different methods for use in later stages
• Using a number of existing techniques to analyse a portion of the genome equal to about 1% (30 Mb)
• 35 groups provided more than 200 experimental & computational data sets
• 50% were selected manually
• 50% were selected randomly
• The two main criteria for manual selection:
• The presence of well-studied genes or other known sequence elements
• The existence of a substantial amount of comparative sequence data
• The randomly selected sequences
• Composed of 500 kb regions
• Selected according to a stratified random-sampling strategy based on
• Gene density: #bases in genes / #other bases
• Level of non-exonic conservation
• 125-base windows, base alignment with mouse 75%+, score (prediction), took the low score
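The stratified random-sampling idea can be sketched as follows. This is an illustration only: the low/medium/high tercile binning, the number of regions drawn per stratum, and the dictionary keys `name`, `density` and `conservation` are assumptions made for the example, not details given in the talk.

```python
import random
from collections import defaultdict


def gene_density(bases_in_genes: int, other_bases: int) -> float:
    """Gene density as written on the slide: #bases in genes / #other bases."""
    return bases_in_genes / other_bases


def tercile_label(value: float, sorted_values: list) -> str:
    """Label a value low/medium/high by its rank among all candidate values."""
    n = len(sorted_values)
    rank = sum(v < value for v in sorted_values)
    if rank < n / 3:
        return "low"
    if rank < 2 * n / 3:
        return "medium"
    return "high"


def stratified_sample(regions, per_stratum, seed=0):
    """Pick up to `per_stratum` regions from each (density, conservation) stratum."""
    rng = random.Random(seed)
    densities = sorted(r["density"] for r in regions)
    conservations = sorted(r["conservation"] for r in regions)
    strata = defaultdict(list)
    for r in regions:
        key = (tercile_label(r["density"], densities),
               tercile_label(r["conservation"], conservations))
        strata[key].append(r)
    picked = []
    for key in sorted(strata):
        members = strata[key]
        picked.extend(rng.sample(members, min(per_stratum, len(members))))
    return picked
```

Sampling within strata rather than from the whole genome at once keeps gene-poor or weakly conserved regions from being crowded out by chance.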
• The technology development phase is concurrent with the pilot phase
• Goal: Investigate and develop new, high-throughput techniques and protocols suitable for the production phase
• One major challenge in the ENCODE project is annotating the large number of ncRNAs
• They are difficult to find by computational or experimental means
• Why?
• We must consider secondary structure as well as nucleotide sequence
• Structure can be detected more reliably from a set of related sequences
• RNA secondary structure is imperative when searching for structured ncRNAs
• So RNA search algorithms are expensive…
You start as hard as you can, and then, little by little, you step it up!
• In 1985 Sankoff suggested performing sequence alignment and minimum free energy folding simultaneously
• For two sequences of length n it’s O(n^6)
• Exponential in the number of sequences
• Given the high cost, for many years it rested in oblivion...
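To get a feel for why an O(n^6) method was shelved, the sketch below compares order-of-magnitude step counts. The slide gives only the two-sequence case; the general exponent 3N for N sequences is the figure usually quoted for Sankoff-style simultaneous alignment and folding, and is treated here as an assumption. The function name `sankoff_relative_cost` is hypothetical, and constant factors are ignored.

```python
def sankoff_relative_cost(length: int, num_sequences: int = 2) -> int:
    """Order-of-magnitude step count n**(3N); constant factors are ignored."""
    if length < 1 or num_sequences < 2:
        raise ValueError("need length >= 1 and at least two sequences")
    return length ** (3 * num_sequences)


# Doubling the sequence length multiplies the two-sequence cost by 2**6 = 64.
print(sankoff_relative_cost(200) // sankoff_relative_cost(100))

# Adding a third sequence of length 100 multiplies the cost by 100**3.
print(sankoff_relative_cost(100, 3) // sankoff_relative_cost(100, 2))
```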
• Several approximation attempts have been developed
• FOLDALIGN
• Dynalign
• Stemloc
• Consan
• All trying to increase performance w/o sacrificing accuracy
• They still remain relatively expensive
First align then fold
More attractive nowadays
RNAz & EvoFold use existing alignments
◦ Thousands of new potential structured ncRNAs
◦ Restricted to highly conserved segments
X As sequence similarity drops, frequent compensating base changes cause misalignments
X Assumes RNA structure is present in all sequences in the alignment
X Global alignments within fixed-width sliding window
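Compensating base changes are what make a structured RNA hard to align by sequence alone: both sides of a base pair mutate, the pair survives, but the letters no longer match. The sketch below counts such column pairs in a toy alignment. It is not how RNAz, EvoFold or CMFinder score alignments; the dot-bracket consensus structure, the uppercase RNA alphabet and the function names are assumptions made for illustration.

```python
# Watson-Crick and G-U wobble pairs, written as (5' base, 3' base).
CANONICAL = {("A", "U"), ("U", "A"), ("G", "C"),
             ("C", "G"), ("G", "U"), ("U", "G")}


def paired_columns(dot_bracket: str):
    """Return (i, j) column pairs from a nested dot-bracket structure string."""
    stack, pairs = [], []
    for i, ch in enumerate(dot_bracket):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            pairs.append((stack.pop(), i))
    return sorted(pairs)


def count_compensating_pairs(alignment, dot_bracket):
    """Count paired columns where two sequences carry canonical pairs
    that differ at BOTH positions (for example G-C in one, A-U in another)."""
    count = 0
    for i, j in paired_columns(dot_bracket):
        observed = {(seq[i], seq[j]) for seq in alignment} & CANONICAL
        if any(a[0] != b[0] and a[1] != b[1]
               for a in observed for b in observed):
            count += 1
    return count


alignment = ["GGGAAACCC",
             "GAGAAACUC",
             "GGGAAACUC"]
print(count_compensating_pairs(alignment, "(((...)))"))
```

In the toy alignment the second stem pair is G-C in one sequence and A-U in another, so the structure is supported even though the two columns disagree at the sequence level.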
CMFinder
◦ Searches a set of orthologous, unaligned sequences for conservation
◦ Doesn’t use external alignments (apart from orthology)
◦ Doesn’t use sliding windows
We scanned 2 × 56,017 blocks from UCSC MULTIZ multiple alignment files
We restricted analysis to blocks that don’t overlap exons or conserved elements
8.68 Mb (of 30 Mb), of which 3.87 Mb is repetitive sequence (RepeatMasker)
10,106 predicted motifs meeting cutoff
◦ Composite score > 5
◦ Free energy < -5
Estimated false-positive rate of 50%
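The cutoff and the false-positive estimate can be put into a few lines. This is a sketch only: the `Motif` record and its field names are hypothetical, and the thresholds are simply the two values from the slide (composite score > 5, free energy < -5).

```python
from dataclasses import dataclass


@dataclass
class Motif:
    name: str               # hypothetical identifier
    composite_score: float  # higher is better
    free_energy: float      # more negative means a more stable predicted fold


def passes_cutoff(motif: Motif, min_score: float = 5.0,
                  max_energy: float = -5.0) -> bool:
    """Apply the two strict thresholds from the slide."""
    return motif.composite_score > min_score and motif.free_energy < max_energy


def expected_true_positives(n_predicted: int, false_positive_rate: float) -> float:
    """Predictions expected to be real, given an estimated false-positive rate."""
    return n_predicted * (1.0 - false_positive_rate)


# A motif scoring (8.6, -31.4), the score pair reported later in the talk, passes.
print(passes_cutoff(Motif("example", 8.6, -31.4)))

# With 10,106 predictions and a 50% false-positive rate, about half remain.
print(expected_true_positives(10106, 0.5))
```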
Some predicted motifs overlap or are sense/antisense to each other
Considering these as single candidates, we have 6587 candidate regions
◦ Average region length: 80 nt
◦ Covering 6.1% of input
◦ More dense in nonrepetitive regions (7.9% against 3.9%)
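Collapsing overlapping motifs into single candidate regions is an interval-merging problem. The sketch below is one plausible way to do it, not the authors' pipeline: it assumes half-open (start, end) coordinates on one sequence and ignores strand, so a sense and an antisense motif at the same location merge into one region. Intervals that merely touch are merged too.

```python
def merge_intervals(intervals):
    """Merge overlapping or abutting half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps (or touches) the previous region: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(region) for region in merged]


def coverage_fraction(regions, input_length):
    """Fraction of the input covered by non-overlapping regions."""
    return sum(end - start for start, end in regions) / input_length


motifs = [(10, 50), (40, 90), (200, 260)]
regions = merge_intervals(motifs)
print(regions)
print(coverage_fraction(regions, 1000))
```

As a sanity check on the slide's numbers, 6587 regions averaging 80 nt come to roughly 0.53 Mb, which is about 6.1% of the 8.68 Mb input.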
ENCODE regions are poor in known ncRNAs
Only one known ncRNA fully overlapped our input (hsa-miR-483)
It received a high score (8.6, -31.4)
Also scored high as miRNA by RNAmicro
GENCODE annotations aim to identify all human protein-coding genes in the ENCODE regions
40% of our candidates are intergenic
60% overlap some non-exonic part of a coding gene
Elfar Torarinsson et al. Genome Res. 2008; 18: 242-251
Figure 3. Average pairwise sequence similarity of the predicted motifs versus the fraction that has been realigned compared to the
original alignments
To explore the biological relevance of our prediction methods, we selected 11 high-confidence candidates
◦ score > 9, energy < -15, length > 60, base change > 5
We tested expression of these 11 candidates and found that 8 of 11 candidates could be detected in human RNA by RT-PCR
Expression of predicted ncRNA candidates by RT-PCR analysis
ncRNAs are receiving increasing attention
First large-scale search for structured ncRNA using a local structure motif finding algorithm
One can benefit from realignment that considers sequence and structure
Identified several thousand new ncRNA candidates
Need for high-throughput methods to identify potential functions for the results
2.53 protein-coding variants per locus
Key to understanding how human complexity can be encoded by so few genes
Alt. splicing has to be demonstrated at the protein level
Many of the alternative isoforms are not likely to add functionality
So what is the advantage?
Prof. Michal Linial for the guidance
You for listening
Pheasant and Mattick (2007), Raising the estimate of functional human sequences, Genome Res., 17: 1245-1253.
The ENCODE Project Consortium (2007), Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project, Nature, 447: 799-816.
Torarinsson et al. (2008), Comparative genomics beyond sequence-based alignments: RNA structures in the ENCODE regions, Genome Res., 18:242-251.
Tress et al. (2007), The implications of alternative splicing in the ENCODE protein complement, Proc. Natl Acad. Sci. USA, 104: 5495–5500.