Alon “Alonzo” Sade & Harel “Hipoptam” Shein Advisor: Prof. Michal Linial (AKA “M”)...

Post on 11-Jan-2016


Alon “Alonzo” Sade & Harel “Hipoptam” Shein
Advisor: Prof. Michal Linial (AKA “M”)

29.7.08

Estimating amount of functional sequences in the genome

The ENCODE pilot project
◦ Research on ncRNAs
◦ Research on alternative splicing

Fun in the sun…

Understanding how our genome encodes information

How that information underpins differences between individuals and species

Currently estimated number of protein-coding genes:
◦ Human: ∼20,000-25,000
◦ Sea urchin: ∼23,000
◦ Nematode worm: ∼19,000
◦ Tetrahymena thermophila: ∼27,000 (because we have no yeast)

We are complex, so where is the information?

Protein-coding sequences account for <1.5% of the human genome

What is the function of the remainder?

Alternative splicing?

Non-protein-coding sequences contain large amounts of regulatory information?

Recent discoveries say that the vast majority of the mammalian genome is transcribed
◦ We’ll get back to that…

Non-coding RNA

An RNA that is not translated into a protein

Many members in this family

It was assumed that leftover RNA was “junk”

2001 – Mattick claims: “more than 97% of RNA is ncRNA!”

Ancient Repeats (ARs)

A CNE that was inserted into the early mammalian lineage

Primarily transposon derived

Has since become dormant

Most are thought to be neutrally evolving

Required for replication & structural integrity of the chromosome

Encode functional products

Required for regulation or processing

◦ Includes sequences that may act as spacers

At least 70% of the mammalian genome is transcribed

Many of these transcripts are ncRNAs that show cell-specific or developmental regulation

Functionality?

Noise, or by-products for later evolution

But all of this may also indicate functionality!

Recent evidence implicates ncRNAs in control of:
◦ Chromatin structure
◦ Epigenetic memory
◦ Transcription
◦ Translation
◦ Splicing (possibly)

Most evolve quickly but can retain highly conserved regions

∼5% of small segments in mouse & human are under selection (may range between 3% and 8%)

Doesn’t include sequences that have diverged for other reasons than evolution

At the time we thought only ∼1.2% is protein-coding


Conservation is relative

Taken to be the substitution rate, measured under the assumption “functional ⇔ evolving below the neutral rate”

Requires estimate of the “neutral rate of evolution”

Classes expected to be evolving free of constraint

Yes, everything is relative

[Diagram: 5% → conservation → natural (neutral) evolution]

Classes chosen have included:

1. Mainly ARs

2. Lineage-specific nonexonic sequences

3. Synonymous sites in codons

[Diagram: 5% → conservation → natural (neutral) evolution → 3 characteristics]

Estimates based on ARs may be biased:
◦ The annotated and aligned ARs may comprise mainly a slowly evolving subset
◦ ARs are under purifying selection

Lineage-specific nonexonic sequences and synonymous sites have also been found to be biased

The 5% study

Conservation

Neutral-rate classes chosen have included:
◦ Lineage-specific nonexonic sequences
◦ Synonymous sites in codons
◦ ARs

None of which is unbiased


Conclusions:

Functional RNAs illustrate:
◦ Low conservation ↮ loss of function
◦ Many functional transcripts have more relaxed structure-function constraints

Many functional elements are seemingly unconstrained: biologically active, but providing no specific benefit to the organism

CFTR - cystic fibrosis transmembrane conductance regulator

Figure 1. Conservation in the ENCODE CFTR locus

Amount and function of the transcriptional output

Conservation Functionality estimates

The fraction of the genome under purifying selection may have been underestimated

May get to 11.8%

The ENCyclopedia Of DNA Elements

Gene
From Wikipedia, the free encyclopedia

“A gene is a locatable region of genomic sequence, corresponding to a unit of inheritance, which is associated with regulatory regions, transcribed regions and/or other functional sequence regions”

A public research consortium

Launched by the US National Human Genome Research Institute in September 2003

Goal: identify all functional elements in the human genome sequence

Top-down research

Project phases:
◦ Pilot Phase
◦ Technological Phase
◦ Production Phase

• Goal: Evaluate a variety of different methods for use in later stages

• Using a number of existing techniques to analyse a portion of the genome equal to about 1% (30 Mb)

• 35 groups provided more than 200 experimental & computational data sets

• 50% were selected manually
• 50% were selected randomly

• The two main criteria for manual selection:
  • The presence of well-studied genes or other known sequence elements
  • The existence of a substantial amount of comparative sequence data

The randomly selected sequences
• composed of 500 kb regions
• selected according to a stratified random-sampling strategy based on
  • gene density: #bases in genes / #other bases
  • level of non-exonic conservation: 125-base windows, at least 75% of bases aligned with mouse, score (prediction), took the low score
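The stratified random-sampling idea can be sketched as follows. This is only an illustration, not the consortium's procedure: the three-by-three binning into low, middle and high thirds, the dictionary keys (`name`, `gene_density`, `conservation`) and the function names are all assumptions made for the example.

```python
import random
from collections import defaultdict


def tercile(value, sorted_values):
    """Return 0, 1 or 2 for the low, middle or high third of sorted_values."""
    rank = sum(v < value for v in sorted_values)
    return min(2, 3 * rank // len(sorted_values))


def stratified_pick(regions, per_stratum, seed=0):
    """Pick up to per_stratum regions from each (density, conservation) stratum.

    regions: list of dicts with the hypothetical keys "name",
    "gene_density" and "conservation". The 3 x 3 binning is an assumption.
    """
    rng = random.Random(seed)
    densities = sorted(r["gene_density"] for r in regions)
    conservations = sorted(r["conservation"] for r in regions)

    # Group candidate regions by their pair of tercile labels.
    strata = defaultdict(list)
    for r in regions:
        key = (tercile(r["gene_density"], densities),
               tercile(r["conservation"], conservations))
        strata[key].append(r)

    # Draw independently inside every stratum so each one is represented.
    picked = []
    for key in sorted(strata):
        members = strata[key]
        picked.extend(rng.sample(members, min(per_stratum, len(members))))
    return picked
```

Sampling inside each stratum, rather than across the whole genome at once, keeps gene-poor or weakly conserved regions from being crowded out by the more common kinds.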

• The technology development phase is concurrent with the pilot phase

• Goal: Investigate and develop new, high-throughput techniques and protocols suitable for the production phase

• One major challenge in the ENCODE project is annotating the large number of ncRNAs

• They are difficult to find by computational or experimental means

• Why?

• We must consider secondary structure as well as nucleotide sequence

• Structure can be detected more reliably from a set of related sequences

• RNA secondary structure is imperative when searching for structured ncRNAs

• So RNA search algorithms are expensive…
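To give a feel for why structure makes the search expensive, here is a minimal sketch of Nussinov-style base-pair maximization. It is a deliberately simpler stand-in for minimum-free-energy folding and is not the method used by any tool named in these slides; the function names and the `min_loop` default are assumptions. Even this toy version on a single sequence needs three nested loops, so it is cubic in the sequence length.

```python
def can_pair(a, b):
    """Watson-Crick pairs plus the G-U wobble pair."""
    return {a, b} in ({"A", "U"}, {"G", "C"}, {"G", "U"})


def max_base_pairs(seq, min_loop=3):
    """Maximum number of nested base pairs in seq (Nussinov-style DP).

    min_loop is the minimum number of unpaired bases enclosed by a pair.
    """
    n = len(seq)
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]  # case 1: position j stays unpaired
            for k in range(i, j - min_loop):  # case 2: j pairs with some k
                if can_pair(seq[k], seq[j]):
                    left = best[i][k - 1] if k > i else 0
                    score = max(score, left + 1 + best[k + 1][j - 1])
            best[i][j] = score
    return best[0][n - 1] if n else 0
```

For example, `max_base_pairs("GGGAAAUCCC")` finds the three G-C pairs of a simple hairpin.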

You start as hard as you can, and little by little, you push harder!

• In 1985 Sankoff suggested performing sequence alignment and minimal free energy folding simultaneously

• For two sequences of length n it’s O(n^6)
• Exponential in the number of sequences
• Given the high cost, for many years it rested in oblivion...
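A back-of-the-envelope sketch of that growth, using the commonly quoted bounds for Sankoff's algorithm on k sequences of length n: time on the order of n^(3k) and memory on the order of n^(2k). The function name is hypothetical and constant factors are ignored, so the numbers indicate scale only.

```python
def sankoff_cost(n, k=2):
    """Order-of-magnitude step and table-cell counts for k sequences of length n.

    Uses the commonly quoted O(n^(3k)) time and O(n^(2k)) memory bounds;
    constant factors are ignored.
    """
    return {"time": n ** (3 * k), "memory": n ** (2 * k)}
```

With n = 100, two sequences already give about 10^12 steps, and a third sequence raises that to about 10^18.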

• Several approximation attempts have been developed
  • FOLDALIGN
  • Dynalign
  • Stemloc
  • Consan

• All trying to increase performance w/o sacrificing accuracy

• They still remain relatively expensive

First align, then fold

More attractive nowadays

RNAz & EvoFold use existing alignments
◦ Thousands of new potential structured ncRNAs
◦ Restricted to highly conserved segments

X As sequence similarity drops, frequent compensating base changes cause misalignments

X Assumes RNA structure is present in all sequences in the alignment

X Global alignments within fixed-width sliding window

CMFinder
◦ Searches a set of orthologous, unaligned sequences for conservation
◦ Doesn’t use external alignments (except for orthology)
◦ Doesn’t use sliding windows

We scanned 2 × 56,017 blocks from UCSC MULTIZ multiple alignment files

We restricted analysis to blocks that don’t overlap exons or conserved elements

8.68 Mb (of 30 Mb), of which 3.87 Mb are repetitive sequences (RepeatMasker)

10,106 predicted motifs meeting the cutoff
◦ Composite score > 5
◦ Free energy < -5

Estimated false-positive rate of 50%

Some predicted motifs overlap

Some are sense/antisense to each other

Considering these as single candidates, we have 6587 candidate regions
◦ Average region length: 80 nt
◦ Covering 6.1% of input
◦ More dense in nonrepetitive regions (7.9% against 3.9%)
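The cutoff and the collapsing of overlapping or sense/antisense motifs into single candidate regions can be sketched as below. This is an illustration under stated assumptions, not the authors' pipeline: the dictionary keys (`start`, `end`, `score`, `energy`), the half-open coordinates and the function names are hypothetical, and strand is simply ignored when merging.

```python
def passes_cutoff(motif, min_score=5.0, max_energy=-5.0):
    """Keep motifs with composite score > 5 and free energy < -5."""
    return motif["score"] > min_score and motif["energy"] < max_energy


def merge_regions(motifs):
    """Merge overlapping motifs that pass the cutoff into candidate regions.

    motifs: dicts with hypothetical keys "start", "end" (half-open),
    "score" and "energy". Strand is ignored, so sense and antisense
    predictions at the same locus collapse into one region.
    """
    intervals = sorted((m["start"], m["end"]) for m in motifs if passes_cutoff(m))
    regions = []
    for start, end in intervals:
        if regions and start < regions[-1][1]:
            # Overlaps the previous region: extend it.
            regions[-1][1] = max(regions[-1][1], end)
        else:
            regions.append([start, end])
    return [tuple(r) for r in regions]


def coverage(regions, input_length):
    """Fraction of the scanned input covered by the candidate regions."""
    return sum(end - start for start, end in regions) / input_length
```

Sorting first means each motif only has to be compared with the most recent region, so the merge is a single pass.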

ENCODE regions are poor in known ncRNAs

Only one known ncRNA fully overlapped our input (hsa-miR-483)

It received a high score (composite score 8.6, free energy -31.4)

It also scored high as a miRNA by RNAmicro

GENCODE annotations aim to identify all human protein-coding genes in the ENCODE regions

40% of our candidates are intergenic

60% overlap some non-exonic part of a coding gene

Elfar Torarinsson et al. Genome Res. 2008; 18: 242-251

Figure 3. Average pairwise sequence similarity of the predicted motifs versus the fraction that has been realigned compared to the original alignments
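For readers unfamiliar with the x-axis quantity, here is a minimal sketch of one way to compute average pairwise sequence similarity from aligned rows. The exact definition used in the paper is not given in these slides; the gap handling here (columns where both rows have a gap are skipped, a gap against a base counts as a mismatch) and the function names are assumptions.

```python
from itertools import combinations


def pairwise_identity(a, b, gap="-"):
    """Fraction of identical columns between two aligned rows.

    Columns where both rows have a gap are skipped; a gap opposite a base
    counts as a mismatch (an assumption, other conventions exist).
    """
    cols = [(x, y) for x, y in zip(a, b) if not (x == gap and y == gap)]
    if not cols:
        return 0.0
    return sum(x == y for x, y in cols) / len(cols)


def average_pairwise_identity(rows):
    """Mean identity over all unordered pairs of aligned rows."""
    pairs = list(combinations(rows, 2))
    if not pairs:
        raise ValueError("need at least two aligned sequences")
    return sum(pairwise_identity(a, b) for a, b in pairs) / len(pairs)
```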

To explore the biological relevance of our prediction methods, we selected 11 high-confidence candidates
◦ score > 9, energy < -15, length > 60, base changes > 5

We tested expression of these 11 candidates and found that 8 of 11 candidates could be detected in human RNA by RT-PCR

Expression of predicted ncRNA candidates by RT-PCR analysis


ncRNAs are receiving increasing attention

First large-scale search for structured ncRNAs using a local structure motif finding algorithm

One can benefit from realignment that considers both sequence and structure

Identified several thousand new ncRNA candidates

Need for high-throughput methods to identify potential functions for the results

2.53 protein-coding variants per locus

Key to understanding how human complexity can be encoded by so few genes

Alt. splicing has to be demonstrated at the protein level

Many of the alternative isoforms are not likely to add functionality

So what is the advantage ?

Prof. Michal Linial for the guidance

You for listening

Pheasant and Mattick (2007), Raising the estimate of functional human sequences, Genome Res., 17: 1245-1253.

The ENCODE Project Consortium (2007), Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project, Nature, 447: 799-816.

Torarinsson et al. (2008), Comparative genomics beyond sequence-based alignments: RNA structures in the ENCODE regions, Genome Res., 18:242-251.

Tress et al. (2007), The implications of alternative splicing in the ENCODE protein complement, Proc. Natl Acad. Sci. USA, 104: 5495–5500.