Na r cisus

Post on 18-Jul-2015

148 views 1 download

Tags:

Transcript of Na r cisus

“NaRCiSuS”

Noncoding RNA Comparitive Searching System

Marie Curie European Reintegration Grants (ERG)

Call: FP7-PEOPLE-RG-2009

What are noncoding RNAs?

• RNA transcripts

• without long translable open reading frame

• highly structured

• realising their functions as RNAs

What do they do?

• RNA-protein machine:– Transfer RNA (tRNA).– Ribosomal RNA (rRNA).– RNAs (snRNAs) in spliceosome.

• Catalytic RNAs (ribozymes): catalyzing some functions.

• Micro RNAs (miRNAs): regulatory roles.• Small interfering RNAs (siRNAs): RNA silencing

– The genome’s immune system. [Plasterk, Science (2002)]– The breakthrough of the year by Science magazine in 2002.

Protein coding gene prediction

1. Prokaryota

• translation initiation site• ribosome binding site (Kozak sequence)• ORF features – length, similarity to known proteins,

codon usage, coding capacity

2. Eukaryota

• poli-A site• 3’, 5’ UTR• axon-intron location, splice sites

ncRNA gene prediction ?

• no translation initiation site

• no ribosome binding site

• no KOZAK sequence

• no significant ORF

ncRNA gene prediction ?

• base composition statistics

• simple translated regions search

• secondary structure based search

• combined comparative approach

RNA secondary structure

• ncRNA is not a random sequence.• Most RNAs fold into particular base-paired

secondary structure.

• Canonical basepairs:– Watson-Crick basepairs:

• G - C• A - U

– Wobble basepair:• G – U

RNA secondary structure

• Stacks: continuous nested basepairs. (energetically favorable)

• Non-basepaired loops:– Hairpin loop.– Bulge.– Internal loop.– Multiloop.

RNA secondary structure

• Most basepairs are non-crossing basepairs.– Any two pairs (i, j) and (i’,j’) i < i’ < j’ < j or i’ < i < j < j’

• Pseudoknots are the crossing basepairs.

Pseudoknots• Pseudoknots are important for certain ncRNAs• Violate the non-crossing assumption.• Pseudoknots make most problems harder • We assume there are no pseudoknots otherwise

noted.

[Rivas and Eddy (1999)]

ncRNA evolution is constrained by it secondary

structure

http://www.sanger.ac.uk/Software/Rfam/

• Drastic sequence changes can be tolerated.• Compensatory mutations are very common.

– One basepair mutates into another basepair.– Doesn’t change its secondary structure.

• In this talk:

ncRNA – conserved structured RNA.

tRNA1:

tRNA2:

Compensatory mutation

Secondary Structure alone is not enough

• de novo prediction:– Find stable secondary

structure from genome. [Shapiro et al. (1990)]

• The stability of ncRNA secondary structure is not sufficiently different from the predicted stability of a random sequence. [Rivas and Eddy (2000)].

– Look transcript signals. [Wassarman et al. (2001), Argaman et al. (2000)]

• ncRNA transcript signals are not strong.

• protein coding gene signals (open reading frame, promoter).

[Rivas and Eddy (2000)].

RNA secondary structure prediction

• It is a basic issue in ncRNA analysis

• It is important information to the biologists.

• Searching and alignment algorithms are based on these models.

• RNA secondary structure -- a set of non-crossing base pairs.

Base pair maximization problem

• A simple energy model is to maximize the number of basepairs to minimize the free energy. [Waterman (1978), Nussinov et al (1978), Waterman and Smith (1978)]

• G – C, A – U, and G – U are treated as equal stability.

• Contributions of stacking are ignored.

A dynamic programming solution

• Let s[1…n] be an RNA sequence.• δ(i,j) = 1 if s[i] and s[j] form a complementary base

pair, else δ(i,j) = 0.• M(i,j) is the maximum number of base pairs in s[i…j].

[Nussinov (1980)]

A dynamic programming solution

• M(1,n) is the number of base pairs in the optimal basepaired structure for s[1…n].

• All these basepairs can be found by tracing back through the matrix M.

• Filling M needs O(n3) time.

RNA structure: example

0

1 1

1 1 0

2 2 1 1

3 2 1 1 0

i 1 2 3 4 5 6j

3

4

5

6

2A C G A U UA C G A U U1 2 3 4 5 61 2 3 4 5 6

Zuker-Sankoff minimum energy model

• Stacks (contiguous nested base pairs) are the dominant stabilizing force – contribute the negative energy

• Unpaired bases form loops contribute the positive energy.– Hairpin loops, bulge/internal loops, and multiloops.

• Zuker-Sankoff minimum energy model. [Zuker and Sankoff (1984), Sankoff (1985)]

• Mfold and ViennaRNA are all based on this model. (this model is also called mfold model)

Zuker-Sankoff minimum energy model

:eH(i,j)

:a+3*b+4*c

:eL(i,j,i’,j’)

ij

i

j J’

i’

[Lyngsø (1999)]

ii+1

j j-1:eS(i,j,i+1,j-1)

RNA minimum energy problem

• This problem can be solved by a dynamic programming algorithm in O(n4) time.

• Lyngsø et al. (1999) revise the energy function for internal loop, proposed an O(n3) time solution.

Zuker-Sankoff model

Recursive functions

• W(i) holds the minimum energy of a structure on s[1…i].

• V(i,j) holds the minimum energy of a structure on s[i…j] with s[i] and s[j] forming a basepair.

• WM(i,j) holds the minimum energy of a structure on s[i…j] that is part of multiloop.

Recursive functions (Zuker)• W(i) holds the

minimum energy of a structure on s[1…i].

• V(i,j) holds the minimum energy of a structure on s[i…j] with s[i] and s[j] forming a basepair.

• WM(i,j) holds the minimum energy of a structure on s[i…j] that is part of multiloop.

A recursive solution• W(i) holds the

minimum energy of a structure on s[1…i].

• V(i,j) holds the minimum energy of a structure on s[i…j] with s[i] and s[j] forming a basepair.

• WM(i,j) holds the minimum energy of a structure on s[i…j] that is part of multiloop.

A recursive solution• W(i) holds the

minimum energy of a structure on s[1…i].

• V(i,j) holds the minimum energy of a structure on s[i…j] with s[i] and s[j] forming a basepair.

• WM(i,j) holds the minimum energy of a structure on s[i…j] that is part of multiloop.

Idea of the project• Integrated bioinformatics platform that is specifically addressed

for detecting, verifying, and classifying of noncoding RNAs. • This complex approach to "Computational RNomics" will provide

the pipeline which will be capable of detecting RNA motifs with low sequence conservation.

• It will also integrate RNA motif prediction which should significantly improve the quality of the RNA homolog search.

• The secondary structure of the RNA can be represented as a graph.

• Graph can distinguish RNA of known families from Rfam database.

Goals

• Using experssed sequence tags and profile-profile protein like method for predicting ORF.

• The candidates will be folded “ab initio” usinf method proposed by zucker (MFold)

• The secondary strucetres of the candidates can be then compared to sequence form Rfam or ncRNA database to find novel ncRNAs or detect missing members of already know families.

Service Features• Prediction missing members of already defined RNA

families • Describe properties of ncRNA and measure

confidence level of their functional importance • Prediction of novel ncRNA “ab initio”. • Visualization secondary and tetriary structure “online” • Exchange formats of RNA annotation. • Identify interaction with coding genes. • Design novel RNA’s for therapy and drug delivery

(RNAi)

SERVER

NCBI

NCBI

Sanger

Repeatmasker

BLASTz

tBLASTn

BLASTn

mRNA/EST%GC

PromH

NCBI

Loop Scaner QRNA

mFOLD

Human Mouse

Human Mouse

PFAM

NR

EST

Remove repeats

Searching for sequence of high

functionality

Searching for coding sequences

Comparison of expression patterns

Searching for sequences of well defined structure

orcontaining compensatory

mutations

Searching for remaining genes

1

2

3

4

5

6