RNA structure analysis Jurgen Mourik & Richard Vogelaars Utrecht University.


Page 1: RNA structure analysis Jurgen Mourik & Richard Vogelaars Utrecht University.

RNA structure analysis

Jurgen Mourik &

Richard Vogelaars

Utrecht University

Page 2: RNA structure analysis Jurgen Mourik & Richard Vogelaars Utrecht University.

RNA structure analysis2

Overview

• Introduction to RNA

• RNA secondary structure prediction
  – Nussinov folding algorithm
  – Zuker folding algorithm

• Demonstration

• Questions

Page 3: RNA structure analysis Jurgen Mourik & Richard Vogelaars Utrecht University.

RNA structure analysis3

Introduction to RNA (1)

• Ribonucleic acid

• To many people:
  – “RNA is the passive intermediary messenger between DNA genes and the protein translation machinery”

• But:
  – Many non-coding RNAs exist
    • Adopt sophisticated 3D structures
    • Catalyse biochemical reactions

Page 4: RNA structure analysis Jurgen Mourik & Richard Vogelaars Utrecht University.

RNA structure analysis4

Introduction to RNA (2)

• Three major types of RNA

• Messenger RNA (mRNA)
  – Serves as a temporary copy of a gene that is used as a template for protein synthesis.

• Transfer RNA (tRNA)
  – Functions as an adaptor molecule that decodes the genetic code.

• Ribosomal RNA (rRNA)
  – Catalyzes the synthesis of proteins.

Page 5: RNA structure analysis Jurgen Mourik & Richard Vogelaars Utrecht University.

RNA structure analysis5

RNA world hypothesis

• RNA is the only biological polymer that serves as both a catalyst (like proteins) and as information storage (like DNA).

• For this reason some people think that an RNA-like molecule was the basis of life early in evolution.

Page 6: RNA structure analysis Jurgen Mourik & Richard Vogelaars Utrecht University.

RNA structure analysis6

Terminology of RNA (1)

• Four nucleotides:
  – Adenine
  – Cytosine
  – Guanine
  – Uracil

• Canonical base pairs:
  – G-C
  – A-U

• Non-canonical base pairs:
  – G-U

Page 7: RNA structure analysis Jurgen Mourik & Richard Vogelaars Utrecht University.

RNA structure analysis7

Terminology of RNA (2)

• Base pairs are approximately coplanar and almost always stacked onto other base pairs in an RNA structure
  – Contiguous stacked base pairs are called stems
  – In 3D, RNA stems generally form a regular double helix

Page 8: RNA structure analysis Jurgen Mourik & Richard Vogelaars Utrecht University.

RNA structure analysis8

RNA secondary structure

• Unlike DNA, RNA is typically produced as a single stranded molecule which then folds intramolecularly to form a number of short base-paired stems. This base-paired structure is called the secondary structure of the RNA.

Page 9: RNA structure analysis Jurgen Mourik & Richard Vogelaars Utrecht University.

RNA structure analysis9

Elements of an RNA secondary structure (1)

• Loop: single stranded subsequence bounded by base pairs

• Hairpin loop: a loop at the end of a stem

• Bulge (loop): single stranded bases occurring within a stem

• Interior loop: single stranded bases interrupting both sides of a stem

• Multi-branched loop: a loop from which three or more stems radiate

Page 10: RNA structure analysis Jurgen Mourik & Richard Vogelaars Utrecht University.

RNA structure analysis10

Elements of an RNA secondary structure (2)

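[Figure: example RNA secondary structure drawn from the 5’ end to the 3’ end, showing a stem of stacked base pairs (G●C, G●C, U●A, A●U, C●G) together with unpaired bases.]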

Page 11: RNA structure analysis Jurgen Mourik & Richard Vogelaars Utrecht University.

RNA structure analysis11

Pseudoknots (1)

• Base pairs almost always occur in a nested fashion in RNA secondary structure

• A base pair between positions i and j and a base pair between positions i’ and j’ are nested if and only if:

$$ i < i' < j' < j \quad \text{or} \quad i' < i < j < j' $$

• Non-nested base pairs are called pseudoknots
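The nesting condition can be checked mechanically. The following is a minimal sketch, not taken from the slides: it assumes a structure is given as a list of 1-based position pairs, and it reads "non-nested" as crossing pairs (i < i’ < j < j’), so that side-by-side stems are not reported. The function name `crossing_pairs` is hypothetical.

```python
from itertools import combinations

def crossing_pairs(pairs):
    """Return every pair of base pairs that cross each other.

    Each base pair is a tuple of two 1-based positions. Two base pairs
    (i, j) and (k, l) with i < k cross when i < k < j < l; such crossings
    are what the slides call pseudoknots.
    """
    # Normalise each pair so that the smaller position comes first.
    normalised = sorted((min(a, b), max(a, b)) for a, b in pairs)
    crossings = []
    for (i, j), (k, l) in combinations(normalised, 2):
        # sorted() guarantees i <= k, so only one ordering must be tested.
        if i < k < j < l:
            crossings.append(((i, j), (k, l)))
    return crossings

# A purely nested stem has no crossings; the second structure has one.
print(crossing_pairs([(1, 10), (2, 9), (3, 8)]))
print(crossing_pairs([(1, 10), (5, 15)]))
```

The check compares all pairs of base pairs, which is quadratic in the number of base pairs; that is adequate for illustrating the definition.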

Page 12: RNA structure analysis Jurgen Mourik & Richard Vogelaars Utrecht University.

RNA structure analysis12

Pseudoknots (2)

• The dynamic programming algorithms discussed here, the Nussinov and Zuker RNA folding algorithms, cannot deal with pseudoknots.

• Pseudoknots occur in many important RNAs:
  – The algorithms ignore biologically important information.

• For database searching for RNA homologues, it is acceptable to sacrifice the information in pseudoknots.

Page 13: RNA structure analysis Jurgen Mourik & Richard Vogelaars Utrecht University.

RNA structure analysis13

RNA sequence evolution

• The sequence evolution of RNA is constrained by the structure.

• It is possible to have two different RNA sequences with the same secondary structure.

• Drastic changes in sequence can often be tolerated as long as compensatory mutations maintain base-pairing complementarity.

Page 14: RNA structure analysis Jurgen Mourik & Richard Vogelaars Utrecht University.

RNA structure analysis14

RNA sequence evolution (2)

• Suppose we want to search a nucleotide sequence for occurrences of the consensus R17 coat protein binding site:
  – It is useless to use standard sequence alignment

• R17 coat protein binds and represses translation of its replicase:
  – It blinds most of the primary sequence positions

[Figure: consensus R17 coat protein binding site. Legend entries: {A, C, G, U}; {A, G}; {C, U}; complement of base N.]

Page 15: RNA structure analysis Jurgen Mourik & Richard Vogelaars Utrecht University.

RNA structure analysis15

RNA sequence evolution (3)

• How to solve this problem?
  – RNA pattern-matching program (RNAMOT).

• Searches for deterministic (non-stochastic) motifs but with secondary structure constraints as extra terms.

• Works fine for small, well-defined patterns but is somewhat insensitive and problematic for finding matches to less well conserved structures.

Page 16: RNA structure analysis Jurgen Mourik & Richard Vogelaars Utrecht University.

RNA structure analysis16

Inferring structure by comparative sequence analysis

• In a structurally correct multiple alignment of RNAs, conserved base pairs are often revealed by the presence of frequent correlated compensatory mutations

• RNA secondary structure prediction method: comparative sequence analysis

• The accepted consensus structures of most well-studied RNAs have been derived by comparative analysis.

Page 17: RNA structure analysis Jurgen Mourik & Richard Vogelaars Utrecht University.

RNA structure analysis17

How does comparative sequence analysis work? (1)

• Inferring the correct structure by comparative analysis requires knowing a structurally correct alignment

• Inferring a structurally correct multiple alignment requires knowing the correct structure

Problem!

Page 18: RNA structure analysis Jurgen Mourik & Richard Vogelaars Utrecht University.

RNA structure analysis18

How does comparative sequence analysis work? (2)

• Solution: make use of an iterative refinement process of:
  – Guessing the structure based on the current best guess of the alignment
  – Realigning based on the new guess at the structure

• The sequences to be compared must be:
  – Sufficiently similar to start the process
  – Sufficiently dissimilar that a number of co-varying substitutions can be detected

Page 19: RNA structure analysis Jurgen Mourik & Richard Vogelaars Utrecht University.

RNA structure analysis19

Mutual information (1)

• A quantitative measure of pairwise sequence covariation

• Given two aligned columns i, j, the mutual information is given by:

$$ M_{ij} = \sum_{x_i, x_j} f_{x_i x_j} \log_2 \frac{f_{x_i x_j}}{f_{x_i} f_{x_j}} $$

– $f_{x_i}$: the frequency of one of the four bases observed in column i.

– $f_{x_i x_j}$: the joint (pairwise) frequency of one of the sixteen possible base pairs observed in columns i, j.

Mij varies between 0 and 2 bits
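As a small illustration of the mutual information measure, the sketch below computes it for two alignment columns given as strings of equal length. The function name and the example columns are made up for this illustration; frequencies are taken as plain observed counts divided by the number of sequences.

```python
from collections import Counter
from math import log2

def mutual_information(col_i, col_j):
    """Mutual information (in bits) between two aligned columns.

    col_i and col_j hold the residue of each sequence at columns i and j,
    in the same sequence order.
    """
    n = len(col_i)
    f_i = Counter(col_i)             # single-column counts for column i
    f_j = Counter(col_j)             # single-column counts for column j
    f_ij = Counter(zip(col_i, col_j))  # joint counts of observed pairs
    m = 0.0
    for (a, b), count in f_ij.items():
        p_ab = count / n
        m += p_ab * log2(p_ab / ((f_i[a] / n) * (f_j[b] / n)))
    return m

# Perfectly covarying Watson-Crick columns reach the 2-bit maximum.
print(mutual_information("GCAU", "CGUA"))
# Independent columns carry no information about each other.
print(mutual_information("GGCC", "AUAU"))
```

Note that a perfectly conserved base pair (for example a column of all G against a column of all C) also scores zero, because without variation there is no covariation to detect.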

Page 20: RNA structure analysis Jurgen Mourik & Richard Vogelaars Utrecht University.

RNA structure analysis20

Mutual information (2)

• Mij tells us how much information we get about the identity of the residue in one position if we are told the identity of the residue in the other position
  – If you know that i is a G, the uncertainty about j collapses from four different possibilities to just one (C): 2 bits of information
  – If i and j are uncorrelated, the mutual information is zero

Page 21: RNA structure analysis Jurgen Mourik & Richard Vogelaars Utrecht University.

RNA structure analysis21

RNA secondary structure prediction (1)

• Many plausible secondary structures can be drawn for a sequence

• But: the number of secondary structures increases exponentially with sequence length
  – An RNA of 200 bases has over 10^50 possible base-paired structures

• Goal: distinguish the biologically correct structure from all the incorrect structures.

Page 22: RNA structure analysis Jurgen Mourik & Richard Vogelaars Utrecht University.

RNA structure analysis22

RNA secondary structure prediction (2)

• We need:
  – A function that assigns the correct structure the highest score
  – An algorithm for evaluating the scores of all possible structures

• Two methods:
  – Nussinov folding algorithm
  – Zuker folding algorithm

Page 23: RNA structure analysis Jurgen Mourik & Richard Vogelaars Utrecht University.

Need a break?

Well here it is!

Page 24: RNA structure analysis Jurgen Mourik & Richard Vogelaars Utrecht University.

RNA structure analysis24

Nussinov folding algorithm (1)

• Goal: Find the structure with the most base pairs

• Nussinov introduced an efficient dynamic programming algorithm for this problem

• A recursive algorithm that
  – calculates the best structure for small subsequences and
  – works its way outwards to larger and larger subsequences

Page 25: RNA structure analysis Jurgen Mourik & Richard Vogelaars Utrecht University.

RNA structure analysis25

Nussinov folding algorithm (2)

• Key idea of recursion:
  – There are only four possible ways of getting the best structure for i,j from the best structure of the smaller subsequences

• Two stages:
  – Fill stage of the algorithm
  – Trace back stage of the algorithm

Page 26: RNA structure analysis Jurgen Mourik & Richard Vogelaars Utrecht University.

RNA structure analysis26

Nussinov folding algorithm (3)

• The four possible ways:

1. Add unpaired position i onto the best structure for subsequence i+1,j

2. Add unpaired position j onto the best structure for subsequence i,j-1

3. Add i,j pair onto best structure found for subsequence i+1,j-1

4. Combine two optimal substructures i,k and k+1,j

Page 27: RNA structure analysis Jurgen Mourik & Richard Vogelaars Utrecht University.

RNA structure analysis27

Nussinov folding algorithm (4)

• Formal description of the algorithm:
  – Given a sequence x of length L with symbols $x_1, \ldots, x_L$
  – Let $\delta(i,j) = 1$ if $x_i$ and $x_j$ are a complementary base pair, else $\delta(i,j) = 0$
  – Recursively calculate scores $\gamma(i,j)$, which are the maximum number of base pairs that can be formed for subsequence $x_i, \ldots, x_j$

Page 28: RNA structure analysis Jurgen Mourik & Richard Vogelaars Utrecht University.

RNA structure analysis28

Nussinov algorithm: fill stage

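– Initialisation:

$$ \gamma(i, i-1) = 0 \quad \text{for } i = 2 \text{ to } L; \qquad \gamma(i, i) = 0 \quad \text{for } i = 1 \text{ to } L. $$

– Recursion: starting with all subsequences of length 2, to length L:

$$ \gamma(i,j) = \max \begin{cases} \gamma(i+1, j), \\ \gamma(i, j-1), \\ \gamma(i+1, j-1) + \delta(i,j), \\ \max_{i<k<j} \left[ \gamma(i,k) + \gamma(k+1,j) \right]. \end{cases} $$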
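The fill stage can be sketched in a few lines of Python. This is an illustrative implementation under stated assumptions, not the presenters' code: indices are 0-based, only the Watson-Crick pairs A-U and G-C count as complementary, and no minimum hairpin loop length is enforced. The score matrix is called `gamma` in the code, and the function name `nussinov_fill` is hypothetical.

```python
def nussinov_fill(seq, pairs=frozenset({"AU", "UA", "GC", "CG"})):
    """Fill the Nussinov score matrix for seq (0-based indices).

    gamma[i][j] is the maximum number of base pairs that can be formed
    within seq[i..j]. Entries with j < i represent empty subsequences
    and are treated as 0.
    """
    L = len(seq)
    gamma = [[0] * L for _ in range(L)]

    def g(i, j):
        # Empty subsequence (j < i) scores 0, as in the initialisation.
        return gamma[i][j] if i <= j else 0

    # Work outwards from short subsequences to the full sequence.
    for span in range(1, L):
        for i in range(L - span):
            j = i + span
            delta = 1 if seq[i] + seq[j] in pairs else 0
            best = max(
                g(i + 1, j),              # position i unpaired
                g(i, j - 1),              # position j unpaired
                g(i + 1, j - 1) + delta,  # i and j paired
            )
            for k in range(i + 1, j):     # bifurcation into two substructures
                best = max(best, g(i, k) + g(k + 1, j))
            gamma[i][j] = best
    return gamma

gamma = nussinov_fill("GGGAAAUCC")
print(gamma[0][8])  # score for the whole example sequence
```

The three nested loops make the fill stage cubic in the sequence length, with quadratic memory for the matrix.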

Page 29: RNA structure analysis Jurgen Mourik & Richard Vogelaars Utrecht University.

RNA structure analysis29

Example sequence: GGGAAAUCC

Matrix after initialisation (rows i, columns j):

i\j    1  2  3  4  5  6  7  8  9
       G  G  G  A  A  A  U  C  C
1 G    0
2 G    0  0
3 G       0  0
4 A          0  0
5 A             0  0
6 A                0  0
7 U                   0  0
8 C                      0  0
9 C                         0  0

Page 30: RNA structure analysis Jurgen Mourik & Richard Vogelaars Utrecht University.

RNA structure analysis30

Example sequence: GGGAAAUCC

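Matrix after filling the subsequences of length 2 (rows i, columns j):

i\j    1  2  3  4  5  6  7  8  9
       G  G  G  A  A  A  U  C  C
1 G    0  0
2 G    0  0  0
3 G       0  0  0
4 A          0  0  0
5 A             0  0  0
6 A                0  0  1
7 U                   0  0  0
8 C                      0  0  0
9 C                         0  0

A*U = base pair:

$$ \gamma(i+1, j-1) + \delta(i,j) = \gamma(7,6) + \delta(6,7) = 0 + 1 = 1 $$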

Page 31: RNA structure analysis Jurgen Mourik & Richard Vogelaars Utrecht University.

RNA structure analysis31

Example sequence: GGGAAAUCC

Completed matrix (rows i, columns j):

i\j    1  2  3  4  5  6  7  8  9
       G  G  G  A  A  A  U  C  C
1 G    0  0  0  0  0  0  1  2  3
2 G    0  0  0  0  0  0  1  2  3
3 G       0  0  0  0  0  1  2  2
4 A          0  0  0  0  1  1  1
5 A             0  0  0  1  1  1
6 A                0  0  1  1  1
7 U                   0  0  0  0
8 C                      0  0  0
9 C                         0  0

The value in the top right corner, $\gamma(1,9) = 3$, gives the maximum number of base pairs.

Page 32: RNA structure analysis Jurgen Mourik & Richard Vogelaars Utrecht University.

RNA structure analysis32

Nussinov algorithm: traceback stage

• Initialisation: Push (1,L) onto the stack.

• Recursion: Repeat until stack is empty:

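  – pop (i,j).
  – if i >= j: continue;
  – else if $\gamma(i+1,j) = \gamma(i,j)$: push (i+1,j);
  – else if $\gamma(i,j-1) = \gamma(i,j)$: push (i,j-1);
  – else if $\gamma(i+1,j-1) + \delta(i,j) = \gamma(i,j)$:
      record i,j base pair;
      push (i+1,j-1).
  – else for k = i+1 to j-1: if $\gamma(i,k) + \gamma(k+1,j) = \gamma(i,j)$:
      push (k+1,j);
      push (i,k);
      break.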
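The traceback can be sketched as follows. To keep the example self-contained it repeats a compact fill stage; as before, this is an illustration under assumptions (Watson-Crick pairs only, no minimum loop length, hypothetical function name `nussinov_pairs`), not the presenters' implementation. Positions are 0-based internally and reported 1-based to match the slides.

```python
def nussinov_pairs(seq, pairs=frozenset({"AU", "UA", "GC", "CG"})):
    """Return one maximum set of nested base pairs as 1-based (i, j) tuples."""
    L = len(seq)
    gamma = [[0] * L for _ in range(L)]

    def g(i, j):
        # Out-of-range or empty subsequences score 0.
        return gamma[i][j] if 0 <= i <= j < L else 0

    def d(i, j):
        return 1 if seq[i] + seq[j] in pairs else 0

    # Fill stage, from short subsequences to the full sequence.
    for span in range(1, L):
        for i in range(L - span):
            j = i + span
            options = [g(i + 1, j), g(i, j - 1), g(i + 1, j - 1) + d(i, j)]
            options += [g(i, k) + g(k + 1, j) for k in range(i + 1, j)]
            gamma[i][j] = max(options)

    # Traceback stage, following the stack-based procedure.
    found = []
    stack = [(0, L - 1)]
    while stack:
        i, j = stack.pop()
        if i >= j:
            continue
        if g(i + 1, j) == g(i, j):
            stack.append((i + 1, j))
        elif g(i, j - 1) == g(i, j):
            stack.append((i, j - 1))
        elif g(i + 1, j - 1) + d(i, j) == g(i, j):
            found.append((i + 1, j + 1))  # record the pair, 1-based
            stack.append((i + 1, j - 1))
        else:
            for k in range(i + 1, j):
                if g(i, k) + g(k + 1, j) == g(i, j):
                    stack.append((k + 1, j))
                    stack.append((i, k))
                    break
    return sorted(found)

print(nussinov_pairs("GGGAAAUCC"))
```

Because the branches are tested in a fixed order, the traceback returns only one of possibly several structures that share the maximum score.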

Page 33: RNA structure analysis Jurgen Mourik & Richard Vogelaars Utrecht University.

RNA structure analysis33

Example sequence: GGGAAAUCC

i\j    1  2  3  4  5  6  7  8  9
       G  G  G  A  A  A  U  C  C
1 G    0  0  0  0  0  0  1  2  3
2 G    0  0  0  0  0  0  1  2  3
3 G       0  0  0  0  0  1  2  2
4 A          0  0  0  0  1  1  1
5 A             0  0  0  1  1  1
6 A                0  0  1  1  1
7 U                   0  0  0  0
8 C                      0  0  0
9 C                         0  0

Initialisation: Push (1,L)

Page 34: RNA structure analysis Jurgen Mourik & Richard Vogelaars Utrecht University.

RNA structure analysis34

Example sequence: GGGAAAUCC

i\j    1  2  3  4  5  6  7  8  9
       G  G  G  A  A  A  U  C  C
1 G    0  0  0  0  0  0  1  2  3
2 G    0  0  0  0  0  0  1  2  3
3 G       0  0  0  0  0  1  2  2
4 A          0  0  0  0  1  1  1
5 A             0  0  0  1  1  1
6 A                0  0  1  1  1
7 U                   0  0  0  0
8 C                      0  0  0
9 C                         0  0

Recursion: $\gamma(2,9) = \gamma(1,9)$, so the following branch applies:

else if $\gamma(i+1,j) = \gamma(i,j)$: push (i+1,j)

Page 35: RNA structure analysis Jurgen Mourik & Richard Vogelaars Utrecht University.

RNA structure analysis35

Example sequence: GGGAAAUCC

i\j    1  2  3  4  5  6  7  8  9
       G  G  G  A  A  A  U  C  C
1 G    0  0  0  0  0  0  1  2  3
2 G    0  0  0  0  0  0  1  2  3
3 G       0  0  0  0  0  1  2  2
4 A          0  0  0  0  1  1  1
5 A             0  0  0  1  1  1
6 A                0  0  1  1  1
7 U                   0  0  0  0
8 C                      0  0  0
9 C                         0  0

Page 36: RNA structure analysis Jurgen Mourik & Richard Vogelaars Utrecht University.

RNA structure analysis36

Example sequence: GGGAAAUCC

i\j    6  7  8  9
       A  U  C  C
1 G    0  1  2  3
2 G    0  1  2  3
3 G    0  1  2  2
4 A    0  1  1  1
5 A    0  1  1  1
6 A    0  1  1  1
7 U    0  0  0  0

[Figure: the structure recovered by the traceback, with base pairs G●C, G●C and A●U and the remaining bases unpaired.]

Page 37: RNA structure analysis Jurgen Mourik & Richard Vogelaars Utrecht University.

RNA structure analysis37

SCFG version of the Nussinov algorithm

• Stochastic Context-Free Grammars
  – Will be discussed next Wednesday

• Makes use of production rules:
  – S → aS | cS | gS | uS (i unpaired)

• Every production rule has an associated probability parameter.

• The maximum probability parse is equivalent to the maximum probability secondary structure.

Page 38: RNA structure analysis Jurgen Mourik & Richard Vogelaars Utrecht University.

RNA structure analysis38

Needed terminology

• The inside-outside (recursive dynamic programming) algorithm for SCFGs in Chomsky normal form is the natural counterpart of the forward-backward algorithm for HMMs.

• The best path variant of the inside-outside algorithm is the Cocke-Younger-Kasami (CYK) algorithm. It finds the maximum probability alignment of the SCFG to the sequence, just as the Viterbi algorithm does for HMMs.

Chomsky normal form: all context-free grammar production rules are of the form S → SS or S → a.

Page 39: RNA structure analysis Jurgen Mourik & Richard Vogelaars Utrecht University.

RNA structure analysis39

CYK for Nussinov-style RNA SCFG (1)

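• Initialisation:

$$ \gamma(i, i-1) = -\infty \quad \text{for } i = 2 \text{ to } L; \qquad \gamma(i,i) = \max \{ \log p(x_i S),\ \log p(S x_i) \} \quad \text{for } i = 1 \text{ to } L. $$

• Recursion:

$$ \gamma(i,j) = \max \begin{cases} \gamma(i+1, j) + \log p(x_i S); \\ \gamma(i, j-1) + \log p(S x_j); \\ \gamma(i+1, j-1) + \log p(x_i S x_j); \\ \max_{i<k<j} \left[ \gamma(i,k) + \gamma(k+1,j) + \log p(SS) \right]. \end{cases} $$

Addition to the fill stage of the Nussinov algorithm. The principal difference is that the SCFG description is a probabilistic model.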
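To make the probabilistic fill stage concrete, here is a minimal sketch. Everything about the parameters is an assumption made for illustration: the dictionaries `lp_left`, `lp_right`, `lp_pair` and the value `lp_bif` hold log probabilities for the productions that emit a base on the left, emit a base on the right, emit a base pair, and bifurcate, and the toy numbers are chosen only so that they sum to one; they are not fitted to any data.

```python
import math

def cyk_nussinov_scfg(seq, lp_left, lp_right, lp_pair, lp_bif):
    """CYK fill stage for a Nussinov-style SCFG (0-based indices).

    gamma[i][j] is the log probability of the best parse of seq[i..j].
    Pairs missing from lp_pair are treated as impossible (-infinity).
    """
    L = len(seq)
    neg_inf = -math.inf
    gamma = [[neg_inf] * L for _ in range(L)]

    def g(i, j):
        # Empty subsequences (j < i) are initialised to -infinity.
        return gamma[i][j] if i <= j else neg_inf

    for i in range(L):
        gamma[i][i] = max(lp_left[seq[i]], lp_right[seq[i]])

    for span in range(1, L):
        for i in range(L - span):
            j = i + span
            options = [
                g(i + 1, j) + lp_left[seq[i]],    # i unpaired
                g(i, j - 1) + lp_right[seq[j]],   # j unpaired
                g(i + 1, j - 1) + lp_pair.get((seq[i], seq[j]), neg_inf),
            ]
            options += [g(i, k) + g(k + 1, j) + lp_bif
                        for k in range(i + 1, j)]  # bifurcation
            gamma[i][j] = max(options)
    return gamma

# Toy parameters (hypothetical): 8 * 0.05 + 4 * 0.1 + 0.2 = 1.0
bases = "ACGU"
lp_left = {b: math.log(0.05) for b in bases}
lp_right = {b: math.log(0.05) for b in bases}
lp_pair = {p: math.log(0.1)
           for p in [("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")]}
lp_bif = math.log(0.2)

gamma = cyk_nussinov_scfg("GAC", lp_left, lp_right, lp_pair, lp_bif)
print(gamma[0][2])  # log probability of the best parse of "GAC"
```

With these toy values the best parse of "GAC" pairs G with C around an unpaired A, because one pair production is more probable than two unpaired productions.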

Page 40: RNA structure analysis Jurgen Mourik & Richard Vogelaars Utrecht University.

RNA structure analysis40

CYK for Nussinov-style RNA SCFG (2)

• The value $\gamma(1,L)$ is the log likelihood $\log P(x, \hat{\pi} \mid \theta)$ of the optimal structure $\hat{\pi}$ given the SCFG model $\theta$

• The traceback to find the secondary structure corresponding to the best score is performed analogously to the traceback in the Nussinov algorithm

Page 41: RNA structure analysis Jurgen Mourik & Richard Vogelaars Utrecht University.

RNA structure analysis41

CYK for Nussinov-style RNA SCFG (3)

• Good starting example (10.2), but it is too simple to be an accurate RNA folder

• The algorithm does not consider important structural features like preferences for certain:
  – Loop lengths
  – Nearest neighbours in the structure caused by stacking interactions between neighbouring base pairs in a stem.

Page 42: RNA structure analysis Jurgen Mourik & Richard Vogelaars Utrecht University.

RNA structure analysis42

Zuker folding algorithm (1)

• Most sophisticated secondary structure prediction method for single RNAs
  – An energy minimisation algorithm which assumes that the correct structure is the one with the lowest equilibrium free energy (ΔG)

• The ΔG of an RNA secondary structure is approximated as the sum of individual contributions from loops, base pairs and other secondary structure elements.

Page 43: RNA structure analysis Jurgen Mourik & Richard Vogelaars Utrecht University.

RNA structure analysis43

Zuker folding algorithm (2)

• Difference with the Nussinov folding algorithm:
  – Energies of stems are calculated by adding stacking contributions for the interface between neighbouring base pairs instead of individual contributions for each pair.

• Advantage:
  – Better fit to experimentally observed ΔG values for RNA structures, but it complicates the dynamic programming algorithm

Page 44: RNA structure analysis Jurgen Mourik & Richard Vogelaars Utrecht University.

RNA structure analysis44

Zuker folding algorithm (3)

Freier energy rules

• The energies in the tables are from the older ‘Freier rules’ at 37 °C.

• For more information see the article “Improved free-energy parameters for predictions of RNA duplex stability” by Freier et al.

Page 45: RNA structure analysis Jurgen Mourik & Richard Vogelaars Utrecht University.

RNA structure analysis45

Zuker folding algorithm (4)

Page 46: RNA structure analysis Jurgen Mourik & Richard Vogelaars Utrecht University.

RNA structure analysis46

Zuker folding algorithm (5)

Page 47: RNA structure analysis Jurgen Mourik & Richard Vogelaars Utrecht University.

RNA structure analysis47

Zuker folding algorithm (6)

• The minimum energy structure can be calculated recursively by a dynamic programming algorithm very similar to the way the Nussinov algorithm calculates the maximum base-paired structure.

• Now we keep two matrices
  – W(i,j) is the energy of the best structure on i,j
  – V(i,j) is the energy of the best structure on i,j given that i,j are paired.
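The slides do not spell out the recursions for the two matrices. One common textbook formulation is sketched below; the auxiliary symbols are assumptions about notation rather than something shown in the talk: eh(i,j) stands for the energy of a hairpin loop closed by the pair i,j, es(i,j) for the stacking energy of pair i,j on pair i+1,j-1, VBI(i,j) for the best bulge or interior loop closed by i,j, and VM(i,j) for the best multi-branched loop closed by i,j.

```latex
W(i,j) = \min \begin{cases}
  W(i+1, j), \\
  W(i, j-1), \\
  V(i,j), \\
  \min_{i<k<j} \left[ W(i,k) + W(k+1,j) \right]
\end{cases}
\qquad
V(i,j) = \min \begin{cases}
  eh(i,j), \\
  es(i,j) + V(i+1, j-1), \\
  VBI(i,j), \\
  VM(i,j)
\end{cases}
```

The structure of W mirrors the Nussinov recursion with max replaced by min, while V is what allows stacking energies to depend on neighbouring base pairs.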

Page 48: RNA structure analysis Jurgen Mourik & Richard Vogelaars Utrecht University.

RNA structure analysis48

Suboptimal RNA folding

(CYK algorithm will be explained next Wednesday)

• The original Zuker algorithm finds only the optimal structure.

• The biologically correct structure is often not the calculated optimal structure.

• Zuker introduced a suboptimal folding algorithm.
  – It is similar to running the CYK algorithm in both inside and outside directions.

• The algorithm samples one base pair suboptimally.

• The rest of the structure is the optimal structure given that base pair.

Page 49: RNA structure analysis Jurgen Mourik & Richard Vogelaars Utrecht University.

Demonstration

RNAstructure

By David H. Mathews, Michael Zuker and Douglas H. Turner

Page 50: RNA structure analysis Jurgen Mourik & Richard Vogelaars Utrecht University.

RNA structure analysis50

Demo: RNAstructure (1)

• The core of RNAstructure is a dynamic programming algorithm to predict RNA or DNA secondary structures from sequence based on the principle of minimizing free energy.

Page 51: RNA structure analysis Jurgen Mourik & Richard Vogelaars Utrecht University.

RNA structure analysis51

Demo: RNAstructure (2)

• The prediction of a secondary structure is based on the Zuker algorithm for free energy minimization using the nearest neighbour parameters of Doug Turner and co-workers.

• A recursive algorithm is used that generates an optimal structure and a series of structures that are called sub-optimal structures (structures with free energy similar to the lowest free energy structure).

Page 52: RNA structure analysis Jurgen Mourik & Richard Vogelaars Utrecht University.

RNA structure analysis52

Demo: RNAstructure (3)

• The number of sub-optimal structures generated is controlled by two parameters entered by the user:
  – Max % Energy Difference: Sets the percent difference from the lowest free energy allowed for the structures output. For example, if the lowest free energy structure is -100 kcal/mol and the Max % Energy Difference is 10, any structure with an energy of -90 kcal/mol or higher is rejected (higher means less negative).
  – Max number of structures: Sets an absolute upper limit on the number of structures that can be generated. A maximum of 1000 structures can be generated.
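The effect of these two parameters can be illustrated with a short sketch. This is not RNAstructure's code: the function `filter_suboptimal` and its argument names are hypothetical, and it simply applies the two rules described above to a list of free energies in kcal/mol.

```python
def filter_suboptimal(energies, max_percent_diff=10.0, max_structures=1000):
    """Keep the energies that pass both user-controlled limits.

    energies: free energies (kcal/mol) of candidate structures.
    A structure is rejected when its energy is at or above the cutoff,
    i.e. the lowest energy raised by max_percent_diff percent.
    """
    if not energies:
        return []
    lowest = min(energies)
    cutoff = lowest + abs(lowest) * max_percent_diff / 100.0
    kept = sorted(e for e in energies if e < cutoff)  # most stable first
    return kept[:max_structures]

# With a lowest energy of -100 and a 10 % limit the cutoff is -90,
# so -90 and -80 are rejected.
print(filter_suboptimal([-100, -95, -90, -80]))
```

Sorting before truncating means that, when the upper limit on the number of structures applies, the most stable candidates are the ones retained; that ordering is a design choice of this sketch.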

Page 53: RNA structure analysis Jurgen Mourik & Richard Vogelaars Utrecht University.

RNA structure analysis53

Demo: RNAstructure (4)

• A third parameter entered is Window Size. This controls how different the sub-optimal structures must be from each other. A small window size allows very similar structures to be generated, while a larger window size requires them to be more different.

Page 54: RNA structure analysis Jurgen Mourik & Richard Vogelaars Utrecht University.

Demonstration

Page 55: RNA structure analysis Jurgen Mourik & Richard Vogelaars Utrecht University.

Questions?