RNA Secondary Structure Prediction

48
RNA Secondary Structure Prediction Dong Xu Computer Science Department 271C Life Sciences Center 1201 East Rollins Road University of Missouri-Columbia Columbia, MO 65211-2060 E-mail: [email protected] 573-882-7064 (O) http://digbio.missouri.edu

description

RNA Secondary Structure Prediction. Dong Xu Computer Science Department 271C Life Sciences Center 1201 East Rollins Road University of Missouri-Columbia Columbia, MO 65211-2060 E-mail: [email protected] 573-882-7064 (O) http://digbio.missouri.edu. Final Report. Due on Dec. 8. - PowerPoint PPT Presentation

Transcript of RNA Secondary Structure Prediction

Page 1: RNA Secondary Structure Prediction

RNA Secondary Structure Prediction

Dong Xu

Computer Science Department271C Life Sciences Center

1201 East Rollins RoadUniversity of Missouri-Columbia

Columbia, MO 65211-2060E-mail: [email protected]

573-882-7064 (O)http://digbio.missouri.edu

Page 2: RNA Secondary Structure Prediction

Final Report

Due on Dec. 8. A numerical score (0-15) will be assigned for

based onClear formulation of the project (2)

Method (4)

Significant results achieved (4)

Discussion (3)

Writing of the project (2)

Page 3: RNA Secondary Structure Prediction

Final Presentation

Preferably shown in powerpoint file, pdf is fine

Preferably 20 minutes (up to 25 min), plus 5min for questions

15 points for the presentation (introduction, methods, results, discussions)

15 points for software demo Implementation of the software Major functionalities Documentation Perform a test run

Page 4: RNA Secondary Structure Prediction

Presentation Evaluation

A numerical score (0-15) will be assigned based onDid the student put enough effort? (3)

Is the work interesting or novel? (3)

Is the method technically sound? (3)

Is the discussion insightful? (3)

Is the presentation clear? (3)

Page 5: RNA Secondary Structure Prediction

Software Demo

A numerical score (0-15) will be assigned based onWhether the program can run using actual

biological data (3)

Documentations (3)

Whether it is easy to use (3)

Performance in accuracy (3)

Performance in computing time and memory usage (3)

Page 6: RNA Secondary Structure Prediction

Outline

RNA Secondary Structure

Comparative Approach

Base-Pair Maximization

Free Energy Minimization

Local Structure Prediction

Page 7: RNA Secondary Structure Prediction

RNA Types

siRNA, short interfering RNA; miRNA, microRNA; small temporal RNA stRNA; snoRNA small nucleolar RNA ;snRNA: Small nuclear RNA.

Page 8: RNA Secondary Structure Prediction

Features of RNA

RNA: polymer composed of a combination of four nucleotidesadenine (A)

cytosine (C)

guanine (G)

uracil (U)

Page 9: RNA Secondary Structure Prediction

Features of RNA

G-C and A-U form complementary hydrogen bonded base pairs (canonical Watson-Crick)

G-C base pairs being more stable (3 hydrogen bonds) A-U base pairs less stable (2 bonds)

non-canonical pairs can occur in RNA -- most common is G-U

Page 10: RNA Secondary Structure Prediction

RNA Pairs

A-U G-C

G-U

Page 11: RNA Secondary Structure Prediction

5’ ACCACCUGCUGA 3’

Primary structure:

Secondary Structure

Tertiary structure:

RNA Structure Hierarchy

Page 12: RNA Secondary Structure Prediction

Secondary Structure Categories

Pseudoknots

Hairpin loop

Internal loop

Bulge loop

Hairpin loop

Stem

Internal loop

Bulge loop

Stem

Page 13: RNA Secondary Structure Prediction

tRNA structure

Page 14: RNA Secondary Structure Prediction

Circular Representation

Page 15: RNA Secondary Structure Prediction

Assumptions in Secondary Structure Prediction

Most likely structure similar to energetically most stable structure

Energy associated with any position is only influenced by local sequence and structure

Structure formed does not produce pseudoknots

Page 16: RNA Secondary Structure Prediction

Pseudoknot Kissing hairpins Hairpin-bulge

Do not obey “parentheses rule”

Exceptions

Page 17: RNA Secondary Structure Prediction

Outline

RNA Secondary Structure

Comparative Approach

Base-Pair Maximization

Free Energy Minimization

Local Structure Prediction

Page 18: RNA Secondary Structure Prediction

Inferring Structure By Comparative Sequence Analysis

First step is to calculate a multiple sequence alignment

Requires sequences be similar enough so that they can be initially aligned

Sequences should be dissimilar enough for correlated mutation to be detected

Page 19: RNA Secondary Structure Prediction

Mutual Information

fxi : frequency of a base in column i

fxixj : joint (pairwise) frequency of a base pair between columns i and j

Information ranges from 0 and 2 bits

If i and j are uncorrelated, mutual information is 0

ji ji

ji

jixx xx

xx

xxij ff

ffM

,2log

Page 20: RNA Secondary Structure Prediction

Mutual Information Plot

Page 21: RNA Secondary Structure Prediction

Mutual Information Plot

Page 22: RNA Secondary Structure Prediction

Outline

RNA Secondary Structure

Comparative Approach

Base-Pair Maximization

Free Energy Minimization

Local Structure Prediction

Page 23: RNA Secondary Structure Prediction

Dot Plot

Page 24: RNA Secondary Structure Prediction

Base-Pair Maximization

Find structure with the most base pairs

Efficient dynamic programming approach to this problem introduced by Nussinov (1970s).

Four ways to get the best structure between position i and j from the best structures of the smaller subsequences

Page 25: RNA Secondary Structure Prediction

Nussinov Algorithm

1)      Add i,j pair onto best structure found for subsequence i+1, j-1

2)      add unpaired position i onto best structure for subsequence i+1, j

3)      add unpaired position j onto best structure for subsequence i, j-1

4)      combine two optimal structures i,k and k+1, j

Page 26: RNA Secondary Structure Prediction

Notation:

e(ri,rj) : free energy of a base pair joining ri

and rj

S(i,j) : optimal free energy associated with segment ri…rj

Dynamic Programming - 1

Page 27: RNA Secondary Structure Prediction

1. i is unpaired, added on toa structure for i+1…j

S(i,j) = S(i+1,j)

2. j is unpaired, added on toa structure for i…j-1

S(i,j) = S(i,j-1)

Dynamic Programming - 2

Page 28: RNA Secondary Structure Prediction

3. i j paired, added on toa structure for i+1…j-1S(i,j) = S(i+1,j-1)+e(ri,rj)

4. i j paired, but not to each other;the structure for i…j adds together

structures for 2 sub regions, i…k and k+1…j

S(i,j) = max {S(i,k)+S(k+1,j)}i<k<j

Dynamic Programming - 3

Page 29: RNA Secondary Structure Prediction

Since there are only four cases, the optimal score S(i,j) is just the maximum of the four possibilities:

othereachtonotbutpairedjijkSkiS

pairbasejirrejiS

unpairedrjiS

unpairedrjiS

jiS

jki

ji

j

i

,,),1(),(max

,),()1,1(

)1,(

),1(

max),(

Dynamic Programming - 4

Page 30: RNA Secondary Structure Prediction

Initialisation:

No close basepairs

A U A C C C U G U G G U A U

A 0 0 0 0

U 0 0 0 0

A 0 0 0 0

C 0 0 0 0

C 0 0 0 0

C 0 0 0 0

U 0 0 0 0

G 0 0 0 0

U 0 0 0 0

G 0 0 0 0

G 0 0 0 0

U 0 0 0

A 0 0

U 0

i

j

Page 31: RNA Secondary Structure Prediction

Propagation:A U A C C C U G U G G U A U

A 0 0 0 0 0

U 0 0 0 0 0

A 0 0 0 0 2

C 0 0 0 0 3

C 0 0 0 0 0

C 0 0 0 0 3

U 0 0 0 0 1

G 0 0 0 0 1

U 0 0 0 0 2

G 0 0 0 0 1

G 0 0 0 0

U 0 0 0

A 0 0

U 0

j

C5….U9 :

C5 unpaired:S(6,9) = 0

U10 unpaired:S(5,8)=0

C5-U10 pairedS(6,8) +e(C,U)=0

C5 paired, U10 paired:S(5,6)+S(7,9)=0S(5,7)+S(8,9)=0

Page 32: RNA Secondary Structure Prediction

Propagation:

A U A C C C U G U G G U A U

A 0 0 0 0 0 0 2

U 0 0 0 0 0 2 3

A 0 0 0 0 2 3 5

C 0 0 0 0 3 3 3

C 0 0 0 0 0 3 6

C 0 0 0 0 3 3

U 0 0 0 0 1 1

G 0 0 0 0 1 2

U 0 0 0 0 2 2

G 0 0 0 0 1

G 0 0 0 0

U 0 0 0

A 0 0

U 0

j

C5….G11 :

C5 unpaired:S(6,11) = 3

G11 unpaired:S(5,10)=3

C5-G11 pairedS(6,10)+e(C,G)=6

C5 paired, G11 paired:S(5,6)+S(7,11)=1S(5,7)+S(8,11)=0S(5,8)+S(9,11)=0S(5,9)+S(10,11)=0

Page 33: RNA Secondary Structure Prediction

Propagation:

A U A C C C U G U G G U A U

A 0 0 0 0 0 0 2 3 5 6 6 8 10 12

U 0 0 0 0 0 2 3 5 6 6 8 10 10

A 0 0 0 0 2 3 5 5 6 8 8 8

C 0 0 0 0 3 3 3 6 6 6 6

C 0 0 0 0 0 3 6 6 6 6

C 0 0 0 0 3 3 3 3 3

U 0 0 0 0 1 1 3 3

G 0 0 0 0 1 2 2

U 0 0 0 0 2 2

G 0 0 0 0 1

G 0 0 0 0

U 0 0 0

A 0 0

U 0

i

j

Page 34: RNA Secondary Structure Prediction

Traceback:A U A C C C U G U G G U A U

A 0 0 0 0 0 0 2 3 5 6 6 8 10 12

U 0 0 0 0 0 2 3 5 6 6 8 10 10

A 0 0 0 0 2 3 5 5 6 8 8 8

C 0 0 0 0 3 3 3 6 6 6 6

C 0 0 0 0 0 3 6 6 6 6

C 0 0 0 0 3 3 3 3 3

U 0 0 0 0 1 1 3 3

G 0 0 0 0 1 2 2

U 0 0 0 0 2 2

G 0 0 0 0 1

G 0 0 0 0

U 0 0 0

A 0 0

U 0

i

j

Page 35: RNA Secondary Structure Prediction

A U

U A

A U

C G

C G

C U

U G

AUACCCUGUGGUAU

Total free energy: -12 kcal/mol

Final Prediction

Page 36: RNA Secondary Structure Prediction

Computational complexity: N3

Does not work with pseudo-knot (would invalidate DP algorithm)

Methods that include pseudo knots:

Rivas and Eddy, JMB 285, 2053 (1999)

These methods are at least N6

Some Notes

Page 37: RNA Secondary Structure Prediction

Outline

RNA Secondary Structure

Comparative Approach

Base-Pair Maximization

Free Energy Minimization

Local Structure Prediction

Page 38: RNA Secondary Structure Prediction

Energy Minimization Methods

RNA folding is determined by biophysical properties

Energy minimization algorithm predicts the correct secondary structure by minimizing the free energy (G)

G calculated as sum of individual contributions of: loops base pairs secondary structure elements

Energies of stems calculated as stacking contributions between neighboring base pairs

Page 39: RNA Secondary Structure Prediction

Example for Thermodynamic Parameters

Page 40: RNA Secondary Structure Prediction

Calculating Best Structure

sequence is compared against itself using a dynamic programming approach similar to the maximum base-paired structure

instead of using a scoring scheme, the score is based upon the free energy values

Gaps represent some form of a loop

The most widely used software that incorporates this minimum free energy algorithm is MFOLD.

Page 41: RNA Secondary Structure Prediction

How well do they perform?

Current RNA folding programs get about 60-70% of base pairs correct, on average: useful, but not yet good.

The problem is the scoring system: thermodynamic model is accurate within 5-10%, and many alternative structures are within 10%.

Possible solution: combination of thermodynamic score with comparative sequence information

Page 42: RNA Secondary Structure Prediction

Outline

RNA Secondary Structure

Comparative Approach

Base-Pair Maximization

Free Energy Minimization

Local Structure Prediction

Page 43: RNA Secondary Structure Prediction

RNA Motif in HIV

TAR motif:

Transactivating Response Element

Page 44: RNA Secondary Structure Prediction

RNA Motifs Associated with Transcription termination

Rho-independent terminator stop the transcription process via its hairpin structure

Page 45: RNA Secondary Structure Prediction

Algorithm in Rnall

Definition 1. A “match” : canonical base pairs Definition 2. A “mismatch”: non-canonical base pair Definition 3. An “insertion”/“deletion”: nucleotide unpaired

Page 46: RNA Secondary Structure Prediction

RNA LSS in HIVTAR (30)

DIS (260)

PolyA (82)SD (292)

PSI (319)

Page 47: RNA Secondary Structure Prediction

Comparative RNA web sitehttp://www.rna.icmb.utexas.edu/

RNA worldhttp://www.imb-jena.de/RNA.html

RNA page by Michael Sukerhttp://www.bioinfo.rpi.edu/~zukerm/rna/

RNA structure database http://www.rnabase.org/

http://ndbserver.rutgers.edu/ (nucleic acid database)http://prion.bchs.uh.edu/bp_type/ (non canonical bases)

RNA structure classificationhttp://scor.berkeley.edu/

RNA visualisationhttp://ndbserver.rutgers.edu/services/download/index.html#rnaview http://rutchem.rutgers.edu/~xiangjun/3DNA/

Some RNA Resource

Page 48: RNA Secondary Structure Prediction

Reading Assignments

Suggested reading: Chapter 14 in “Current Topics in

Computational Molecular Biology, edited by Tao Jiang, Ying Xu, and Michael Zhang. MIT Press. 2002.”

Optional reading: http://www.bioinfo.rpi.edu

/~zukerm/seqanal/mfold-3.0-manual.pdf