RNA Structure Prediction & Database Search

Marcel Turcotte

School of Electrical Engineering and Computer Science
University of Ottawa

Version of October 21, 2013
www.eecs.uottawa.ca/~turcotte/teaching/bnf-5106/

Stanford’s Nobel chemistry prize honors computer science

Martin Karplus (Strasbourg & Harvard)
Michael Levitt (Stanford)
Arieh Warshel (Southern California)

Lisa M. Krieger, San Jose Mercury News, 2013-10-09

About Me

- 1989-95: Ph.D., Université de Montréal. Lapalme (CS), Cedergren (Biochemistry). RNA Tertiary Structure Prediction (MC-Sym)
- 1995-97: Postdoc, University of Florida. Benner (Chemistry). Protein Secondary Structure Prediction
- 1997-2000: Postdoc, Imperial Cancer Research Fund, UK. Sternberg (Biomolecular Modelling). Protein Tertiary Structure Motifs Discovery & ML
- 2000-: Associate Professor, University of Ottawa (EECS). RNA Secondary Structure Prediction, Motif Inference and Pattern Matching (eXtended-Dynalign, Profile-Dynalign, Seed, RiboFSM), ACSEA, ModuleInducer

Take home message

- With RNAs, base-pair patterns are better conserved than sequence
- Consequently, traditional bioinformatics tools are generally not well adapted to RNA research

1 Preamble: About Me; Take home message; Outline
2 Introduction: Key Discoveries; RNA Continent; Challenges for Traditional Bioinformatics Tools
3 Inference: Definitions; Comparative Sequence Analysis; Thermodynamics; Consensus
4 Search: First Generation (RNAMOT, RNABOB); Second Generation (RNAMOTIF); Third Generation (RSearch, INFERNAL)

RNA Catalyses Reactions

- Discovery of ribozymes (RNA enzymes), early 1980s
- The Nobel Prize in Chemistry 1989: Thomas R. Cech (Colorado) and Sidney Altman (Yale)
- Sidney Altman: born in Montreal; guest member, uOttawa OISB

RNA World

- 1986: RNA World hypothesis
- RNA has the ability to store information, as DNA does
- RNA has the ability to catalyze reactions, as proteins do
- RNA is an ideal candidate for an earlier simple form of life

Walter Gilbert (Nobel Prize in Chemistry 1980)

Small RNAs as 2002 Science Breakthrough

“Researchers are discovering that small RNA molecules play a surprising variety of key roles in cells. They can inhibit translation of messenger RNA into protein, cause degradation of other messenger RNAs, and even initiate complete silencing of gene expression from the genome.”

RNA Controls Gene Expression

- The Nobel Prize in Physiology or Medicine 2006
- RNA interference: gene silencing by double-stranded RNA
- Another key function previously associated with proteins

Andrew Z. Fire (Stanford)
Craig C. Mello (Massachusetts Medical School)

The Nobel Prize in Chemistry 2009

“for studies of the structure and function of the ribosome”

Venkatraman Ramakrishnan, MRC Laboratory of Molecular Biology, Cambridge, United Kingdom
Thomas A. Steitz, Yale University, Howard Hughes Medical Institute, New Haven, CT, USA
Ada E. Yonath, Weizmann Institute of Science, Rehovot, Israel

Non-Protein-Coding RNA (ncRNA) in Protein Synthesis

mRNA: messenger RNAs carry genetic information

tRNA: transfer RNAs are adapter molecules that recognize mRNA codons and carry a specific amino acid

rRNA: ribosomal RNAs account for 2/3 of the molecular mass of the ribosome, which is a large RNA/protein complex responsible for translating genomic information (stored in mRNAs) into proteins

Non-coding RNAs

miRNA: microRNAs modulate development in C. elegans, Drosophila, and mammals (~20 nt)

snRNA: small nuclear RNAs are involved in the splicing of eukaryotic mRNAs (~200 nt)

snoRNA: small nucleolar RNAs direct nucleotide modifications in rRNAs (~100 nt)

gRNA: guide RNAs play an important role in the editing of certain mRNAs in trypanosomes (~70 nt)

Non-Coding RNAs (contd)

tmRNA: transfer-messenger RNAs combine features of tRNAs and mRNAs and play a role in translation regulation in bacteria (~400 nt)

SRP: the signal recognition particle, an RNA-protein complex, directs newly synthesized proteins through the endoplasmic reticulum

M1 RNA: the catalytic part of ribonuclease P in bacteria, involved in the maturation of pre-tRNA (~375 nt)

TERC: telomerase RNA is an integral part of the telomerase enzyme and serves as a template for the synthesis of telomeres (~450 nt)

...

Rfam database

- Rfam 11.0 (August 2012) contains 2,208 RNA sequence families
- For each family:
  - Multiple sequence alignment (seed, full)
  - Consensus secondary structure (from literature or predicted)
  - Covariance model
- rfam.sanger.ac.uk and rfam.janelia.org
- Burge, S.W. et al. (2013) Rfam 11.0: 10 years of RNA families. Nucleic Acids Res, 41, D226-D232, 10.1093/nar/gks1005.

ENCODE

- “The human genome is pervasively transcribed, such that the majority of its bases are associated with at least one primary transcript (...)”

Birney et al. Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature (2007) vol. 447 (7146) pp. 799-816

How Many Non-Coding RNAs?

- 48,479 candidates in the human genome (EvoFold). Pedersen et al. Identification and classification of conserved RNA secondary structures in the human genome. PLoS Comput Biol (2006) vol. 2 (4) pp. e33
- Studies based on the ENCODE data set:
  - 3,267 (RNAz), 3,134 (EvoFold). Washietl et al. Structured RNAs in the ENCODE selected regions of the human genome. Genome Res (2007) vol. 17 (6) pp. 852-64
  - 4,933 (CMfinder). Torarinsson et al. Comparative genomics beyond sequence-based alignments: RNA structures in the ENCODE regions. Genome Res (2008) vol. 18 (2) pp. 242-51

Protein versus ncRNA annotations

- Bussotti, G. et al. (2013) Detecting and comparing non-coding RNAs in the high-throughput era. Int J Mol Sci, 14, 15423-15458.

RNA Elements, Regulation, Riboswitches, Localisation...

- Jacobs, E. et al. (2012) The Role of RNA Structure in Posttranscriptional Regulation of Gene Expression. J Genet Genomics, 39, 535-543.

Action Mechanisms

- Direct base-pairing with an RNA or DNA target: snoRNAs, miRNAs
- Mimicking the structure of other nucleic acids (or proteins?): tmRNA, some snRNAs, IRES
- Catalysis: RNase P

John S. Mattick

- 234 publications, 6837 citations
- h-index = 62
- Garvan Institute of Medical Research, Division of Neuroscience, Sydney, Australia

- Smith et al. (2013) Widespread purifying selection on RNA structure in mammals. Nucleic Acids Res, 41, 8220-8236.
- Amaral, P.P. et al. (2008) The eukaryotic genome as an RNA machine. Science, 319, 1787-1789.
- Mattick et al. Non-coding RNA. Hum Mol Genet (2006) vol. 15 Spec No 1 pp. R17-29
- Carninci et al. The transcriptional landscape of the mammalian genome. Science (2005) vol. 309 (5740) pp. 1559-63
- Mattick. RNA regulation: a new genetics? Nat Rev Genet (2004) vol. 5 (4) pp. 316-23
- Mattick. Challenging the dogma: the hidden layer of non-protein-coding RNAs in complex organisms. Bioessays (2003) vol. 25 (10) pp. 930-9
- Mattick et al. The evolution of controlled multitasked gene networks: the role of introns and other noncoding RNAs in the development of complex organisms. Mol Biol Evol (2001) vol. 18 (9) pp. 1611-30

Fascinating RNAs

- Versatile molecules that can carry information, as DNA does, and perform catalytic functions, as proteins do
- Seem to be governed by simpler laws; as a result, RNA analysis is a big bioinformatics success (see Gutell’s work on predicting secondary and tertiary interactions, and Major’s work on predicting tertiary structure)

Database Search Problem

Find all GenBank genes that are similar to Clostridium botulinum’s toxin

>gi|27867582 (fragment of the known Clostridium botulinum toxin gene)

GTGAATCAGCACCTGGACTTTCAGATGAAAAATTAAATTTAACTATCCAAAATGATGCTTATATACCAAAATATGATTCTAATGGAACAA

GTGATATAGAACAACATGATGTTAATGAACTTAATGTATTTTTCTATTTAGATGCACAGAAAGTGCCCGAAGGTGAAAATAATGTCAATC

TCACCTCTTCAATTGATACAGCATTATTAGAACAACCTAAAATATATACATTTTTTTCATCAGAATTTATTAATAATGTCAATAAACCTG

TGCAAGCAGC

Database Search Problem

>gi|49138|emb|X68262.1|CBBONTF C.barati gene for type F neurotoxin

Length=4073 Score = 81.8 bits (41), Expect = 1e-12

Identities = 99/121 (81.82%), Gaps = 2/121 (1.65%)

Strand=Plus/Plus

Query 48 CAAAATGATGCTTATATACCAAAATATGATTCTAATGGAACAAGTGATATAGAACAACAT 107

||||||||| |||| | |||||||||||||||||||| |||||||| ||| | || ||

Sbjct 1712 CAAAATGATTCTTACGTTCCAAAATATGATTCTAATGGTACAAGTGAAATAAA-GAATAT 1771

Query 108 GATGTTAATGAACTTAATGTATTTTTCTATTTAGATGCACAGAAAGTGCC-GAAGGTGAA 167

|||| || |||| |||||||||||||||||| ||||||| |||| || |||||||||

Sbjct 1772 ACTGTTGATAAACTAAATGTATTTTTCTATTTATATGCACAAAAAGCTCCTGAAGGTGAA 1831

Query 168 A 168

|

Sbjct 1832 A 1832

...

How does it work?

Pairwise Sequence Alignment

An optimal pairwise alignment is obtained by extending one of the following:

- An optimal alignment with one more residue from each sequence (match or mismatch)
- An optimal alignment with one residue from the first sequence and a gap symbol (deletion)
- An optimal alignment with one residue from the second sequence and a gap symbol (insertion)
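The three cases can be written as a single recurrence. The notation here is an assumption made for illustration and does not appear on the slide: $A(i,j)$ is the score of an optimal alignment of the first $i$ residues of $x$ with the first $j$ residues of $y$, $s(a,b)$ is a substitution score, and $g \ge 0$ is a gap penalty.

```latex
A(i,j) = \max
\begin{cases}
A(i-1,\,j-1) + s(x_i, y_j) & \text{(match or mismatch)}\\
A(i-1,\,j) - g             & \text{(deletion)}\\
A(i,\,j-1) - g             & \text{(insertion)}
\end{cases}
\qquad
A(0,0) = 0,\quad A(i,0) = -i\,g,\quad A(0,j) = -j\,g
```

Each cell depends only on three previously computed cells, which is what makes an exact dynamic programming solution possible.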

Pairwise Sequence Alignment

aln( ATATAGAACAAC, AATAAAGGAAT ) is the maximum of:

- aln( ATATAGAACAA, AATAAAGGAA ) + substitute( C, T )

  ATATAGAACAA C
  AATAAAGGAA  T

- aln( ATATAGAACAA, AATAAAGGAAT ) + delete( C )

  ATATAGAACAA C
  AATAAAGGAAT -

- aln( ATATAGAACAAC, AATAAAGGAA ) + insert( T )

  ATATAGAACAAC -
  AATAAAGGAA   T
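The decomposition above can be sketched in Python. This is a minimal sketch, not the lecture's code: the function name `global_alignment_score` and the scores (match +1, mismatch -1, gap -1) are assumptions chosen for illustration, and the table is filled in the style of Needleman-Wunsch global alignment.

```python
def global_alignment_score(x: str, y: str,
                           match: int = 1,
                           mismatch: int = -1,
                           gap: int = -1) -> int:
    """Score of an optimal global alignment of x and y.

    The scoring values are illustrative assumptions, not taken from the slides.
    """
    n, m = len(x), len(y)
    # score[i][j] = best score for aligning x[:i] with y[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap          # x[:i] aligned against gaps only
    for j in range(1, m + 1):
        score[0][j] = j * gap          # y[:j] aligned against gaps only
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Case 1: one more residue from each sequence (match or mismatch)
            substitute = score[i - 1][j - 1] + (
                match if x[i - 1] == y[j - 1] else mismatch)
            # Case 2: residue from the first sequence against a gap (deletion)
            delete = score[i - 1][j] + gap
            # Case 3: residue from the second sequence against a gap (insertion)
            insert = score[i][j - 1] + gap
            score[i][j] = max(substitute, delete, insert)
    return score[n][m]


if __name__ == "__main__":
    print(global_alignment_score("ATATAGAACAAC", "AATAAAGGAAT"))
```

Filling the table takes time proportional to `len(x) * len(y)`, since every cell is computed once from three neighbours.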

Pairwise Sequence Alignment

- Positions along the sequence are assumed to be independent and identically distributed (i.i.d.)
- Independence is necessary for the development of efficient exact (Smith-Waterman) or heuristic (such as BLAST) algorithms
- The execution time of the exact algorithms grows proportionally to the product of the size of the database and the size of the input sequence

RNA Sequence Alignment (Toy Example)

1 GUCGAGAGAC
  |||||
2 GUCGAAGCUG
       |||||
3 CAGAGAGCUG

Sequences 1 and 2 are 50% identical (as are 2 and 3); however, 1 and 3 don’t seem to have anything in common
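The identity figures for the toy example can be checked with a few lines of Python. This is a minimal sketch; the helper name `percent_identity` is hypothetical, and it assumes the sequences are already aligned position by position without gaps, as they are in the toy example.

```python
def percent_identity(a: str, b: str) -> float:
    """Percentage of positions at which two equal-length sequences agree."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(1 for p, q in zip(a, b) if p == q)
    return 100.0 * matches / len(a)


# The three toy sequences, numbered as in the example
seqs = {1: "GUCGAGAGAC", 2: "GUCGAAGCUG", 3: "CAGAGAGCUG"}

if __name__ == "__main__":
    for i, j in [(1, 2), (2, 3), (1, 3)]:
        print(i, j, percent_identity(seqs[i], seqs[j]))
```

Sequences 1 and 3 do not share a single position, so a purely sequence-based comparison gives no hint that they are related.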

RNA Sequence Alignment (Toy Example)

  G A          A A          A G
 A   G        G   G        G   A
  G-C          C C          C-G
  A-U          U U          U-A
  C-G          G G          G-C

CAGAGAGCUG   GUCGAAGCUG   GUCGAGAGAC
    3            2            1

Yes, but sequences 1 and 3 share the same secondary structure!
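The structural observation can also be checked programmatically. This is a minimal sketch under stated assumptions: the function name `forms_hairpin` is hypothetical, the stem is taken to be the first three bases paired with the last three (as drawn), and only Watson-Crick pairs are accepted (G-U wobble pairs are deliberately left out).

```python
# Watson-Crick pairs only; G-U wobble pairs are excluded in this sketch
WATSON_CRICK = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}


def forms_hairpin(seq: str, stem_length: int = 3) -> bool:
    """True if the first stem_length bases pair with the last stem_length bases.

    Base k (from the 5' end) is paired with base k from the 3' end,
    which is the antiparallel arrangement of a simple hairpin stem.
    """
    n = len(seq)
    if n < 2 * stem_length:
        return False
    return all((seq[k], seq[n - 1 - k]) in WATSON_CRICK
               for k in range(stem_length))


if __name__ == "__main__":
    for name, seq in [(1, "GUCGAGAGAC"), (2, "GUCGAAGCUG"), (3, "CAGAGAGCUG")]:
        print(name, seq, forms_hairpin(seq))
```

Sequences 1 and 3 both satisfy the pairing test although they have no position in common, whereas sequence 2 does not.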

Caveat

- RNAs conserve secondary structure interactions more than they conserve their sequence
- Traditional bioinformatics tools, assuming that positions are independent, perform poorly

Stems, hairpins, bulges, interior and multi-branch loops

[Figure: secondary structure drawing of sequence K02682 (dG = 0.0 [initially 0.0]), positions numbered 20 to 120. Drawn with plt22ps by D. Stewart and M. Zuker, © 2004 Washington University.]

Secondary Structure Prediction

- X-ray crystallography and NMR
- Chemical and enzymatic probing, cross-linking
- Comparative sequence analysis
- Minimum free energy (MFE) methods
- Consensus (comparative sequence analysis + MFE)

Given an RNA sequence $S = s_1, s_2, \ldots, s_n$, where $s_i$ is the $i$th nucleotide, a secondary structure is an ordered list of pairs $i.j$, $1 \le i < j \le n$, such that:

I $j - i \ge c$, where $c = 4$ for instance

I Given $i.j$ and $i'.j'$, two base pairs, then either:
    I $i = i'$ and $j = j'$ (they are the same)
    I $i < j < i' < j'$ ($i.j$ precedes $i'.j'$)
    I $i < i' < j' < j$ ($i.j$ includes $i'.j'$)
    I $i < i' < j < j'$ (pseudoknot)
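The four cases in the definition can be checked mechanically. The following is a minimal sketch, not taken from the slides: the function names `relation` and `satisfies_min_distance` and the extra label "conflict" (two pairs sharing a nucleotide, which the four cases do not cover) are assumptions made for illustration.

```python
def relation(p, q):
    """Classify two base pairs p = (i, j) and q = (i', j') with i < j and i' < j'."""
    # Sort so that the pair with the smaller 5' index comes first.
    (i, j), (ip, jp) = sorted([p, q])
    if (i, j) == (ip, jp):
        return "same"
    if j < ip:
        return "precedes"      # i < j < i' < j'
    if i < ip and jp < j:
        return "includes"      # i < i' < j' < j
    if i < ip < j < jp:
        return "pseudoknot"    # i < i' < j < j'
    # Assumption: pairs that share a position fall outside the four cases.
    return "conflict"


def satisfies_min_distance(pair, c=4):
    """Check the constraint j - i >= c for a single pair."""
    i, j = pair
    return j - i >= c


print(relation((1, 10), (2, 9)))   # includes
print(relation((1, 6), (3, 9)))    # pseudoknot
```

The order of the two arguments does not matter, because the pairs are sorted before comparison.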


$i < j < i' < j'$ ($i.j$ precedes $i'.j'$)


$i < i' < j' < j$ ($i.j$ includes $i'.j'$)


Problems

I Reporting sub-optimal structures (MFOLD, SFOLD)

I Partition function and McCaskill's calculation of the base-pair probabilities $P_{ij}$

I Folding kinetics, identifying riboswitches

I MFE for secondary structure for interacting RNA molecules

I Partition function for secondary structure for interacting RNA molecules

I Non-protein-coding gene identification (EvoFold, RNAz, ...)


[Figure: five homologous RNA sequences labelled human, mouse, fly, worm and orc; the pairing of labels with sequences did not survive extraction.]

AAGACUUCGGAUCUGGCGACACCC
ACACUUCGGAUGACACCAAAGUG
AAGCCUUCGGAGCGGGCGUAACU
AGGUCUUCGGCACGGGCACCAUUC
CAACUUCGGAUUUUGCUACCAUA

“Today, comparative analysis has become the method of choice for establishing higher-order structure for large RNA.” Pace, Thomas, Woese (1999) In The RNA World. Cold Spring Harbor.


[Figure: the same five sequences (human, mouse, fly, worm, orc), shown with a secondary-structure drawing labelled 5′ and 3′; the drawing did not survive extraction.]


Saccharomyces cerevisiae   ...CCAGACUGAAGAUCUGG...
Spiroplasma melliferum     CCUGCCUUGCACGCAGG
Mycoplasma capricolum      CCUCCCUGUCACGGAGG
Mycoplasma mycoides        CACGGUUUUCAUCCGUG
Spiroplasma melliferum     UUUGAUUGAAGCUCAAA
Streptomyces lividans      ACGGCCUGCAAAGCCGU

[Figure: alignment columns marked at positions 30, 35 and 40, with a plot over positions 25 to 45 on both axes and a scale from 0.0 to 0.8.]


I Starts with the alignment of a set of homologous sequences

I Detecting correlated pairs of sites
    I Parallel chords imply helices (stems)
    I Others are tertiary structure interactions


Detecting Correlated Pairs

I Chi-square test of independence

I Mutual information
    I $M(I, J) = H(I) + H(J) - H(I, J)$, where

$$H(I) = -\sum_{\alpha} P(i = \alpha) \log P(i = \alpha)$$

and

$$H(I, J) = -\sum_{\alpha\beta} P(i = \alpha, j = \beta) \log P(i = \alpha, j = \beta)$$
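As an illustration of the mutual information formula, here is a minimal sketch that estimates the probabilities from column frequencies in an alignment. Several choices are assumptions not stated on the slide: logarithms are taken in base 2 (so the result is in bits), columns are indexed from 0, and the names `entropy` and `mutual_information` are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(counts, total):
    """Shannon entropy (base 2, an assumption) of an observed frequency table."""
    return -sum((c / total) * log2(c / total) for c in counts.values())


def mutual_information(alignment, i, j):
    """M(I, J) = H(I) + H(J) - H(I, J) for columns i and j (0-based)."""
    col_i = [seq[i] for seq in alignment]
    col_j = [seq[j] for seq in alignment]
    n = len(alignment)
    h_i = entropy(Counter(col_i), n)
    h_j = entropy(Counter(col_j), n)
    h_ij = entropy(Counter(zip(col_i, col_j)), n)
    return h_i + h_j - h_ij


# Toy two-column alignment: every row is a different, compensating pair.
covarying = ["GC", "CG", "AU", "UA"]
print(mutual_information(covarying, 0, 1))
```

Two perfectly covarying columns give the maximum value (2 bits for four equiprobable nucleotides), whereas two fully conserved columns give 0, which is why conserved helices carry no covariation signal.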


Entropy or uncertainty, one variable, two outcomes

[Figure: entropy H (vertical axis, up to 1.0) plotted against the probability p (horizontal axis, 0.0 to 1.0).]

The figure shows how the entropy varies as a function of p.
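For reference, the curve for one variable with two outcomes is presumably the binary entropy function. Assuming logarithms in base 2 (an assumption, suggested by the vertical axis topping out at 1.0), it can be written as:

```latex
H(p) = -p \log_2 p - (1 - p) \log_2 (1 - p),
\qquad
H(0) = H(1) = 0,
\qquad
H\!\left(\tfrac{1}{2}\right) = -\tfrac{1}{2}\log_2\tfrac{1}{2} - \tfrac{1}{2}\log_2\tfrac{1}{2} = 1 .
```

Uncertainty is therefore greatest when the two outcomes are equally likely and vanishes when one outcome is certain.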


Accuracy of comparative analysis on rRNAs

I Late 1970s, comparative sequence analysis

I 16S ∼ 1500 nt long, 23S ∼ 3000 nt long

I $4.3 \times 10^{393}$ and $6.3 \times 10^{740}$ possible secondary structures, respectively

I 2000, high-resolution crystal structures of rRNAs produced

I Gutell et al. The accuracy of ribosomal RNA comparative structure models. Curr Opin Struct Biol (2002) vol. 12 (3) pp. 301-10

I “97–98% of the base pairings predicted with covariation analysis are indeed present in the 16S and 23S rRNA crystal structures”


What are the main difficulties?

I Needs an alignment, but sequence alignment techniques are not well adapted for RNA sequences

I To produce a high quality alignment, the sequences should be similar

I If the sequences are similar, there will be few observed compensatory changes


RNA Folding

I How to search the space of all possible secondary structures?

I How to select the best structure?
    I Maximizing the number of base-pairs (Nussinov)
    I Maximizing the number of hydrogen bonds
    I Minimizing the free energy (Zuker/MFOLD)


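Maximizing the number of pairs for the segment i..j

[Figure: the four cases for the segment i..j: (1) i and j are paired, enclosing i+1..j−1; (2) i is unpaired, leaving i+1..j; (3) j is unpaired, leaving i..j−1; (4) the segment splits into i..k and k+1..j.]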


Nussinov

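Initialization:

$$\gamma(i, i + k) = 0 \quad \text{for } k = 0 \text{ to } 3 \text{ and for } i = 1 \text{ to } n - k.$$

Recurrence:

$$\gamma(i, j) = \max \begin{cases} \gamma(i + 1, j - 1) + \delta(i, j); \\ \gamma(i + 1, j); \\ \gamma(i, j - 1); \\ \max_{i < k < (j - 1)} [\gamma(i, k) + \gamma(k + 1, j)]. \end{cases}$$

Matching score:

$$\delta(i, j) = \begin{cases} 1, & \text{if } a_i : a_j \in \{A{:}U, U{:}A, G{:}C, C{:}G\} \cup \{G{:}U, U{:}G\}; \\ 0, & \text{otherwise.} \end{cases}$$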
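The recurrence can be turned into a short dynamic-programming routine. The sketch below is one possible reading of it, under stated assumptions: indices are 0-based rather than 1-based, the table is filled by increasing segment length, and the names `nussinov` and `min_sep` are hypothetical (`min_sep = 4` reproduces the initialization for k = 0 to 3). It returns only the maximum number of pairs, without traceback.

```python
# Pairs accepted by the matching score delta: Watson-Crick plus G:U wobble.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def nussinov(seq, min_sep=4):
    """Maximum number of base pairs for seq under the recurrence above."""
    n = len(seq)
    if n == 0:
        return 0
    # gamma[i][j] stays 0 whenever j - i < min_sep (the initialization step).
    gamma = [[0] * n for _ in range(n)]
    for span in range(min_sep, n):
        for i in range(n - span):
            j = i + span
            delta = 1 if (seq[i], seq[j]) in PAIRS else 0
            best = max(
                gamma[i + 1][j - 1] + delta,  # i and j paired (if delta = 1)
                gamma[i + 1][j],              # i unpaired
                gamma[i][j - 1],              # j unpaired
            )
            # Bifurcation: i < k < j - 1
            for k in range(i + 1, j - 1):
                best = max(best, gamma[i][k] + gamma[k + 1][j])
            gamma[i][j] = best
    return gamma[0][n - 1]


print(nussinov("GGGAAAACCC"))  # 3
```

The three nested loops give the cubic running time usually quoted for this algorithm, with a quadratic table.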


Nature does not use this strategy!


How about maximizing the number of hydrogen bonds?

+3 for G.C pairs
+2 for A.U pairs
+1 for G.U pairs
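To make the weighting concrete, the following sketch scores a given list of base pairs with the three weights from the slide. The function name `hbond_score`, the 0-based indexing, and the example sequence are assumptions made for illustration; any other combination of nucleotides scores 0.

```python
# Weights as listed on the slide; the key is the unordered pair of nucleotides.
HBOND_WEIGHTS = {
    frozenset("GC"): 3,
    frozenset("AU"): 2,
    frozenset("GU"): 1,
}


def hbond_score(seq, pairs):
    """Sum the weights of the base pairs (i, j), given as 0-based indices."""
    return sum(
        HBOND_WEIGHTS.get(frozenset((seq[i], seq[j])), 0) for i, j in pairs
    )


# Hypothetical hairpin: G-C, A-U and U-G pairs closing a four-nucleotide loop.
print(hbond_score("GAUAAAAGUC", [(0, 9), (1, 8), (2, 7)]))  # 6
```

Using this score in place of the 0/1 matching score in the Nussinov recurrence gives the weighted variant discussed here.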


Nature does not use this strategy either!


Nearest-neighbor model

[Figure: a stem-loop drawn from 5′ to 3′, with the contribution of each motif annotated. The annotations recovered from the figure are:]

I stacks: −2.9, −2.1, −1.8, −0.9, −1.8
I −2.9 stack (special case 1 nt bulge)
I 1 nt bulge: +3
I 4 nt loop: +5.9
I terminal mismatch hairpin: −1.1
I 5′ dangle: −0.3
I unstructured ss: 0.0


Simplified Zuker (MFOLD) algorithm

$$W(i, j) = \min \begin{cases} W(i + 1, j), \\ W(i, j - 1), \\ V(i, j), \\ \min_{i \le k < j} [W(i, k) + W(k + 1, j)], \end{cases}$$

where $V$ models a segment such that $i$ and $j$ form a base pair.

$$V(i, j) = \min \begin{cases} V_1(i, j), & \text{hairpin closed by } i \bullet j \\ V_2(i, j), & \text{helix extension, bulge, interior loop} \\ V_3(i, j), & \text{multiple loop} \end{cases}$$


$V_2$, extending a helix, bulge, or interior loop:

$$V_2(i, j) = \min_{i < i' < j' < j} [e(\text{motif}) + V(i', j')]$$

$V_3$, multi-branch loop structures:

$$V_3(i, j) = \min_{i + 1 < k < j - 1} [e(\text{motif}) + W(i + 1, k) + W(k + 1, j - 1)]$$
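The W/V recurrences can be sketched with memoized recursion. Everything numeric below is a hypothetical toy energy model chosen only to make the example run; it is not the MFOLD parameter set. The constants `HAIRPIN`, `STACK`, `LOOP_BASE`, `LOOP_PER_NT` and `MULTI`, the function name `toy_mfe`, the 0-based indexing and the minimum separation of 4 are all assumptions.

```python
from functools import lru_cache
import math

PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}

# Hypothetical energies (arbitrary units), NOT nearest-neighbor parameters.
HAIRPIN = 4.0       # V1: any hairpin loop
STACK = -3.0        # V2: inner pair directly stacked on (i, j)
LOOP_BASE = 1.0     # V2: bulge or interior loop, fixed part
LOOP_PER_NT = 0.5   # V2: bulge or interior loop, per unpaired nucleotide
MULTI = 5.0         # V3: multi-branch loop
MIN_SEP = 4         # i and j may pair only if j - i >= MIN_SEP


def toy_mfe(seq):
    """Minimum energy of seq under the simplified W/V recurrences."""
    n = len(seq)

    @lru_cache(maxsize=None)
    def V(i, j):
        # Energy of the best structure on i..j in which i and j are paired.
        if j - i < MIN_SEP or (seq[i], seq[j]) not in PAIRS:
            return math.inf
        best = HAIRPIN  # V1
        for ip in range(i + 1, j):          # V2: i < i' < j' < j
            for jp in range(ip + 1, j):
                inner = V(ip, jp)
                if inner == math.inf:
                    continue
                unpaired = (ip - i - 1) + (j - jp - 1)
                motif = STACK if unpaired == 0 else LOOP_BASE + LOOP_PER_NT * unpaired
                best = min(best, motif + inner)
        for k in range(i + 2, j - 1):       # V3: i + 1 < k < j - 1
            best = min(best, MULTI + W(i + 1, k) + W(k + 1, j - 1))
        return best

    @lru_cache(maxsize=None)
    def W(i, j):
        # Energy of the best structure on i..j; the open chain has energy 0.
        if i >= j:
            return 0.0
        best = min(W(i + 1, j), W(i, j - 1), V(i, j))
        for k in range(i, j):               # i <= k < j
            best = min(best, W(i, k) + W(k + 1, j))
        return best

    return W(0, n - 1) if n else 0.0


print(toy_mfe("GGGGAAAACCCC"))  # -5.0: three stacks (-9.0) plus one hairpin (+4.0)
```

Because every inner pair is enumerated in V2, this sketch runs in roughly quartic time; practical implementations bound the interior loop size to stay cubic.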


MFOLD

I Sophisticated energy minimization program

I Developed by Mike Zuker (NRC/Ottawa, now at RPI)

I Finds the structure with the minimum equilibrium free energy ($\Delta G$), as approximated by neighboring base pair contributions

I Takes into account: stacking, hairpin loop lengths, bulge loop lengths, interior loop lengths, multi-branch loop lengths, single dangling nucleotides and terminal mismatches on stems

I Takes $O(n^2)$ space and $O(n^3)$ time


Performance of the Nearest-Neighbour Model (for a single sequence)

I The nearest-neighbour model works reasonably well for small RNAs: 69% and 71% PPV (positive predictive value) for the tRNA and 5S rRNA, which are approximately 80 and 120 nucleotides long, respectively.

K. J. Doshi, J. J. Cannone, C. W. Cobaugh, and R. R. Gutell (2004) Evaluation of the suitability of free-energy minimization using nearest-neighbor energy parameters for RNA secondary structure prediction. BMC Bioinformatics 5(1):105.


Observations

I RNAs conserve secondary structure interactions more than they conserve their sequence

I The nearest-neighbour model performs well on average but fails for certain sequences

I Single-sequence methods can be generalised to determine a consensus structure for more than one sequence

I As the number of input sequences increases it becomes unlikely that the nearest-neighbour model simultaneously fails for all of them


eXtended Dynalign

- David Sankoff (1985) Simultaneous solution of the RNA folding, alignment and protosequence problems. SIAM J. Appl. Math. 45(5):810–825.

- The objective function is a linear combination of the free energies of the sequences given the common secondary structure;

- D.H. Mathews and D.H. Turner (2002) Dynalign: An Algorithm for Finding the Secondary Structure Common to Two RNA Sequences. J. Mol. Biol. 317:191–203.

- We extended this work to three sequences.


Idea

- The objective function is a linear combination of the free energies of the sequences given the common structure:

$$\Delta G^\circ_{\text{total}} = \Delta G^\circ_{\text{seq 1}} + \Delta G^\circ_{\text{seq 2}} + \Delta G^\circ_{\text{seq 3}} + \Delta G^\circ_{\text{insertions}}$$

- No terms for substitutions;

- Solved by dynamic programming: an alignment and a common secondary structure are constructed for S1[i, j], S2[k, l] and S3[m, n], from the smallest to the largest segments.
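
As a minimal sketch of this objective (not the authors' implementation), the total can be computed as the sum of the three per-sequence free energies plus a gap term. The function name, the linear per-gap penalty and the energy values below are hypothetical placeholders chosen for illustration.

```python
def total_free_energy(dg_seq1, dg_seq2, dg_seq3, n_gaps, gap_penalty=0.4):
    """Sum of the three free energies (kcal/mol) plus a linear gap term.

    gap_penalty is an illustrative value, not one taken from the slides.
    """
    dg_insertions = gap_penalty * n_gaps
    return dg_seq1 + dg_seq2 + dg_seq3 + dg_insertions


# Hypothetical energies for three sequences folded into a common structure.
example = total_free_energy(-20.0, -18.5, -19.0, n_gaps=2, gap_penalty=0.5)
```

Because no substitution term appears, two alignments that differ only in which nucleotides face each other in unpaired regions receive the same score.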


Idea

Score = -578

GCCCGGGTGGTGTAGTGGCCCATCATACGACCCTGTCACGGTCG-TGACGCGGGTTCAAATCCCGCCTCGGGCGCCA
GTCGCAATGGTG-TAGTTGGGAGCATGACAGACTGAAGATCTGTTGGTCATCGGTTCGATCCCGGTTTGTGACACCA
GCCCCCAUCGUCUAGAGGCCUAGGACACCUCCCUUUCACGGAGG-CGACAGGGAUUCGAAUUCCCUUGGGGGUACCA
(((((((..(((...........))).(((((.......))))).....(((((.......))))))))))))....


eXtended Dynalign

The recurrence equations describing the free energy are somewhat complex. There are 140 cases: V1, V2, V3 (cases 1 to 64), W1, W2, W3 (cases 1 to 64) and W9 (cases 1 to 8). Let S1, S2 and S3 be three RNA sequences.

- W(i, j; k, l; m, n) represents the sum of the free energies of S1[i, j], S2[k, l] and S3[m, n], given the common secondary structure;

- V(i, j; k, l; m, n) is defined similarly to W, but also imposes the constraints that i is paired with j, k is paired with l, and m is paired with n;

- W9 represents the free energy for a prefix alignment of S1[1, j], S2[1, l] and S3[1, n].


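Hairpin loop closed by a base pair: V1(i, j; k, l; m, n)

[Figure: hairpin loops in S1, S2 and S3, closed by the base pairs (i, j), (k, l) and (m, n) respectively.]

$$V_1(i,j;k,l;m,n) = \Delta G^\circ_{\text{hairpin}}(i,j) + \Delta G^\circ_{\text{hairpin}}(k,l) + \Delta G^\circ_{\text{hairpin}}(m,n) + \Delta G^\circ_{\text{gap}}(\text{no. of gaps})$$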


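Helix Extension: V2.1(i, j; k, l; m, n)

[Figure: in S1, S2 and S3, the base pairs (i, j), (k, l) and (m, n) stack on the pairs (i+1, j−1), (k+1, l−1) and (m+1, n−1).]

$$V_{2.1}(i,j;k,l;m,n) = V(i+1,j-1;k+1,l-1;m+1,n-1) + \Delta G^\circ_{\text{motif1}} + \Delta G^\circ_{\text{motif2}} + \Delta G^\circ_{\text{motif3}}$$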


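Multibranch Loop: V3.1(i, j; k, l; m, n)

[Figure: in S1, S2 and S3, the segments [i, j], [k, l] and [m, n] are split at positions c, e and g respectively.]

$$V_{3.1}(i,j;k,l;m,n) = W(i,c;k,e;m,g) + W(c+1,j;e+1,l;g+1,n) + \Delta G^\circ_{\text{motif1}} + \Delta G^\circ_{\text{motif2}} + \Delta G^\circ_{\text{motif3}}$$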


Summary

- A base pair is predicted only if it occurs simultaneously in all three sequences

- The algorithm finds a consensus structure

- An alignment is produced as a byproduct; it is reliable only in the base-paired regions, since no substitution scores are used


MFOLD: tRNAs

Id       Sensitivity (%)   PPV (%)   MCC (%)
RD0260   33.3              29.2      31.2
RD0500   47.6              43.5      45.5
RD4800   42.9              56.2      49.1
RE2140   95.2              87        91
RE6781   33.3              28        30.6
RF6320   0                 0         0
RL0503   0                 0         0
RL1141   40                43.5      41.7
RS0380   52                56.5      54.2
RS1141   19.2              25        21.9
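
The three measures in this table can be computed from sets of base pairs. The following is a minimal sketch, assuming that structures are given in dot-bracket notation without pseudoknots and that MCC is approximated by the geometric mean of sensitivity and PPV. That approximation agrees with the values listed above, but it is an assumption about how they were obtained; the function names are illustrative.

```python
import math


def base_pairs(dot_bracket):
    """Return the set of (i, j) index pairs encoded by a dot-bracket string."""
    stack, pairs = [], set()
    for pos, symbol in enumerate(dot_bracket):
        if symbol == "(":
            stack.append(pos)
        elif symbol == ")":
            pairs.add((stack.pop(), pos))
    return pairs


def accuracy(reference, predicted):
    """Sensitivity, PPV and approximate MCC, all in percent."""
    ref, pred = base_pairs(reference), base_pairs(predicted)
    tp = len(ref & pred)  # correctly predicted base pairs
    sensitivity = 100.0 * tp / len(ref) if ref else 0.0
    ppv = 100.0 * tp / len(pred) if pred else 0.0
    mcc = math.sqrt(sensitivity * ppv)  # geometric-mean approximation
    return sensitivity, ppv, mcc


# Toy example: the prediction recovers two of the three reference pairs
# and adds no incorrect pair.
sens, ppv, mcc = accuracy("(((...)))", ".((...)).")
```

Sensitivity penalises missed pairs, while PPV penalises pairs that are predicted but absent from the reference, which is why a cautious predictor can score high on PPV and low on sensitivity.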


MFOLD: 5S rRNAs

Id         Sensitivity (%)   PPV (%)   MCC (%)
AJ131594   23.7              60        37.7
AJ251080   26.3              45.5      34.6
D11460     15.8              37.5      24.3
K02682     20.5              40        28.6
M10816     31.6              70.6      47.2
M16532     10.3              21.1      14.7
M25591     26.3              45.5      34.6
V00336     37.5              65.2      49.5
X02024     15.8              37.5      24.3
X02627     38.5              68.2      51.2
X04585     0                 0         0
X08000     0                 0         0
X08002     0                 0         0


Are three input sequences better than two?

1. The worst prediction (minimum accuracy) should be more accurate;

2. Use of three input sequences should improve the average accuracy;

3. Average coverage should be lower.

Masoumi, B. and Turcotte, M. (2005) Simultaneous alignment and structure prediction of three RNA sequences. Int. J. Bioinformatics Research and Applications, Vol. 1, No. 2, pp. 230-245.

Masoumi, B. and Turcotte, M. (2005) Simultaneous alignment and structure prediction of RNAs: Are three input sequences better than two? In S. V. Sunderam et al., editors, 2005 International Conference on Computational Science (ICCS 2005), Lecture Notes in Computer Science 3515, pp. 936-943, Atlanta, USA, May 22-25, 2005.


PPV: tRNA Dataset

Id       N_xd   N_d   Min_xd   Min_d   Max_xd   Max_d   Ave_xd   Ave_d
RD0260   4      5     100      80      100      100     100.0    96.0
RD0500   4      5     76       45      100      100     82.2     80.8
RD4800   5      5     100      80      100      100     100.0    96.0
RE2140   2      4     100      100     100      100     100.0    100.0
RE6781   2      4     100      77      100      100     100.0    94.3
RF6320   4      5     95       45      100      100     96.4     89.1
RL0503   1      2     100      100     100      100     100.0    100.0
RL1141   2      3     100      70      100      100     100.0    90.3
RS0380   1      2     100      83      100      87      100.0    85.2
RS1141   2      3     100      70      100      100     100.0    90.3

xd stands for eXtended Dynalign, d stands for Dynalign.

X-Dynalign 96.8 ± 7.6 vs Dynalign 92.1 ± 14.6.


eXtended-Dynalign reproduces the clover-leaf structure

[Figure: three secondary-structure drawings of the tRNA RD0500 (plt22ps by D. Stewart and M. Zuker, © 2004 Washington University): (a) RD0500, (b) Dynalign, (c) X-Dynalign.]


Fine details are better reproduced as well

[Figure: three secondary-structure drawings of the tRNA RS0380 (plt22ps by D. Stewart and M. Zuker, © 2004 Washington University): (a) RS0380, (b) Dynalign, (c) X-Dynalign.]


PPV: 5S rRNA

Id         N_xd   N_d   Min_xd   Min_d   Max_xd   Max_d   Ave_xd   Ave_d
AJ131594   2      3     100      91      100      100     100.0    94.5
AJ251080   6      5     88       82      90       86      90.3     84.8
D11460     6      5     87       66      87       88      87.6     79.4
K02682     8      9     63       88      100      97      89.1     92.0
M10816     3      4     90       85      90       88      90.7     87.8
M16532     1      2     94       77      94       85      94.1     81.8
M25591     6      5     87       82      90       86      89.8     84.8
V00336     3      4     75       65      100      100     91.9     91.4
X02024     9      6     88       82      90       88      90.1     85.8
X02627     1      2     100      92      100      100     100.0    96.0
X04585     2      3     72       68      94       93      83.4     82.7
X08000     5      5     90       88      90       90      90.6     89.4
X08002     5      5     90       88      90       90      90.6     89.4

X-Dynalign 90.3 ± 5.8 vs Dynalign 87.7 ± 7.4.


(K02682,V00336,X04585), PPV = 63%

[Figure: three secondary-structure drawings (plt22ps by D. Stewart and M. Zuker, © 2004 Washington University). Reference, Dynalign and X-Dynalign structures for the 5S rRNA K02682.]


Pros: eXtended Dynalign

- The mean PPV is higher

- Better worst-case scenario

- The average sensitivity is slightly degraded. However, for the majority of the sequences the minimum sensitivity is higher for eXtended Dynalign

- Some subtle details, such as the variable loop of some tRNAs, are well reproduced


Cons: eXtended Dynalign

- $O(|S_1|^2 M^4)$ space, $O(|S_1|^3 M^6)$ time

- Severe constraint on M: M ≤ 6

- Up to two weeks of CPU time for some sequences¹

- Length limited to some 150 nucleotides

¹ Sun Fire V20z, AMD Opteron 2.2 GHz
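
To get a feel for these bounds, the following sketch evaluates the leading terms of the space and time bounds quoted above for a given sequence length and value of M, ignoring constant factors. The function names are illustrative only, and the counts are orders of magnitude rather than measurements.

```python
def table_entries(n, m):
    """Leading term of the space bound, |S1|^2 * M^4 (constants ignored)."""
    return n ** 2 * m ** 4


def elementary_steps(n, m):
    """Leading term of the time bound, |S1|^3 * M^6 (constants ignored)."""
    return n ** 3 * m ** 6


# A 150-nt sequence with M = 6, the limits quoted above.
entries = table_entries(150, 6)    # 29,160,000
steps = elementary_steps(150, 6)   # 157,464,000,000
```

Doubling M multiplies the space term by 16 and the time term by 64, which suggests why M has to stay small.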


Summary

- Comparative sequence analysis: the gold standard, but tedious and only partially automated

- Single-sequence methods work best for short sequences (fewer than 100 nt), but even then the quality of the results varies greatly

- Consensus approaches produce good results, but are CPU intensive


First Generation: RNAMOT, RNABOB; Second Generation: RNAMOTIF; Third Generation: RSearch, INFERNAL

Database Search Problem

Find all sequences containing a user-specified motif, or all the sequences that can be folded into a user-specified structure


Example motif descriptor:

H1 s1 H2 s2 H2 s3 H3 s4 H3 s5 H1
H1 3:5 0
H2 4:5 1 AGC:GCU
H3 4:5 1
S1 3:6 UCC
S2 5:7
S3 0:3
S4 5:8 GAGA
S5 3:5
R H2 H3 H1
M 1

[Figure: drawing of the motif, with helices H1, H2 and H3 and single strands s1 to s5.]
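
To illustrate what searching for such a descriptor involves, here is a much simplified sketch that scans a sequence for a single stem-loop ("H1 s1 H1"). It is not the algorithm used by RNAMOT or RNABOB: it handles one helix only, allows no mismatches, and reads ranges such as 3:5 as minimum:maximum lengths, which is an assumption. The function name and the test sequence are hypothetical.

```python
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}
WOBBLE = {("G", "U"), ("U", "G")}


def find_hairpins(seq, stem=(3, 5), loop=(3, 6), allow_wobble=False):
    """Yield (start, stem_len, loop_len) for every stem-loop found in seq."""
    allowed = PAIRS | WOBBLE if allow_wobble else PAIRS
    lo_stem, hi_stem = stem
    lo_loop, hi_loop = loop
    for start in range(len(seq)):
        for stem_len in range(lo_stem, hi_stem + 1):
            for loop_len in range(lo_loop, hi_loop + 1):
                end = start + 2 * stem_len + loop_len
                if end > len(seq):
                    continue
                # Position start + k must pair with position end - 1 - k.
                if all((seq[start + k], seq[end - 1 - k]) in allowed
                       for k in range(stem_len)):
                    yield start, stem_len, loop_len


# A 4-bp stem (GGCC/GGCC) closing a 4-nt loop (UUCG), starting at index 2.
hits = list(find_hairpins("AAGGCCUUCGGGCCAA", stem=(4, 4), loop=(4, 4)))
```

A real descriptor combines several such elements, so practical tools order the elements of the search (compare the R line of the descriptor) instead of enumerating every combination.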


> RNAMOT -s -s mydb.fa -d mystery.mot

--- HUM7SLR1 Human 7SL RNA pseudogene, clone p7L30.1. --- (110 bases)

|SCO: 201.40|POS:6-56|MIS: 0|WOB: 0|

|CAGCU|GAUGCU|AGCU|GAUGCU|AGCU|-|GAUCG|UAGCUAGU|CGAUC|CGU|AGCUG|

...


- RSearch and INFERNAL are principled approaches

- Based on the CYK (Cocke-Younger-Kasami) algorithm for parsing context-free grammars

- Solid statistical foundation

- RSearch takes as input a sequence and a secondary structure

- INFERNAL takes as input an MSA and a consensus structure
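
For readers unfamiliar with CYK, the sketch below is a plain Boolean recogniser for a grammar in Chomsky normal form. It is only meant to show the table-filling idea; RSearch and INFERNAL use scored, stochastic variants that are considerably more elaborate. The toy grammar, which accepts g...g a c...c with matching numbers of g and c (a crude stand-in for nested G-C pairs around a loop), and all names are assumptions made for illustration.

```python
def cyk_accepts(word, unary, binary, start="S"):
    """CYK recognition for a grammar in Chomsky normal form.

    unary  maps a terminal to the set of nonterminals A with A -> terminal.
    binary maps a pair (B, C) to the set of nonterminals A with A -> B C.
    """
    n = len(word)
    if n == 0:
        return False
    # table[i][span] holds the nonterminals deriving word[i:i + span + 1].
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, symbol in enumerate(word):
        table[i][0] = set(unary.get(symbol, ()))
    for span in range(1, n):                # substring length minus one
        for i in range(n - span):
            for split in range(span):       # left part has length split + 1
                for left in table[i][split]:
                    for right in table[i + split + 1][span - split - 1]:
                        table[i][span] |= binary.get((left, right), set())
    return start in table[0][n - 1]


# Toy grammar: S -> g S c | g a c, rewritten in Chomsky normal form.
UNARY = {"g": {"Gt"}, "a": {"At"}, "c": {"Ct"}}
BINARY = {
    ("Gt", "T"): {"S"},
    ("Gt", "U"): {"S"},
    ("S", "Ct"): {"T"},
    ("At", "Ct"): {"U"},
}
```

For example, `cyk_accepts("ggacc", UNARY, BINARY)` accepts the nested string, whereas `"ggac"` is rejected because one g has no matching c. The three nested loops over start, length and split point are the source of the cubic dependence on the length of the parsed string.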


RSearch

# STOCKHOLM 1.0
#=GS Holley DE tRNA-Ala that Holley sequenced from Yeast genome
Holley         GGGCGTGTGGCGTAGTCGGTAGCGCGCTCCCTTAGCATGGGAGAGGtCTCCGGTTCGATTCCGGACTCGTCCA
#=GR Holley SS (((((.(..((((........)))).(((((.......))))).....(((((.......)))))).))))).
//
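
A query in this format can be read with a few lines of code. The sketch below is a deliberately minimal reader that keeps only sequence lines and per-sequence "#=GR ... SS" structure lines; it is not a full Stockholm parser and is not taken from RSearch. The function name and the toy record are hypothetical.

```python
def parse_stockholm(text):
    """Return (sequences, structures) dictionaries keyed by sequence name."""
    sequences, structures = {}, {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line == "//" or line.startswith("# STOCKHOLM"):
            continue
        fields = line.split()
        if fields[0] == "#=GR" and len(fields) == 4 and fields[2] == "SS":
            # Per-sequence secondary structure annotation.
            structures[fields[1]] = structures.get(fields[1], "") + fields[3]
        elif not line.startswith("#") and len(fields) == 2:
            # Sequence line: name followed by residues.
            sequences[fields[0]] = sequences.get(fields[0], "") + fields[1]
    return sequences, structures


# Hypothetical toy record, much shorter than the tRNA shown above.
TOY = """# STOCKHOLM 1.0
#=GS toy DE hypothetical hairpin
toy GGGAAACCC
#=GR toy SS (((...)))
//
"""
seqs, structs = parse_stockholm(TOY)
```

Checking that each sequence and its structure line have the same length is a cheap sanity test before launching a long search.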


RSearch

- RIBOSUM substitution matrices (analogous to residue substitution scores such as PAM and BLOSUM, but for base pairs)

- Reports the statistical significance of all the matches

- Execution time is $O(NM^3)$, where N is the size of the database and M is the length of the input sequence

- “(. . . ) a typical single search of a metazoan genome may take a few thousand CPU hours”


INFERNAL

- Nawrocki et al. Infernal 1.0: inference of RNA alignments. Bioinformatics (2009) pp.

- Also based on CYK, but uses an MSA as input

- MSA + Structure ⇒ Covariance Model


INFERNAL/Rfam Covariance Models


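# STOCKHOLM 1.0
#=GC SS_cons <<<<..>>>>
seq1         GGAGAUCUCC
seq2         GGGGAUCCCC
seq3         UGGGAACCCA
seq4         GGGGAUCCCU
seq5         GGGGAACCCC
//

[Figure: covariance model drawn as a tree of states (S0, S1, S2, S3, S4, S5, S6, S8, ...), with emitted bases such as a and u attached to the nodes.]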
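
The step from "MSA + structure" to a model can be illustrated by counting what each pair of columns emits. The sketch below reads the small Stockholm alignment shown on this slide, recovers the paired columns from the `SS_cons` line, and tallies the base pairs observed at each paired position. It is only the counting step. It does not reproduce how INFERNAL actually builds or parameterises a covariance model, and the helper names `parse_stockholm` and `paired_columns` are assumptions for this example.

```python
from collections import Counter

STOCKHOLM = """\
# STOCKHOLM 1.0
#=GC SS_cons <<<<..>>>>
seq1 GGAGAUCUCC
seq2 GGGGAUCCCC
seq3 UGGGAACCCA
seq4 GGGGAUCCCU
seq5 GGGGAACCCC
//
"""


def parse_stockholm(text):
    """Return ({name: aligned sequence}, consensus structure string)."""
    seqs, ss_cons = {}, ""
    for line in text.splitlines():
        line = line.strip()
        if not line or line == "//" or line.startswith("# STOCKHOLM"):
            continue
        if line.startswith("#=GC SS_cons"):
            ss_cons += line.split()[2]
        elif not line.startswith("#"):
            name, residues = line.split()
            seqs[name] = seqs.get(name, "") + residues
    return seqs, ss_cons


def paired_columns(ss_cons):
    """Match '<' with '>' using a stack; return sorted (left, right) columns."""
    stack, pairs = [], []
    for col, ch in enumerate(ss_cons):
        if ch == "<":
            stack.append(col)
        elif ch == ">":
            pairs.append((stack.pop(), col))
    return sorted(pairs)


seqs, ss = parse_stockholm(STOCKHOLM)
pairs = paired_columns(ss)
pair_counts = {
    (i, j): Counter(s[i] + s[j] for s in seqs.values()) for i, j in pairs
}

for (i, j), counts in pair_counts.items():
    print(i, j, dict(counts))
```

In the outermost pair of columns the five sequences show GC, UA and GU: the individual bases change, but the two positions remain able to pair. This compensatory variation is the signal a covariance model captures and a plain sequence profile misses.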


Specialized Programs

- tRNAscan-SE
  - tRNAscan and EufindtRNA identify candidates that are subsequently analysed by Cove (INFERNAL)
  - 1 false positive per 15 billion nt
  - Detects 99% of true tRNAs
  - http://lowelab.ucsc.edu/tRNAscan-SE/
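
The two figures quoted above translate directly into expected counts for a given search. The sketch below does that arithmetic. The rate and sensitivity are the ones on the slide, while the search size (3 billion nt) and the number of true tRNA genes (500) are hypothetical inputs chosen only for illustration.

```python
FALSE_POSITIVE_RATE = 1 / 15e9  # per nucleotide searched (from the slide)
SENSITIVITY = 0.99              # fraction of true tRNAs detected (from the slide)


def expected_false_positives(nucleotides_searched: float) -> float:
    return nucleotides_searched * FALSE_POSITIVE_RATE


def expected_detected(true_trnas: int) -> float:
    return true_trnas * SENSITIVITY


# Hypothetical search: 3 billion nt containing 500 true tRNA genes.
print(expected_false_positives(3e9))  # about 0.2 false positives expected
print(expected_detected(500))         # about 495 of the 500 genes expected
```

With a false-positive rate this low, essentially every hit in a genome-sized search can be taken seriously, which is what makes a specialised program attractive compared with a general-purpose search.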


Summary

- Sequence alignment methods are not appropriate for RNA

- Tools such as RNAMOT, RNABOB and RNAMOTIF make it possible to describe and find RNA structure motifs in sequence databases

- RSEARCH finds all the sequences whose sequence and secondary structure are similar to those of an input sequence and structure

- Homologous sequences and structures can be represented as a covariance model. INFERNAL finds all the sequences that are likely to share the same overall fold (secondary structure)





References

Can Alkan, Emre Karakoc, Joseph H Nadeau, S Cenk Sahinalp, and Kaizhong Zhang. RNA-RNA interaction prediction and antisense RNA target search. J Comput Biol, 13(2):267–82, Mar 2006.

Mirela Andronescu, Zhi Chuan Zhang, and Anne Condon. Secondary structure prediction of interacting RNA molecules. J Mol Biol, 345(5):987–1001, Feb 2005.

Ho-Lin Chen, Anne Condon, and Hosna Jabbari. An O(n^5) algorithm for MFE prediction of kissing hairpins and 4-chains in nucleic acids. J Comput Biol, 16(6):803–15, Jun 2009.


Hamidreza Chitsaz, Raheleh Salari, S Cenk Sahinalp, and Rolf Backofen. A partition function algorithm for interacting nucleic acid strands. Bioinformatics, 25(12):i365–73, Jun 2009.

Y. Ding and C. E. Lawrence. A Bayesian statistical algorithm for RNA secondary structure prediction. Computers & Chemistry, pages 387–400, 1999.


Paul P Gardner, Jennifer Daub, John G Tate, Eric P Nawrocki, Diana L Kolbe, Stinus Lindgreen, Adam C Wilkinson, Robert D Finn, Sam Griffiths-Jones, Sean R Eddy, and Alex Bateman. Rfam: updates to the RNA families database. Nucleic Acids Research, 37(Database issue):D136–40, Jan 2009.

Robert J Klein and Sean R Eddy. RSEARCH: finding homologs of single structured RNA sequences. BMC Bioinformatics, 4:44, Sep 2003.


J S McCaskill. The equilibrium partition function and base pair binding probabilities for RNA secondary structure. Biopolymers, 29(6-7):1105–19, Jan 1990.

E Nawrocki, D Kolbe, and S Eddy. Infernal 1.0: inference of RNA alignments. Bioinformatics, Mar 2009.

Jakob Skou Pedersen, Gill Bejerano, Adam Siepel, Kate Rosenbloom, Kerstin Lindblad-Toh, Eric S Lander, W James Kent, Webb Miller, and David Haussler. Identification and classification of conserved RNA secondary structures in the human genome. PLoS Comput Biol, 2(4):e33, Apr 2006.


David Sankoff. Simultaneous solution of RNA folding, alignment and protosequence problems. SIAM J. Appl. Math., 45(5):810–825, 1985.

M. Zuker. On finding all suboptimal foldings of an RNA molecule. Science, 244:48–52, 1989.

M. Zuker and P. Stiegler. Optimal computer folding of large RNA sequences using thermodynamics and auxiliary information. Nucl. Acids Res., 9:133–148, 1981.


Michael Zuker and David Sankoff. RNA secondary structures and their prediction. Bulletin of Mathematical Biology, 46(4):591–621, 1984.


Please don’t print these lecture notes unless you really need to!
