RNA multiple sequence alignment
Craig L. Zirbel, zirbel@bgsu.edu
October 14, 2010
RNA primary sequences
Laboratory techniques make it possible to extract specific RNA molecules and determine the sequence of nucleotides. Here are the (unaligned) sequences of the 5S ribosomal RNA molecule from different organisms:
UUAGGCGGCCACAGCGGUGGGGUUGCCUCCCGUACCCAUCCCGAACACGGAAGAUAAGCCCACCAGCGUUCCGGGGAGUACUGGAGUGCGCGAGCCUCUGGGAAACCCGGUUCGCCGCCACC A H.m. (structure)
GCCUGGCGGCCGUAGCGCGGUGGUCCCACCUGACCCCAUGCCGAACUCAGAAGUGAAACGCCGUAGCGCCGAUGGUAGUGUGGGGUCUCCCCAUGCGAGAGUAGGGAACUGCCAGGC B E.coli (structure)
UCCCCCGUGCCCAUAGCGGCGUGGAACCACCCGUUCCCAUUCCGAACACGGAAGUGAAACGCGCCAGCGCCGAUGGUACUGGGCGGGCGACCGCCUGGGAGAGUAGGUCGGUGCGGGG B T.th. (structure)
AGUGGUGGCCAUAUCGGCGGGGUUCCUCCCCGUACCCAUCCUGAACACGGAAGAUAAGCCCGCCAGCGUCCGGCAAGUACUGGAGUGCGCGAGCCUCUGGGAAAUCCGGUUCGCCGCCAC A L27170.1/1-120
GUAGCGGCCACAGCGGUGGGGUUCCUCCCGUACCCAUCCCGAACACGGAAGAUAAGCCCACCAGCGUUCCGGGGAGUACUGGAGUGCGCGACCCUCUGGGAAACCGGGUUCGCCGCUAC A L27163.1/1-119
GCGGCCAGGGCGGAGGGGAAACACCCGUACCCAUUCCGAACACGGAAGUGAAGCCCUCCAGCGAACCAGCUAGUACUAGAGUGGGAGACCCUCUGGGAGCGCUGGUUCGCCGCC A L27343.1/3-116
UUUGGCGGUCAUGGCGUGGGGGUUUAUACCUGAUCUCGUUUCGAUCUCAGUAGUUAAGUCCUGCUGCGUUGUGGGUGUGUACUGCGGUUUUUUGCUGUGGGAAGCCCACUUCACUGCCAGAC A M36187.1/5-126
GUUGGCGGUCAUGGCGUGGGGUUUAUACCUGAUCUCGUUUCGAUCUCAGUAGUUAAGUCCUGCUGCGUUGUGGGUGUGUACUGCGGUUUUUUGCUGUGGGAAGCCCACUUCACUGCCAGAC A X62857.1/1-121
UUUGGCGGUCAUGGCGUGGGGGUUAUACCUGAUCUCGUUUCGAUCUCAGUAGUUAAGUCCUGCUGCGUUGUGGGUGUGUACUGCGGUGUUUUGCUGUGGGAAGCCCAUUUCACUGCCAGCC A X15364.1/6601-6721
GUCGGUGGUGUUAGCGGUGGGGUCACGCCCGGUCCCUUUCCGAACCCGGAAGCUAAGCCUGCCUGCGCCGAUGGUACUGCACCUGGGAGGGUGUGGGAGAGUAGGACCCCGCCGGCA B M16176.1/4-120
GUCGGUGGUUAUAGCGGUGGGGUCACGCCCGGUCCCAUUCCGAACCCGGAAGCUAAGCCCACCUGCGCCGAUGGUACUGCACCUGGGAGGGUGUGGGAGAGUAGGUCACCGCCGGCC B M16177.1/4-120
GUUGGUGGUUAUUGUGUCGGGGGUACGCCCGGUCCCUUUCCGAACCCGGAAGCUAAGCCCGAUUGCGCUGAUGGUACUGCACCUGGGAGGGUGUGGGAGAGUAGGUCGCUGCCAACC B X55255.1/4-120
UACGGCGGUCAAUAGCGGCAGGGAAACGCCCGGUCCCAUCCCGAACCCGGAAGCUAAGCCUGCCAGCGCCAAUGAUACUGCCCUCACCGGGUGGAAAAGUAGGACACCGCCGAAC B X55259.1/3-117
UACGGCGGUCCAUAGCGGCAGGGAAACGCCCGGUCCCAUCCCGAACCCGGAAGCUAAGCCUGCCAGCGCCGAUGAUACUACCCAUCCGGGUGGAAAAGUAGGACACCGCCGAAC B X55251.1/3-116
UACGGCGGCCACAGCGGCAGGGAAACGCCCGGUCCCAUUCCGAACCCGGAAGCUAAGCCUGCCAGCGCCGAUGAUACUGCCCCUCCGGGUGGAAAAGUAGGACACCGCCGAAC B X75601.1/91-203
UAAGGCGGCCAUAGCGGUGGGGUUACUCCCGUACCCAUCCCGAACACGGAAGAUAAGCCCGCCUGCGUUCCGGUCAGUACUGGAGUGCGCGAGCCUCUGGGAAAUCCGGUUCGCCGCCUACU A X03407.1/5927-6048
UUGGCGACCAUAGCGGCGAGUGACCUCCCGUACCCAUCCCGAACACGGAAGAUAAGCUCGCCUGCGUUUCGGUCAGUACUGGAUUGGGCGACCCUCUGGGAAAUCUGAUUCGCCGCCACC A L27168.1/1-120
GGCGGCCAGAGCGGUGAGGUUCCACCCGUACCCAUCCCGAACACGGAAGUUAAGCUCACCUGCGUUCUGGUCAGUACUGGAGUGAGCGAUCCUCUGGGAAAUCCAGUUCGCCGCCC A X02128.1/24-139
GGGCGGCCAGAGCGGUGAGGUUCCACCCGUACCCAUCCCGAACACGGAAGUUAAGCUCGCCUGCGUUCUGGUCAGUACUGGAGUGAGCGAUCCUCUGGGAAAUCCAGUUCGCCGCCCCU A X14441.1/5-123
Watson-Crick basepairs
Watson-Crick basepairs can substitute for one another freely without changing the structure of the RNA molecule. They are said to be isosteric, and changes between these basepairs are an example of neutral variability. The two bases in each pair are held together by hydrogen bonds (dotted lines).
Superposition
RNA sequence variability
To preserve RNA helices, compensating mutations must be made; to replace a GC basepair with an AU basepair, two letters must change in distant regions of the sequence; see below. Statistically, this is called “long-range dependence.”
Compensating mutations such as this do not change the secondary or tertiary structure of the molecule.
UGCCUGGCGACCGUAGCGCGGUGGUCCCACCUGACCCCAUGCCGAACUCAGAAGUGAAACGCCGUAGCGCCGAUGGUAGUGUGGGGUCUCCCCAUGCGAGAGUAGGGAAUUGCCAGGCAU
UGCCUGGCGGCCGUAGCGCGGUGGUCCCACCUGACCCCAUGCCGAACUCAGAAGUGAAACGCCGUAGCGCCGAUGGUAGUGUGGGGUCUCCCCAUGCGAGAGUAGGGAACUGCCAGGCAU
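The compensating change in the two sequences above can be located mechanically. The following is only an illustrative sketch: the names `VARIANT`, `REFERENCE`, and `differences` are made up here, and the two strings are the sequences shown above, split into four pieces for readability.

```python
# Watson-Crick pairs, written as one base followed by its partner.
WATSON_CRICK = {"AU", "UA", "GC", "CG"}

# The two sequences shown above (the variable names are hypothetical labels).
VARIANT = (
    "UGCCUGGCGACCGUAGCGCGGUGGUCCCAC"
    "CUGACCCCAUGCCGAACUCAGAAGUGAAAC"
    "GCCGUAGCGCCGAUGGUAGUGUGGGGUCUC"
    "CCCAUGCGAGAGUAGGGAAUUGCCAGGCAU"
)
REFERENCE = (
    "UGCCUGGCGGCCGUAGCGCGGUGGUCCCAC"
    "CUGACCCCAUGCCGAACUCAGAAGUGAAAC"
    "GCCGUAGCGCCGAUGGUAGUGUGGGGUCUC"
    "CCCAUGCGAGAGUAGGGAACUGCCAGGCAU"
)


def differences(a, b):
    """Return (1-based position, letter in a, letter in b) wherever a and b differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return [(k + 1, x, y) for k, (x, y) in enumerate(zip(a, b)) if x != y]


diffs = differences(VARIANT, REFERENCE)

# This unpacking assumes exactly two differing positions, as in the example above.
(pos1, a1, b1), (pos2, a2, b2) = diffs

for position, old, new in diffs:
    print(position, old, "->", new)
print("pair in first sequence: ", a1 + a2, a1 + a2 in WATSON_CRICK)
print("pair in second sequence:", b1 + b2, b1 + b2 in WATSON_CRICK)
```

The two changed letters lie far apart in the sequence, yet each sequence on its own still forms a Watson-Crick pair at those two positions, which is the long-range dependence described above.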
Comparative sequence analysis
By manually aligning similar RNA sequences and noting the pairs of columns where mainly AU, CG, GC, and UA pairs occur, one can infer the secondary structure of an RNA molecule.
• This is the inferred secondary structure of the 5S RNA, with bases labeled as found in E. coli. There are five helical regions, with three “internal loops” and two “hairpin loops” separating them. Note the colors!
Fox & Woese 1975; Peattie et al. 1981; Noller 1984; Cannone et al. 2002; http://www.rna.ccbb.utexas.edu
UGCCUGGCGGCCGUAGCGCGGUGGUCCCACCUGACCCCAUGCCGAACUCAGAAGUGAAACGCCGUAGCGCCGAUGGUAGUGUGGGGUCUCCCCAUGCGAGAGUAGGGAACUGCCAGGCAU
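The column-pair search used in comparative sequence analysis can be sketched in a few lines. This is a toy illustration, not the procedure used in the cited studies: the four-sequence alignment, the function name `covarying_column_pairs`, and both thresholds are assumptions made up for this example.

```python
from itertools import combinations

WATSON_CRICK = {"AU", "UA", "GC", "CG"}

# Hypothetical toy alignment: four already-aligned hairpin sequences.
TOY_ALIGNMENT = [
    "GGCAGAAAUGCC",
    "GCCAGAAAUGGC",
    "AGCGGAAACGCU",
    "GGUAGAAAUACC",
]


def covarying_column_pairs(alignment, min_separation=4, min_fraction=1.0):
    """Find column pairs (i, j) that look like conserved basepairs.

    A column pair is reported when at least `min_fraction` of the sequences
    have a Watson-Crick pair there AND at least two different Watson-Crick
    pairs occur, i.e. the two columns change together.
    """
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")

    hits = []
    for i, j in combinations(range(length), 2):
        if j - i < min_separation:  # too close to form a hairpin
            continue
        pairs = [seq[i] + seq[j] for seq in alignment]
        wc_pairs = [p for p in pairs if p in WATSON_CRICK]
        if len(wc_pairs) / len(pairs) >= min_fraction and len(set(wc_pairs)) >= 2:
            hits.append((i, j))
    return hits


print(covarying_column_pairs(TOY_ALIGNMENT))
```

The reported column pairs are nested, which is the signature of a helix. Real alignments are noisier, so a lower `min_fraction` and a statistical measure of covariation would normally replace the strict test used here.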
RNA 3D structure
• Starting late in the year 2000, high-resolution atomic structures of entire ribosomes have been published. These show the bases, the backbone, the Watson-Crick basepairs, and several new types of basepairs.
E. coli 5S
The 2009 Nobel Prize in Chemistry went to Yonath, Ramakrishnan, and Steitz for their work on x-ray crystal structures of ribosomes.
Three 5S rRNA 3D structures
Haloarcula marismortui E. coli Thermus thermophilus
RNA multiple sequence alignment
The same RNA in different organisms can be presumed to have the same, or roughly the same, secondary and 3D structure.
Compensating changes far apart in the sequence make it hard to use multiple sequence alignment tools that were developed for proteins.
Two situations for RNA multiple sequence alignment
1. We have two or more sequences from the same RNA, but don’t know their common secondary structure or 3D structure
2. We have RNA sequences and a common secondary structure or even a single 3D structure which we can assume they all share to some degree
RNA MSA
RNA Multiple Sequence Alignment
Slides by Anton Petrov, Ph.D. student, BGSU
Why DNA and protein alignment methods don’t work for RNA
RNA sequences may look dissimilar but still fold into the same structure.
Gorodkin et al., 2010. Trends in biotechnology
Example
RNA-specific alignment methods
FOLDALIGN http://foldalign.ku.dk/index.html
MAFFT http://mafft.cbrc.jp/alignment/server/
LocARNA http://rna.informatik.uni-freiburg.de:8080/LocARNA.jsp
R-Coffee http://tcoffee.vital-it.ch/cgi-bin/Tcoffee/tcoffee_cgi/index.cgi?stage1=1&daction=RCOFFEE::Regular
and many others...
RNA MSA and ncRNA discovery
Conservation is a reliable indicator of biological importance.
If an RNA fragment is conserved across multiple species, it may function as a non-coding RNA (ncRNA).
ncRNA discovery programs scan multiple genomic sequences in order to detect putative ncRNA candidates.
MSA is an essential part of the ncRNA discovery pipeline.
RNA MSA and ncRNA discovery (diagram)
Components: multiple sequence alignment, secondary structure prediction, ncRNA discovery.
Strategies: align first; fold first; align and fold simultaneously.
RNAz
Once you have a good MSA, you can use tools like RNAz to scan your alignment for conserved stable secondary structures, which may function as ncRNAs.
http://rna.tbi.univie.ac.at/cgi-bin/RNAz.cgi
Suggested reading
Alignment to a common secondary structure
One standard starting point is a “seed” alignment of 20-100 RNA sequences together with a “dot-bracket” secondary structure diagram.
Infernal is a program that makes a “covariance model” based on the seed alignment and allows one to align new sequences to this model, thus aligning new sequences to an existing alignment.
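To make the “dot-bracket” notation concrete, here is a minimal sketch that turns a dot-bracket string into a list of paired positions and checks whether sequences are compatible with it. It has nothing to do with Infernal's own code: the function names, the toy structure, and the toy sequences are all assumptions made up for illustration, and treating GU wobble pairs as acceptable is a choice made here.

```python
# Pairs accepted as helix-compatible in this sketch (Watson-Crick plus GU wobble).
ALLOWED_PAIRS = {"AU", "UA", "GC", "CG", "GU", "UG"}


def parse_dot_bracket(structure):
    """Return sorted (i, j) index pairs (0-based) for matching parentheses."""
    stack, pairs = [], []
    for position, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(position)
        elif symbol == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at position {position}")
            pairs.append((stack.pop(), position))
        elif symbol != ".":
            raise ValueError(f"unexpected symbol {symbol!r}")
    if stack:
        raise ValueError(f"unmatched '(' at position {stack[-1]}")
    return sorted(pairs)


def is_compatible(sequence, structure):
    """True if every bracketed position pair in `sequence` is an allowed basepair."""
    if len(sequence) != len(structure):
        raise ValueError("sequence and structure must have the same length")
    return all(sequence[i] + sequence[j] in ALLOWED_PAIRS
               for i, j in parse_dot_bracket(structure))


# Hypothetical toy data: one shared structure, three candidate sequences.
STRUCTURE = "((((....))))"
for seq in ("GGGGAAAACCCC", "AUAUGCAAAUAU", "GGGGAAAAGGGG"):
    print(seq, is_compatible(seq, STRUCTURE))
```

The first two toy sequences share only two identical positions, yet both fit the same structure; the third fails because G cannot pair with G in a helix.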
Alignment to a model based on a 3D structure
One focus of the BGSU RNA group
• Take an RNA 3D structure, with all of the detail it gives about Watson-Crick basepairs and other RNA basepairs
• Make a model for sequence variability
• Align RNA sequences to the model, and thus to one another.
H.m. 5S rRNA basepair diagram
Standard AU and GC Watson-Crick basepairs are denoted by = or –
In other pairs, a circle stands for the Watson-Crick edge, a square for the Hoogsteen edge, and a triangle for the Sugar edge.
The basepair diagrams for E.coli and T.th. are similar
Working hypothesis: other organisms have largely the same basepair diagram, with neutral basepair substitutions that do not alter the 3D structure
Non-Watson-Crick basepairs
The 3D structures show a variety of planar basepair interactions other than Watson-Crick basepairs. These occur between helices and allow the RNA molecule to achieve tighter turns or other important 3D structural features.
trans Hoogsteen / Sugar Edge
A78-G98 in E.coli 5S
A45-U40 in E.coli 5S
cis Watson-Crick / Sugar Edge
A57 – C30 in E.coli 5S
A46 – A39 in E.coli 5S
trans Sugar Edge / Sugar Edge
G13 – G69 in E.coli 5S
Isostericity for non-Watson-Crick basepairs
Non-Watson-Crick basepairs have different basepair substitution (isostericity) rules than Watson-Crick pairs. Below are some examples of geometrically similar basepairs.
trans Hoogsteen / Sugar Edge
A78-G98 in E.coli 5S
A45-U40 in E.coli 5S
cis Watson-Crick / Sugar Edge
A57 – C30 in E.coli 5S
A46 – A39 in E.coli 5S
trans Sugar Edge / Sugar Edge
G13 – G69 in E.coli 5S
Stochastic grammars
Stochastic grammars are probabilistic models for sequences of characters or words.
They are capable of enforcing specified grammatical rules while allowing for variability in the specific sequence.
The classic example: “Colorless green ideas sleep furiously” obeys English grammatical rules, but is a very unlikely sentence to occur in normal English.
Context-free grammars have certain limitations on the grammatical rules that can be enforced.
Chomsky 1956, 1959; Durbin and Eddy 1994.
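As a rough illustration of a stochastic grammar applied to RNA, the sketch below samples hairpins from a two-rule grammar, S → x S y (emit a basepair around S) and S → loop. Everything in it is an assumption made up for this example: the rule probabilities, the pair and loop distributions, and the function name `sample_hairpin`. The rules guarantee that the two ends always pair, while the specific letters vary from sample to sample.

```python
import random

# Hypothetical emission probabilities for a basepair (5' letter, 3' letter).
PAIR_PROBS = {"GC": 0.30, "CG": 0.30, "AU": 0.15, "UA": 0.15, "GU": 0.05, "UG": 0.05}
# Hypothetical distribution over four-letter hairpin loops.
LOOP_PROBS = {"GAAA": 0.5, "UUCG": 0.3, "GCAA": 0.2}
# Probability of applying S -> x S y rather than S -> loop.
P_EXTEND = 0.7


def sample_hairpin(rng):
    """Sample one sequence from the grammar S -> x S y | loop."""
    left, right = [], []
    while rng.random() < P_EXTEND:
        pair = rng.choices(list(PAIR_PROBS), weights=list(PAIR_PROBS.values()))[0]
        left.append(pair[0])   # letter emitted on the 5' side
        right.append(pair[1])  # its partner, emitted on the 3' side
    loop = rng.choices(list(LOOP_PROBS), weights=list(LOOP_PROBS.values()))[0]
    # Partners were emitted outermost first, so the 3' side is reversed.
    return "".join(left) + loop + "".join(reversed(right))


rng = random.Random(2010)  # fixed seed so that repeated runs give the same samples
for _ in range(5):
    print(sample_hairpin(rng))
```

Because each pair is emitted as a unit, the dependence between distant positions is built into the grammar, which is what an ordinary position-by-position sequence model cannot express.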
Simple SCFG model for RNA
From the basepair diagram, we construct a model which mimics the structure of the molecule but which allows for neutral basepair variability and other minor variations.
The 5S itself is too large, so we display a very small cartoon of the 5S molecule.
5’
3’
Using the SCFG model to generate sequence variants
The Initial node generates letters independently with a given length and letter distribution. This time we get an A on the left and CA on the right.
ACCUGUUUCGACACAGGGAAGACAGAUGAGCA
A Basepair node generates a (dependent) pair of letters and independent insertions. The first Basepair node generates a CG pair and inserts an A on the right (before the G).
The Basepair node generates CG with no insertions.
The Junction node generates nothing, but passes control to its two child nodes.
The Initial node on the left branch generates U on the left and AC on the right; the Initial node on the right branch generates AU.
The Basepair node on the left branch generates GC; the Basepair node on the right branch generates AG.
The Basepair node on the left branch generates UA; the Basepair node on the right branch generates GA.
The Hairpin node on the left branch generates UUCG (a variant of the UNCG hairpin) and the last Basepair node on the right branch generates GC.
Finally, the last Hairpin generates GAAGA (a variant of the GNRA loop with one insertion) and generation stops.
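The generation steps described above can be replayed in code. The sketch below is not the authors' implementation: the class and field names are invented for illustration, and the stochastic part (drawing each emission from a probability distribution) is left out, so each node simply records the letters the walkthrough says it emitted. Which side the right-branch Initial node emits AU on is inferred here from the sequence.

```python
from dataclasses import dataclass


@dataclass
class Hairpin:
    loop: str                # letters of the hairpin loop


@dataclass
class Basepair:
    pair: str                # two letters: 5' partner, then 3' partner
    child: object
    left_insert: str = ""    # inserted just after the 5' partner
    right_insert: str = ""   # inserted just before the 3' partner


@dataclass
class Initial:
    left: str                # unpaired letters emitted on the 5' side
    right: str               # unpaired letters emitted on the 3' side
    child: object


@dataclass
class Junction:
    left_child: object
    right_child: object


def assemble(node):
    """Concatenate what each node emitted, from the outside in."""
    if isinstance(node, Hairpin):
        return node.loop
    if isinstance(node, Junction):
        return assemble(node.left_child) + assemble(node.right_child)
    if isinstance(node, Initial):
        return node.left + assemble(node.child) + node.right
    if isinstance(node, Basepair):
        return (node.pair[0] + node.left_insert + assemble(node.child)
                + node.right_insert + node.pair[1])
    raise TypeError(f"unknown node type: {type(node).__name__}")


# Left branch: Initial (U ... AC), Basepairs GC and UA, Hairpin UUCG.
left_branch = Initial("U", "AC", Basepair("GC", Basepair("UA", Hairpin("UUCG"))))
# Right branch: Initial (AU on the 3' side), Basepairs AG, GA, GC, Hairpin GAAGA.
right_branch = Initial("", "AU", Basepair("AG", Basepair("GA", Basepair("GC", Hairpin("GAAGA")))))
# Root: Initial (A ... CA), Basepair CG with an A inserted before the G, Basepair CG, Junction.
history = Initial(
    "A", "CA",
    Basepair("CG", Basepair("CG", Junction(left_branch, right_branch)), right_insert="A"),
)

print(assemble(history))
```

Representing the generation history as a tree makes the later parsing step easier to picture: aligning a sequence to the model amounts to recovering a tree like `history` from the letters alone.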
Parsing sequences according to a model
Typically, a model can generate the same sequence in several different ways.
Given a model and a sequence that was generated by the model, we want to determine the single way of generating the sequence that is most likely.
The most likely generation history tells which node generated which part of the sequence, and so aligns the sequence to the model.
Multiple ways to generate a sequence
Here is another way that the same simple model could have generated the sequence. This generation history would have very low probability, since the letters it assigns to certain basepairs are not isosteric with the originally observed basepairs.
ACCUGUUUCGACACAGGGAAGACAGAUGAGCA
Determining the maximum probability generation history for a sequence
The CYK (Cocke, Younger, Kasami) dynamic programming algorithm has each node (from leaves to root) examine each subsequence (from shorter to longer) and find the maximum probability way that it and its children could generate that subsequence.
Here, the blue Basepair node considers how it and its children can generate the colored subsequence.
ACCUGUUUCGACACAGGGAAGACAGAUGAGCA
The blue Basepair node considers another way for it and its children to generate the colored subsequence.
The red nodes have already considered every subsequence of this length and shorter.
The algorithm runs in O(L²M) time, where L is the length of the input sequence and M is the number of nodes.
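The shorter-to-longer table fill can be sketched on a deliberately tiny grammar. This is not the node-based model from the slides: it assumes a single nonterminal S with the rules S → x S y (pair), S → x S (unpaired letter on the left), S → S x (unpaired letter on the right), and S → end, and all probabilities are made up for illustration. With only one nonterminal, filling the table takes time quadratic in the sequence length.

```python
import math

WATSON_CRICK = {"AU", "UA", "GC", "CG"}
WOBBLE = {"GU", "UG"}

# Hypothetical rule probabilities (they sum to 1).
LOG_PAIR, LOG_LEFT, LOG_RIGHT, LOG_END = (math.log(p) for p in (0.5, 0.2, 0.2, 0.1))
LOG_SINGLE = math.log(0.25)  # each unpaired letter is equally likely


def log_pair_emission(x, y):
    """Hypothetical basepair emission probabilities (the 16 values sum to 1)."""
    if x + y in WATSON_CRICK:
        return math.log(0.225)
    if x + y in WOBBLE:
        return math.log(0.045)
    return math.log(0.001)


def cyk_parse(seq):
    """Return (max log-probability, basepairs) of the most likely generation history."""
    n = len(seq)
    # best[i][j]: best log-probability of generating seq[i:j]; rule[i][j]: rule used.
    best = [[-math.inf] * (n + 1) for _ in range(n + 1)]
    rule = [[None] * (n + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        best[i][i], rule[i][i] = LOG_END, "end"

    for length in range(1, n + 1):            # shorter subsequences first
        for i in range(n - length + 1):
            j = i + length
            candidates = [
                (LOG_LEFT + LOG_SINGLE + best[i + 1][j], "left"),
                (LOG_RIGHT + LOG_SINGLE + best[i][j - 1], "right"),
            ]
            if length >= 2:
                candidates.append(
                    (LOG_PAIR + log_pair_emission(seq[i], seq[j - 1]) + best[i + 1][j - 1], "pair")
                )
            best[i][j], rule[i][j] = max(candidates)

    # Trace back from the full sequence to recover which letters were paired.
    pairs, i, j = [], 0, n
    while rule[i][j] != "end":
        if rule[i][j] == "pair":
            pairs.append((i, j - 1))
            i, j = i + 1, j - 1
        elif rule[i][j] == "left":
            i += 1
        else:
            j -= 1
    return best[0][n], pairs


score, pairs = cyk_parse("GGGAAACCC")
print(pairs, round(score, 3))
```

The traceback plays the role of the most likely generation history: it says which rule produced which letters, and so aligns the sequence to the (toy) model.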
ACCUGUUUCGACACAGGGAAGACAGAUGAGCA
Uses for SCFG models for RNA
• Multiple sequence alignment of RNA
• Searching genomes for RNAs homologous to a given RNA
• Infernal is a commonly used SCFG program
Note: Some RNAs are absolutely ancient: messenger RNA, ribosomal RNA, transfer RNA, but there are many RNAs that people are just learning about now, like microRNAs and other regulatory RNAs. They occur in UTRs, introns, and intergenic regions, and we need to be able to recognize them!