Research Article Characteristics and Prediction of RNA …Research Article Characteristics and...

Research ArticleCharacteristics and Prediction of RNA Structure

Hengwu Li12 Daming Zhu3 Caiming Zhang3 Huijian Han1 and Keith A Crandall2

1 School of Computer Science and Technology Shandong Provincial Key Laboratory of Digital Media TechnologyShandong University of Finance and Economics Jinan 250014 China

2 Computational Biology Institute George Washington University Ashburn VA 20147 USA3 School of Computer Science and Technology Shandong Provincial Key Laboratory of Software EngineeringShandong University Jinan 250101 China

Correspondence should be addressed to Hengwu Li hengwusdufeeducn

Received 28 February 2014 Accepted 11 June 2014 Published 6 July 2014

Academic Editor Maria Cristina De Rosa

Copyright copy 2014 Hengwu Li et al This is an open access article distributed under the Creative Commons Attribution Licensewhich permits unrestricted use distribution and reproduction in any medium provided the original work is properly cited

RNA secondary structures with pseudoknots are often predicted by minimizing free energy which is NP-hard Most RNAsfold during transcription from DNA into RNA through a hierarchical pathway wherein secondary structures form prior totertiary structures Real RNA secondary structures often have local instead of global optimization because of kinetic reasons Theperformance of RNA structure prediction may be improved by considering dynamic and hierarchical folding mechanisms Thisstudy is a novel report on RNA folding that accords with the golden mean characteristic based on the statistical analysis of the realRNA secondary structures of all 480 sequences from RNA STRAND which are validated by NMR or X-ray The length ratios ofdomains in these sequences are approximately 0382L 05L 0618L and L where L is the sequence length These points are just theimportant golden sections of sequence With this characteristic an algorithm is designed to predict RNA hierarchical structuresand simulate RNA folding by dynamically folding RNA structures according to the above golden section points The sensitivityand number of predicted pseudoknots of our algorithm are better than those of the Mfold HotKnots McQfold ProbKnot andLhw-Zhu algorithms Experimental results reflect the folding rules of RNA from a new angle that is close to natural folding

1 Introduction

RNAs are versatile molecules Messenger RNAs carry geneticinformation and act as the intermediary agent betweenDNAsand proteins ribosomal RNAs transfer RNAs and othernoncoding RNAs also have important structural regulatoryand catalytic functions in cells To completely understand thevarious functions of RNAs we need to first understand theirstructures The primary structure of RNA is the sequenceof nucleotides (ie four bases ACG and U) in the single-stranded polymer of RNA However these sequences are notsimply long strands of nucleotides In RNA complementarybases of guanine and cytosine pair can form three hydrogenbonds those of adenine and uracil pair can form twohydrogen bonds and those of guanine and uracil pair canform two hydrogen bonds RNA folds into a 3D structurethrough hydrogen bonding and base stacking which arenonconsecutive in the sequence Noncanonical pairing and

base-to-backbone hydrogen bonding also stabilize foldingThe 3D arrangement of atoms in a folded RNA moleculeis the tertiary structure the collection of base pairs in thetertiary structure is the secondary structure Experimentaldetermination of RNA tertiary structures is too expensiveand time consuming tomeet practical needs thus predictingRNA structure by computer becomes a basic method andissue in computational biology [1]

The secondary structures of RNA include the scaffoldof the tertiary structures Predicting RNA secondary struc-tures is the first step in predicting RNA tertiary structuresfrom RNA sequences Computational approaches for pre-dicting RNA secondary structures can be classified intothree families thermodynamic comparative and hybridThermodynamic approaches use dynamic programming tocompute the optimal secondary structure for a single RNAsequence with globally minimal free energy [2] based on aset of experimentally determined energy parameters [3] Such

Hindawi Publishing CorporationBioMed Research InternationalVolume 2014 Article ID 690340 10 pageshttpdxdoiorg1011552014690340

2 BioMed Research International

methods have been successful for relatively short RNAsMan-ually comparative approaches are more reliable than ther-modynamic approaches when many homologous sequencesare available Manually comparative approaches have beenused to establish the structures of known RNA familiesThese approaches compute a consensus structure on a set ofaligned RNA sequences by searching for covariance evidencebetween each of the base pairs Quantitative measures ofcovariance have been implemented in 1205942 statistics andmutual information Akmaev et al [4] also extended theseapproaches to explicitly consider sequence phylogeny andshowed positive results Hybrid approaches which haverecently emerged combine the advantages of thermodynamicand comparative approaches [5] Hybrid approaches considerboth thermodynamic stability and sequence covariance andproduce positive results on as few as three homologoussequences Other methods cannot be classified into any ofthese three families A few of these methods attempt tosimultaneously align and fold homologous sequences [6]Eddy and Durbin [7] introduced stochastic context-freegrammars to align homologous sequences iteratively andfound a consensus structure for them

A pseudoknot motif is a prevalent RNA structure Pseu-doknots serve various functions in biology [1] Plausible pseu-doknotted structures have been proposed and confirmed forthe 31015840-end of several plant viral RNAs where pseudoknots areapparently used tomimic tRNA structures Pseudoknots havebeen recently confirmed in some RNAs of humans and otherspecies [6]

Current studies on RNA secondary structure predictionhave not considered pseudoknots Optimizing secondarystructures including arbitrary pseudoknots is NP-hard [8]

Most RNA folding methods that can fold pseudoknotsadopt heuristic search procedures and sacrifice optimal-ity Examples of these methods include quasi-Monte Carlosearches and genetic algorithms These methods cannotguarantee the most optimal structure and cannot determinethe accuracy of a given prediction toward optimality [9ndash13]

A different approach to pseudoknot prediction adoptsdynamic programming to predict the tractable subclass ofpseudoknots based on complex thermodynamic models inO(1198994)ndashO(1198996) time [14ndash16] making them impractical even forsequences of a few hundred bases long

Comparative approaches can also be applied to predictpseudoknots and are more reliable than thermodynamicapproaches For example comparative analysis has revealedthe existence of pseudoknots in several RNAs [17] Howevercomparative analysis has typically been conducted in an adhoc manner from an algorithmic point of view

RNAs fold during transcription from DNA into RNACurrent RNA structure prediction by calculating the globaloptimal structure does not reflect the dynamic folding mech-anism of RNA [18]

Although DP can accurately predict a minimum energystructure within a given thermodynamic model the nativefold is often in a suboptimal energy state that significantlyvaries from the predicted one [19] A case may be made thatthe natural folding process of RNA and the simulated folding

of RNA using an evolutionary algorithm which includesintermediate folds have much in common [20 21]

The current study provides a novel report that RNAfolding accords with the golden mean characteristic basedon the statistical analysis of real RNA secondary structuresThe golden mean is also called the golden section or goldenratio Adolf Zeising found the golden ratio expressed in thearrangement of branches along the stems of plants and ofveins in leaves [22] He extended his research to the skeletonsof animals and the branches of their veins and nervesto the proportions of chemical compounds and geometryof crystals and even to the use of proportion in artisticendeavors In these phenomena he found the golden ratiooperating as a universal law In 2010 the journal Sciencereported that the golden ratio is present at the atomic scalein the magnetic resonance of spins in cobalt niobate crystals[23] Several researchers have proposed connections betweenthe golden ratio and human genome DNA [24 25]

Applying this characteristic we design a golden mean(GM) algorithm by dynamically folding RNA secondarystructures according to the golden section points andby forming pseudoknots subsequently folded betweennucleotides that did not pair in previous steps

We implement the method using thermodynamic dataand test the performance on PKNOTS and TT2NE data setFor PKNOTS data set the sensitivity and PPV of the Lhw-Zhu (LZ) and PKNOTS algorithms are increased by 2 to 3via the preprocessing of theGMmethod For TT2NEdata setthe GM method indicates good performance in predictingsecondary and pseudoknotted structures The experimentalresults reflect the folding rules of RNA from a new angle thatis close to natural folding

2 Materials and Methods

21 Structure Prediction Let sequence 119904 = 11990411199042sdot sdot sdot 119904119899be

a single-stranded RNA molecule where each base is 119904119894isin

AUCG 1 le 119894 le 119899 The subsequence 119904119894119895= 119904119894119904119894+ 1 sdot sdot sdot 119904

119895is

a segment of 119904 1 le 119894 le 119895 le 119899For first approximation the secondary structure is mod-

eled as follows If 119904119894and 119904

119895are complementary bases

(AampUCampGUampG) then 119904119894and 119904119895may constitute a base pair

(119894 119895) Each base can occur in one base pair or the set of basepairs can form a matching The secondary structures are alsononcrossing

Concretely a secondary structure 119878 on 119904 is a set of basepairs 119878 = (119894 119895) where 119894 119895 isin 1 2 119899 that satisfies thefollowing conditions

No Sharp Turns The ends of each pair in 119878 are separated by atleast four intervening bases that is if (119894 119895) isin 119878 then 119894 lt 119895 minus 3

For any pair (119894 119895) in 119878 (119894 119895) isin (AU) (CG) (UG)(UA) (GC) (GU)119878 is a matching no base appears in more than one pair

Noncrossing Condition If (119894 119895) and (119896 119897) are two pairs in 119878then they are compatible that is they are juxtaposed (eg119894 lt 119895 lt 119896 lt 119897) or nested (eg 119894 lt 119896 lt 119897 lt 119895)

BioMed Research International 3

Base pair and internal unpaired bases construct loops If(119894 119895) and (119894 + 1 119895 minus 1) isin 119878 base pairs (119894 119895) and (119894 + 1 119895 minus 1)constitute stack (119894 119894 + 1 119895 minus 1 119895) and 119898 (ge 1) consecutivestacks form the helix (119894 119894 + 119898 119895 minus 119898 119895) with the length of119898 + 1

If base pairs (119894 119895) and (119896 119897) are parallel (119894 lt 119895 lt 119896 lt119897 or 119896 lt 119897 lt 119894 lt 119895) or nested (119894 lt 119896 lt 119897 lt 119895 or 119896 lt 119894 lt 119895 lt 119897)then base pairs (119894 119895) and (119896 119897) are compatible otherwise theyare incompatible Such an incompatible structure is knownasa pseudoknot (eg 119894 lt 119896 lt 119895 lt 119897) More complexpseudoknots may occur if three or more base pairs cross eachother

The concept of the domain was first proposed in 1973by Wetlaufer after X-ray crystallographic studies on henlysozyme [26] and papain [27] and after limited proteolysisstudies on immunoglobulins [28 29] Wetlaufer defineddomains as stable units of protein structure that can foldautonomously Domains have been previously described asunits of compact structure [30] function and evolution [31]and folding [32]

Each domain forms a compact 3D structure and is oftenindependently stable and foldedMost domains have less than200 residues with an average of approximately 100 residues[33 34]

A domain 119863(1198941015840 1198951015840) consists of all (1198941015840 1198951015840) that satisfy(119894

1015840 119895

1015840) isin 119863(119894 119895) then 119894 lt 1198941015840 lt 1198951015840 lt 119895 A base pair can only

occur in one domainA domain is closed by a helix or pseudoknot (Figure 3)

A subdomain is an independently stable part of a domainIf the closed helix or pseudoknot of a domain is deleted itssubdomain will become a domain

By convention single strands of DNA and RNAsequences are written in 51015840-to-31015840 direction RNAs foldduring transcription from DNA into RNA The subsequence119904119894119895

begins to transcribe from the 51015840-end that is 119904119894 and

terminates transcription at the 31015840-end that is 119904119895(Figure 3)

The helix (119894 119894 + 119898 119895 minus 119898 119895) is completely folded after thetranscription of 119904

119895 We can determine that the 31015840-end of the

helix (119894 119894 + 119898 119895 minus 119898 119895) and the domain 119863(119894 119895) is the 31015840-endof subsequence 119904

119894119895 Let 119871 be the sequence length The length

ratio of the 31015840-end of the helix (119894 119894 + 119898 119895 minus 119898 119895) and thedomain 119863(119894 119895) to the sequence 119904

1119871is the ratio of 119895 to 119871 This

study determines the characteristic of the 31015840-end and thelength of domain in the sequences

22 Characteristic of Golden Section We compare the struc-tures of the test set of all 480 sequences (nonfragment andnonredundant) from RNA STRAND with secondary andpseudoknotted structures which are validated by NMR or X-ray

The results of statistical analysis on these real secondarystructures are shown in Figures 1 and 2 and Tables 1 2 3 and4 In Tables 1 and 2 Num represents the number of domainsgroup is the 31015840-end of group Ratio 1 is the ratio of Group 1to Group 2 and Ratio 2 is the ratio of Group 2 to Group 3 InFigures 1 and 2 the119909-axis represents the length ratio of the 31015840-end of domain to the sequence and the 119910-axis represents thenumber of sequences The number of complementary bases

50

40

30

20

10

0

01 02 03 038 045 055 062 07 08 09 1

Statistics of sequences with length between 50 and 90

in RNA STRAND

Num

ber o

f seq

uenc

es

3998400-end of domainlength of sequence

Figure 1 Distribution of domains for sequences with lengthsbetween 50 and 90 (a) For all NMR- or X-ray-validated data on 55nonfragment and nonredundant sequences with lengths between 50and 90 from RNA STRAND except for synthetic RNA the lengthratio of the 31015840-end of the domains to the sequence is computed andsummarized If one sequence has only one domain subdomains areselected (b) The 119909-axis represents the length ratio of the 31015840-end ofthe domain to the sequence and the 119910-axis represents the numberof sequences

20

25

30

35

10

15

0

5Num

ber o

f seq

uenc

es

01 02 03 038 045 055 062 07 08 09 1


Statistics of tRNA in RNA STRAND

Figure 2 Distribution of domains for tRNA (a) For all NMR- or X-ray-validated data on 35 nonfragment and nonredundant sequenceswith lengths more than 50 from RNA STRAND the length ratioof the 31015840-end of the domains to the sequence is computed andsummarized If one sequence has only one domain subdomains areselected (b) The 119909-axis represents the length ratio of the 31015840-end ofthe domain to the sequence and the 119910-axis represents the numberof sequences

to form a helix at point 119909 in the final structure is insufficientThus we enlarge point 119909 to region 119909 and the correspondingpoint 119910 in the 119910-axis with 119909 in the 119909-axis represents thenumber of sequences in the region of [119909 minus 04 119909]

For 16S rRNA 26 sequences belong to RNA STRANDThe statistical result is shown in Table 1 The number ofdomains varies from 4 to 8 The length ratio of the firstdomain to the sequence is shown as Ratio 2 The length ratioof the first subdomain to the first domain is shown as Ratio 1The two ratios are both close to 0618

For example four domains belong to sequencePDB 00409 namely D (3 726) D (731 1063) D (10841129) and D (1150 1171) In addition the subdomains of D(23 437) are D (443 705) and D (1150 1171) We can dividethe sequence into three groups namely 1ndash442 1ndash730 and731ndash1174 which represent the length of the entire sequenceThe ratio of Group 1 to Group 2 is 061 and the ratio of Group2 to Group 3 is 062 to 038


1

11 25

52

ACGGA

CCGU

U

AUA

AU

AA

A U A

UU G U

G

A A UC

AC

ACGUG

GU

GC

GUACGAUAA

CG

C

5998400-AUAAUAAAUAACGGAUUGUGUCCGUAAUCACACGUGGUGCGUACGAUAACGCAUAGUGUUUUUCCCUCCACUUAAAUCGAAGGG-3998400

1

11

25

37

5484

ACGGAU C C G U

GU

GC

GU A

CGCAU

AUAAUAA

AU

AUUGUG

AAUCAC

ACGUG

ACG A U

A

A G U G U U UUUCCC

UCCACUUA

AA

UC

GAAGGG

1

11

25

37

54

64

84

ACGGAU C C G U

G UG C G UACGCAU

CC

CUA

GG

GAUAA

UA

AAUA

UUGU G

AA U C A C A C G UG

A C GAUA

AGUGU

UU

UU

C CACU

UAAAUCGA

1

11

32

54

84

A C G G A

UGUGUCCGU

C A C A

GU

GC

GU

AC

GC

AU

AG

UG

CC

CU

CA

CU

AG

GGAU

AA

UA

AAUA

UAA

U

C G UG

AC G

AU

A

UUU

UU

C

UA

AA

UCGA

(a) The first fold (b) The second fold (c) The third fold

1

11 25

37

59

84

AC

GG

A UCCGU

A C A CG U G C G U

ACGCAUGUGU

CC

CU

AG

GGA

UA

AUAA

AU

A

UUG U

G

A A UC

GU G

A C GA

UA

A

UU

UU

C C ACU

UAAAU

CGA

(d) Predicted structure by GM (f) Predicted structure by PKNOTS

1

1332

54

84

GG

A

UG

UG

UC

C

CA

CA

CG

U

GU

GC

GU

AC

GA

CG

CA

U

AG

UG

CC

CU

CC

AC

U

GA

GG

GAUAA

UA

AA

UAAC

UGUA A U G

A U A

UUU

UU

UAA

AUCA

(e) Native structure

Figure 3 Computation of sequence TMVup (a) Status after the first fold of the GM algorithm (b) Status after the second fold of the GMalgorithm (c) Intermediate result of the last step of the GM algorithm (d) Final fold and predicted structure by GM (e) Native structure ofTMVup (f) Final fold and predicted structure by PKNOTS (g) The top shows the sequence of TMVup

For long 5S rRNA 23S rRNAGI Intron rRNA and tRNAin RNA STRAND the statistical results are shown in Table 2The number of domains varies from 1 to 6 The length ratioof the first subdomain to the first domain for 5S rRNA is058 and that of GI Intron is close to 037 The length ratioof the second domain to the sequence for tRNA and otherrRNA is close to 038 The length ratio of the third domain tothe sequence for tRNA is close to 06 and that of the fourthdomain to the sequence for other rRNA is 059 For 23S rRNAthe subdomains areD (16 2625)D (2630 2788) andD (26472726)The subdomains ofD (16 2625) areD (29 476)D (5791261) D (1269 2011) and D (2023 2040) D (16 2625) is apseudoknot Ratio 1 is the ratio of 1261 to 2040 and Ratio 2 is2040 to the sequence length

For short RNAs sequences with lengths less than 50have generally only one domain and one or two simplehelices For sequences with lengths between 50 and 90except for synthetic RNA the folding 31015840-end of domainsand subdomains is centered on three regions (Figure 1)

The sequences have one to three domains and that with onlyone domain has two or three subdomainsThe sequences foldon three regionsThe first folds in the region of 035L to 038Lthe second in the region of 06L to 0618L and the third inthe region of 09L to 10L Among 33 sequences 25 28 and49 have helix 31015840-ends located in points 0382L 0618L and Lrespectively

For tRNAs with lengths more than 50 the folding 31015840-end of domains and subdomains is centered on four regions(Figure 2)The sequences have one to three domains and thatwith only one domain has two or three subdomains Among35 sequences 15 14 14 and 33 have domain or subdomain 31015840-ends located at points 0382L 05L 0618L and L respectively

The data corresponding to Figure 1 are shown in Table 3The data corresponding to Figure 2 are shown in Table 4The sequences tend to fold twice or thrice For the

sequences that fold twice the first folds in the region of 05Land the second in the region of 119871 For the sequences that foldthrice the first folds in the region of 035L to 038L the second


Table 1 Distribution of domains for 16S rRNA

Sequence ID Length Num Group 1 Group 2 Group 3 Ratio 1 Ratio 2PDB 00409 1174 4 442 730 1174 061 062PDB 00456 1514 4 548 896 1514 061 059PDB 00478 1513 4 545 893 1513 061 059PDB 00643 1688 9 559 913 1688 061 054PDB 00645 1533 4 559 915 1533 061 060PDB 00703 1466 4 520 868 1466 060 059PDB 00769 1526 4 559 915 1533 061 060PDB 00791 1530 4 562 916 1533 061 060PDB 00811 1521 5 545 893 1521 061 059PDB 00812 1522 5 545 893 1522 061 059PDB 00813 1522 5 545 893 1522 061 059PDB 00814 1522 5 545 893 1522 061 059PDB 00946 1673 6 545 895 1673 061 053PDB 00952 1673 8 545 895 1673 061 053PDB 01015 1527 5 546 894 1527 061 059PDB 01039 1511 4 545 893 1511 061 059PDB 01088 1716 7 532 880 1716 060 051PDB 01101 1667 5 545 893 1667 061 054PDB 01103 1793 6 545 893 1667 061 054PDB 01105 1694 5 545 893 1667 061 054PDB 01107 1505 4 545 893 1533 061 058PDB 01139 1687 6 545 893 1673 061 053PDB 01198 1660 5 545 893 1667 061 054PDB 01240 1685 5 545 893 1667 061 054PDB 01268 1528 5 545 893 1522 061 059PDB 01269 1529 5 545 893 1522 061 059

Table 2 Distribution of domains for other long RNAs

Sequence ID Type Length Num Group 1 Group 2 Group 3 Ratio 1 Ratio 2PDB 00030 5S rRNA 120 1 69 120 058PDB 00082 GI Intron 315 1 116 315 037PDB 00140 GI Intron 314 1 116 316 037PDB 00398 tRNA 380 5 148 228 408 038 06PDB 01144 Other rRNA 408 6 151 240 408 037 059PDB 00029 23S RRNA 2904 1 1261 2040 2904 062 070

in the region of 06L to 0618L and the third in the region ofL

In mathematics and the arts two quantities belong to thegolden ratio if their ratio is the same as the ratio of their sumto their maximum Expressed algebraically for quantities 119886and 119887 with 119886 gt 119887 119886119887 = (119886 + 119887)119886 = phi = 1618 and thequantities are 0382 (119886 + 119887) and 0618 (119886 + 119887) These quantitiesincrease several unique ratios including 0618 0382 and1618 that is the golden ratio These ratios exist throughoutnature from population growth to the physical structurewithin the human brain the DNA helix many plants andeven the cosmos itself The golden ratio is also called thegolden section or golden mean

The golden mean and the numbers of the Fibonacciseries (0 1 1 2 3 5 8 ) have been used with significant

success in analyzing and predicting stock market motionElliott presented wave theory in which the frequent waverelationships are golden ratios (191 236 382 50618 and 809) It has a striking similarity to RNA foldingWe can determine that 0382 05 and 0618 are the onlyimportant RNA folding points (Tables 1 to 4) The aboveresults of statistical analysis also confirm this view Almostall sequences are folded at L and approximately half of thesequences are folded at points 0382L 05L and 0618L Forlong sequences the other two key fold points 0236L and0809L may be found which are 0382L times 0618 and 05L times1618

Therefore these points would be closer to natural foldingand obtain higher accuracy than before if RNA is dynamicallyfolded according to the above golden points


Table 3 Distribution of domain and subdomain for sequences with lengths between 50 and 90

Ratio 005 01 015 02 025 03 035 0382 040 045 05 055 06 0618 065 07 075 08 085 09 095 1Number 1 0 0 1 1 1 29 25 4 6 6 4 22 28 11 4 2 1 10 17 8 49

Table 4 Distribution of domains for tRNA

Ratio 005 01 015 02 025 03 035 0382 040 045 05 055 06 0618 065 07 075 08 085 09 095 1Number 1 0 0 3 0 2 14 15 3 2 14 3 10 14 6 5 0 1 6 5 6 33

(1) For (i = startLen i le sequence length)(11) Run the basic prediction algorithm of secondary structures to producematrix Z and trace back Z to obtain a base pair list L(12) Identify all helices in L and combine helices separated by small internalloops or bulges(13) Assign a score to each helix by summing up the scores of its constitutivebase pairs or stacks Select the helix H according to helix length and scoreand merge H into the base pair list S to be reported(14) Remove positions of H from the initial sequence(15) Assign next golden point to i

End for(2) Compute pseudoknots on the remaining sequences by the crossing of two

subsequences Trace back to obtain a base pair list K and merge K into S(3) Report the base pair list S and terminate

Algorithm 1

23 Dynamic Algorithm RNAs fold through a hierarchicalpathway in which the helices and loops are rapidly formedas secondary structures and the subsequent slow folding ofthe 3D tertiary structures would consolidate the secondarystructures [20 21] RNAs also fold during transcription fromDNA into RNA Therefore we first compute the secondarystructure and then predict pseudoknotted structures Wefold RNA secondary structures as DNA is transcribed intoRNA The length of RNA sequences is gradually increasedaccording to the above golden points and only reliable helicesare accepted

For example we first fold the subsequence 119904152

of TMVupwith a length of 52 that is 0618L and form the helix (1115 21 25) (Figure 3(a)) Then we fold the subsequence 119904

184

of TMVup with a length of 84 that is 0618L times 1618 =L and form the helix (37 42 49 54) (Figure 3(b)) Finallywe fold two pseudoknots (Figure 3(d))Figure 3(c) shows theintermediate result of the last step where the helix is formed(64 67 81 84)

One pseudoknot can consist of one helix and two subse-quences and one of the two is included in the helixThereforewe can compute the crossing of two subsequences Pseudo-knots consist of one internal and one external subsequenceof helix and secondary structures consist of two externalsubsequences of helicesThe residual sequence consists of sixsequences after secondary structure folding (Figure 3(c))Thesubsequences 119904

1620and 1199042636

form the helix (17 20 29 32)(Figure 3(d)) and the subsequences 119904

6880and 1199045563

form the

helix (55 58 69 74) (Figure 3(d)) The helices (55 58 69 74)and (64 67 81 84) form one pseudoknot and the helices (1720 29 32) and (11 15 21 25) form another pseudoknot

The sketch of the GM algorithm is as in Algorithm 1The method uses a dynamic programming strategy to

compute the secondary structure values of 119885(119894 119895) for all 119894and 119895 in the start subsequence The subsequence is theniteratively lengthened to the next golden point to computethe secondary structure values of 119885(119894 119895) for all 119894 and 119895 in thenew subsequence Finally the pseudoknots are computed bythe crossing of two subsequences and the values of119885(119894 119895) areprovided for all 119894 and 119895 in the entire sequence In this studywe use our previous LZ algorithm to compute pseudoknottedstructures [16]

For the sequence of TMVup the startLen is 0618L thatis 52 We fold the subsequence 119904

152and select the helix H

(11 15 21 25) with the maximum value assign 119894 as anothergolden point 1618times 52 that is L and select the helixH (37 4249 54)The LZ algorithm computes the residual subsequenceand forms the last structure The ratio of the latter goldenpoint to the former one is 1618 and the length of the startsubsequence should be between 40 and 70

The time complexity of steps 11 to 15 is O(1198993) andthe number of iterative computations is less than 10 forsequences with lengths less than 1000 Therefore the firststep takes O(1198993) time In this study the time complexityof step 2 is less than O(1198995) which is similar to that of the


LZ algorithm Therefore the time complexity of the GMalgorithm is maintained as O(1198995)

3 Results and Discussion

To illustrate the effect of our algorithm tests are dividedinto two parts one for pseudoknotted sequences and anotherfor mixed data of pseudoknot-free and pseudoknottedsequences We select two data sets One is TT2NE data set totest pseudoknotted structuresThis data set contains 47 pseu-doknotted sequences from PseudoBase and PDB Anotheris PKNOTS data set to test secondary and pseudoknottedstructures This data set includes 116 sequences including 25tRNA sequences randomly selected from the Sprinzl tRNAdatabaseHIV-1-RT-ligandRNApseudoknots and some viralRNAs

The accuracy of an algorithm is measured by bothsensitivity and PPV Let RP (real pair) be the number of basepairs in the real RNA structure TP (true positive) the numberof correctly predicted base pairs and FP (false positive) thenumber of wrongly predicted base pairs Burset and Guigo[35] defined SE (sensitivity) as TPRP and PPV (positivepredictive value) as TP(TP + FP) in 1996

31 Results of PKONTS Data Set In this section we test thePKNOTS data set which has become the benchmark datasetto predict RNA structures and present the prediction resultsof our method compared with the PKNOTS algorithm [14]and our previous LZ algorithm [16]

To explore the effect of the GM method in differentmodels we test twomodels and compare the difference beforeand after GM processing First we run PKONTS and LZalgorithm on the PKNOTS data set and obtain the outputof the results We then run the first step of the GM methodto form the frame of secondary structures by dynamicallyfolding sequences at the golden points and select one stablehelixwith theminimumenergy at each fold Subsequently weobtain the partially folded sequences as the input of PKONTSand LZ algorithm and run them with the same energy modeland parameters as above We fold all sequences of the test setand obtain the results The test results are shown in Table 5

For each algorithm the percentages of sensitivity andPPV are shown in Table 5 The value is the usual aver-age of sensitivity and PPV values of all sequences Thedetailed results are displayed as Supplementary data inSupplementary Material available online at httpdxdoiorg1011552014690340

The improved PKNOTS algorithm is compared with thePKNOTS algorithm in both sensitivity and PPV (Table 5)The improved PKNOTS algorithm increases the sensitivityfrom 828 to 855 and improves the PPV from 789 to808

The improved LZ algorithm is compared with theLZ algorithm in both sensitivity and PPV (Table 5) Theimproved LZ algorithm increases the sensitivity from 848to 877 and improves the PPV from 807 to 828

The improved LZ and PKNOTS algorithms both have 2to 3higher sensitivity than the LZ andPKNOTS algorithms

Table 5 Prediction results of improved PKNOTS and LZ algorithm

RNAs SE PPVPKNOTS-105 828 789Improved PKNOTS-105 855 808LZ 848 807Improved LZ 877 828

respectively The improved LZ algorithm outperforms thePKNOTS algorithm by 49The tests of improved PKNOTSand improved LZ indicate that the GM method may also beapplied to other algorithms of RNA structure prediction toimprove the prediction sensitivity and reduce the predictedredundant base pairs

Both of the improved LZ and PKNOTS algorithmsincrease the accuracy of many sequences (eg BiotonDF0660 DG7740 DI1140 DP1780 DV3200 and DY4840)from which we can determine the influence of the goldenmean characteristic

For example in GM the first DF0660 sequence is foldedat golden point 0618L and forms the helix (27 32 40 45)which exists in real structure except for one base pair (2745) (Figure 4(a)) This sequence is then folded at point L andselects the helix (2 7 67 72) which exists in real structure(Figure 4(b)) Subsequently this sequence selects two helices[(10 13 23 26) and (50 54 63 67)] which exist in realstructure (Figure 4(c))

However the PKNOTS algorithm selects two helices [(816 27 35) and (38 40 47 49)] which do not exist in realstructure (Figure 4(e)) Only the form of helix (27 32 40 45)at the first golden point prevents the formation of a large helix(8 16 27 35) in GM

For pseudoknotted structure in GM sequence TMVup isfolded to form two pseudoknots and one helix (37 42 49 54)which exist in real structure except for one redundant basepair (11 25) (Figure 3(d))The sequence alsomissed one helix(34 36 43 45) which is one pseudoknot (Figure 3(d))

In the PKNOTS algorithm sequence TMVup is folded toform four helices three of which are equal to those in GMbut one helix (29 33 56 59) is redundant (Figure 3(f)) Thissequence also missed all three pseudoknots Only the form ofthe helix (29 33 56 59) prevents the form of pseudoknots

Further statistical analysis shows that the selected helicesby the first step of GM basically belong to the real structuresand control the folding pathway This finding is preciselyascribed to the GM processing before PKONTS and LZimprove the accuracy

32 Results of TT2NE Data Set In this section we testthe TT2NE data set which includes 47 sequences fromPseudoBase and PDB which is also a subset of those used inthe HotKnots

We present the prediction results of our method com-pared with those of HotKnots [9] McQfold [10] ProbKnot[11] TT2NE [13] Mfold [36] and LZ [16] MFold is restrictedto secondary structures that are free of pseudoknots whereasothers can result in any topology of pseudoknot


1

27 45

G

C

U

G

G

A

C

C

A

G

U

GC

C

A

A

GG

UA

GUUCAG

CC

U

G

G

G

AG

AA

C

C

U

U

G A AG

A

(a) The first fold

12745 74

GCCAAGGGCUGGAC C A G U C C U U G G C

UA

GU

UCAGCCUGGGAG

AA

C

CU

U

GA

A G A UG

UC

GG G U G U U C G A A U

CA

CC

C A

(b) The second fold

1

26 45

74GCCAAGG

GUUCG A A C

GC

UG

GA U

CCAGU

GGGUGCACCCC

CUUGGC

UA

AGCCU

G G G A

CUGA AG

A

U GUC

U UC

GAAU

A

(c) Predicted structure by GM

1

26

44

74GCCAAGG

GUUCCG A A

CU

GG

A UCCAG

GG

GU

GC

AC

CCC

CUUGGC

UAAGCC

UGG G A G

CUGA AG

A

U UGUC

UUC

GAAU

A

(d) Native structure

1

35 49

74GCCAAGGUAGUUCAGC

G C U G G A C U GG

AUGU

CG G G U G

CACCCCCUUGGC

CUGGG

AG

A A C

AA

CC AG

UU

U UCG

AAU

A

(e) Predicted structure by PKNOTS

Figure 4 Predicted structure of sequence DF0660 (a) Status after the first fold of GM algorithm (b) Status after the second fold of GMalgorithm (c) Final fold and predicted structure by GM (d) Native structure of DF0660 (e) Final fold and predicted structure by PKNOTS

This set includes most of the sequences where HotKnotshas been tested and shown to perform better than ILM andPKnots-rg Thus we will not compare GM to these latteralgorithms

The total number of base pairs to be predicted in thisset is 1115 Mfold HotKnots McQfold ProbKnots TT2NELZ and GM predicted 618 671 740 669 870 785 and 798respectively The total numbers of predicted base pairs are1024 1019 991 1041 1146 1102 and 1112 respectively

The sensitivity of GM is better than that of the otheralgorithms except for TT2NE either on the average 1 or onthe average 2 (Table 6) The PPV of GM is better than that ofthe other algorithms except for TT2NE and McQfold eitheron the average 1 or on the average 2

However TT2NE has shown the most feasible value afterseveral times folding with different parametersThus TT2NEis not the autocomputing value When given new sequenceswe do not know which result should be selected

The part of computing pseudoknots in GM is the sameas that in the LZ algorithm and the difference between themlies in the preprocessing of secondary structures The perfor-mance ofGM is better than that of the LZ algorithm either onthe average 1 or on the average 2 GM improves the sensitivity

of sequences Bs glmS EC S15 and HDV antigenomic from36 59 and 44 to 51 100 and 84 respectively GM alsoimproves the PPVof these sequences from57 63 and 34 to 6774 and 64 respectively However for sequences AMV3 andBVDV the performance of GM is only below LZ

The number of predicted genera by GM is better thanthat by other algorithms except for TT2NE (Table 6) AllMfolds predictions have genus 0 because Mfold generatesonly structures without pseudoknots

However TT2NE predicted more redundancy generathan other algorithms For example GLV IRES R2 retro PKand 1y0q sequences have only one native pseudoknot buttwo pseudoknots are predicted by TT2NE Bs glmS has twonative pseudoknots but three are predicted by TT2NE

The percentages of sensitivity and PPV of each algorithmare shown The column Gen indicates the predicted numberof genera and GR is the redundancy number in the predictedgenus which is more than the number of native generaAverage 1 is the total number of base pairs that are correctlypredicted in the entire database divided by the correspondingtotal number of native base pairs (average sensitivity) and thetotal number of predicted base pairs (average PPV) Average2 is the usual average of sensitivity and PPV values of all


Table 6 Prediction results of TT2NE data set

Mfold HotKnots McQfold ProbKnot TT2NE LZ GMSE PPV SE PPV SE PPV SE PPV SE PPV SE PPV SE PPV

Average 1 55 60 60 67 66 75 60 64 78 76 70 71 72 72Average 2 50 58 63 67 68 77 54 63 81 80 73 73 75 74Sum 0 0 18 0 25 0 3 0 47 4 28 0 31 0

Gen GR Gen GR Gen GR Gen GR Gen GR Gen GR Gen GR

sequences The detailed results are displayed as Supplemen-tary data

4 Conclusions

In this study we provide a novel report that RNA foldingaccords with the golden mean characteristic based on thestatistical analysis of real RNA secondary structuresThe fold-ing 31015840-end points of the sequence are almost close to 0382L05L 0618L and L These points are the important goldensections of sequence Applying this characteristic we designa GM algorithm by dynamically folding RNA secondarystructures according to the above golden section points andby forming pseudoknots with the crossing of subsequencesWe implement the method using thermodynamic data andtest its performance on PKNOTS and TT2NE data sets

For PKNOTS data set we preprocess the sequence withthe first step of GM and then obtain the output of thepartially folded sequence as the input of the PKNOTS andLZ algorithms Subsequently the two algorithms improve by2 to 3 The reason is that the partial folded sequenceforms its structural frame In other words the folding at thegolden points controls the folding pathway and subsequentlyprevents the formation of some redundant structures

Preprocessing of GM may also be applied to other algo-rithms of RNA structure prediction to improve the predictionaccuracy and to reduce the predicted redundant base pairs

For TT2NE data set the sensitivity and number ofpredicted pseudoknots of GM are better than those of MfoldHotKnots McQfold ProbKnot and LZ The PPV of GM isbetter than that of Mfold HotKnot ProbKnot and LZThesefindings indicate that the GMmethod has good performancein predicting secondary and pseudoknotted structures Thesensitivity and PPV of the GM algorithm surpass those ofmost algorithms of RNA structure prediction The experi-mental results reflect the folding rules of RNA from a newangle that is close to natural folding

The performance of long sequence needs to be furtherexplored and parameters for improvement and testing inlarge data sets to support web service are topics for futurestudies

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgments

This work was partly supported by NSFC under Grant nos61070019 and 61272431 the Shan Dong Province NaturalScience Foundation of China under Grant nos ZR2011FL029and ZR2013FM016 the Open Project Program of the Shan-dong Provincial Key Lab of Software Engineering underGrant no 2011SE004 and the Program for Scientific ResearchInnovation Team in Colleges and Universities of ShandongProvince The authors thank the anonymous reviewers fortheir detailed and useful comments

References

[1] D W Staple and S E Butcher ldquoPseudoknots RNA structureswith diverse functionsrdquo PLoS Biology vol 3 no 6 article e2132005

[2] M Zuker and P Stiegler ldquoOptimal computer folding of largeRNA sequences using thermodynamics and auxiliary informa-tionrdquo Nucleic Acids Research vol 9 no 1 pp 133ndash148 1981

[3] D H Mathews J Sabina M Zuker and D H TurnerldquoExpanded sequence dependence of thermodynamic parame-ters improves prediction of RNA secondary structurerdquo Journalof Molecular Biology vol 288 no 5 pp 911ndash940 1999

[4] V Akmaev S Kelley and G Stormo ldquoA phylogenetic approachto RNA structure predictionrdquo in Proceedings of the InternationalConference on Intelligent Systems forMolecular Biology vol 7 pp10ndash17 1999

[5] I L Hofacker ldquoVienna RNA secondary structure serverrdquoNucleic Acids Research vol 31 no 13 pp 3429ndash3431 2003

[6] DHMathews andDH Turner ldquoPrediction of RNA secondarystructure by free energy minimizationrdquo Current Opinion inStructural Biology vol 16 no 3 pp 270ndash278 2006

[7] S R Eddy and R Durbin ldquoRNA sequence analysis usingcovariance modelsrdquo Nucleic Acids Research vol 22 no 11 pp2079ndash2088 1994

[8] S Ieong M Kao T W Lam W K Sung and S M YiuldquoPredicting RNA secondary structures with arbitrary pseudo-knots by maximizing the number of stacking pairsrdquo Journal ofComputational Biology vol 10 no 6 pp 981ndash995 2003

[9] J Ren B Rastegari A Condon and H H Hoos ldquoHotKnotsheuristic prediction of RNA secondary structures includingpseudoknotsrdquo RNA vol 11 no 10 pp 1494ndash1504 2005

[10] D Metzler and M E Nebel ldquoPredicting RNA secondarystructures with pseudoknots by MCMC samplingrdquo Journal ofMathematical Biology vol 56 no 1-2 pp 161ndash181 2008

[11] S Bellaousov and D H Mathews ldquoProbKnot fast prediction ofRNA secondary structure including pseudoknotsrdquo RNA vol 16no 10 pp 1870ndash1880 2010


[12] X Chen S He D Bu et al ldquoFlexStem improving predictionsof RNA secondary structures with pseudoknots by reducingthe search spacerdquo Bioinformatics vol 24 no 18 pp 1994ndash20012008

[13] M Bon and H Orland ldquoTT2NE a novel algorithm to predictRNA secondary structures with pseudoknotsrdquo Nucleic AcidsResearch vol 39 no 14 p e93 2011

[14] E Rivas and S R Eddy ldquoA dynamic programming algorithmfor RNA structure prediction including pseudoknotsrdquo Journalof Molecular Biology vol 285 no 5 pp 2053ndash2068 1999

[15] J Reeder and R Giegerich ldquoDesign implementation andevaluation of a practical pseudoknot folding algorithmbased onthermodynamicsrdquo BMC Bioinformatics vol 5 article 104 2004

[16] L Hengwu Z Daming L Zhendong and L Hong ldquoPredictionfor RNA planar pseudoknotsrdquo Progress in Natural Science vol17 no 6 pp 717ndash724 2007

[17] I Barrette G Poisson P Gendron and F Major ldquoPseudoknotsin prion protein mRNAs confirmed by comparative sequenceanalysis and pattern searchingrdquo Nucleic Acids Research vol 29no 3 pp 753ndash758 2001

[18] F H D van Batenburg A P Gultyaev C W A Pleij J Ng andJ Oliehoek ldquoPseudoBase a database with RNA pseudoknotsrdquoNucleic Acids Research vol 28 no 1 pp 201ndash204 2000

[19] K C Wiese and A Hendriks ldquoComparison of P-RnaPredictandmfold-algorithms forRNAsecondary structure predictionrdquoBioinformatics vol 22 no 8 pp 934ndash942 2006

[20] C Shijie T Zhijie C Song and Z Wenbing ldquoThe statisticalmechanics of RNA foldingrdquo Physics vol 3 pp 218ndash229 2006

[21] W Chuanming P Min and C Kui ldquoRNA foldingrdquo Nature vol26 pp 249ndash255 2004

[22] R Padovan ldquoProportion science philosophy architecturerdquoNexus Network Journal vol 4 no 1 pp 113ndash122 2002

[23] J C Perez ldquoChaos DNA and neuro-computers a golden linkrdquoSpeculations in Science and Technology vol 14 no 4 pp 336ndash346 1991

[24] M E B Yamagishi and A I Shimabukuro ldquoNucleotide fre-quencies in human genome and Fibonacci numbersrdquo Bulletinof Mathematical Biology vol 70 no 3 pp 643ndash653 2008

[25] J C Perez ldquoCodon populations in single-stranded wholehuman genome DNA Are fractal and fine-tuned by the GoldenRatio 1618rdquo Interdisciplinary Sciences Computational Life Sci-ences vol 2 no 3 pp 228ndash240 2010

[26] D C Phillips ldquoThe three-dimensional structure of an enzymemoleculerdquo Scientific American vol 215 no 5 pp 78ndash90 1966

[27] J Drenth J N Jansonius R Koekoek H M Swen and B GWolthers ldquoStructure of papainrdquo Nature vol 218 no 5145 pp929ndash932 1968

[28] R R Porter ldquoStructural studies of immunoglobulinsrdquo Sciencevol 180 no 4087 pp 713ndash716 1973

[29] G M Edelman ldquoAntibody structure and molecular immunol-ogyrdquo Science vol 180 no 4088 pp 830ndash840 1973

[30] J S Richardson ldquoThe anatomy and taxonomy of proteinstructurerdquo Advances in Protein Chemistry vol 34 pp 167ndash3391981

[31] P Bork ldquoShuffled domains in extracellular proteinsrdquo FEBSLetters vol 286 no 1-2 pp 47ndash54 1991

[32] D B Wetlaufer ldquoNucleation rapid folding and globular intra-chain regions in proteinsrdquo Proceedings of the National Academyof Sciences of the United States of America vol 70 no 3 pp 697ndash701 1973

[33] S A Islam J Loo and M J E Sternberg ldquoIdentification andanalysis of domains in proteinsrdquo Protein Engineering vol 8 no6 pp 513ndash525 1995

[34] S J Wheelan A Marchler-Bauer and S H Bryant ldquoDomainsize distributions can predict domain boundariesrdquo Bioinformat-ics vol 16 no 7 pp 613ndash618 2000

[35] M Burset and R Guigo ldquoEvaluation of gene structure predic-tion programsrdquo Genomics vol 34 no 3 pp 353ndash367 1996

[36] M Zuker ldquoMfold web server for nucleic acid folding andhybridization predictionrdquoNucleic Acids Research vol 31 no 13pp 3406ndash3415 2003

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Anatomy Research International

PeptidesInternational Journal of


Hindawi Publishing Corporation httpwwwhindawicom

International Journal of

Volume 2014

Zoology


Molecular Biology International

GenomicsInternational Journal of


The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014


BioinformaticsAdvances in

Marine BiologyJournal of



Signal TransductionJournal of


BioMed Research International

Evolutionary BiologyInternational Journal of



Biochemistry Research International

ArchaeaHindawi Publishing Corporationhttpwwwhindawicom Volume 2014


Genetics Research International


Advances in

Virolog y

Hindawi Publishing Corporationhttpwwwhindawicom

Nucleic AcidsJournal of

Volume 2014

Stem CellsInternational



Enzyme Research



Microbiology


methods have been successful for relatively short RNAsMan-ually comparative approaches are more reliable than ther-modynamic approaches when many homologous sequencesare available Manually comparative approaches have beenused to establish the structures of known RNA familiesThese approaches compute a consensus structure on a set ofaligned RNA sequences by searching for covariance evidencebetween each of the base pairs Quantitative measures ofcovariance have been implemented in 1205942 statistics andmutual information Akmaev et al [4] also extended theseapproaches to explicitly consider sequence phylogeny andshowed positive results Hybrid approaches which haverecently emerged combine the advantages of thermodynamicand comparative approaches [5] Hybrid approaches considerboth thermodynamic stability and sequence covariance andproduce positive results on as few as three homologoussequences Other methods cannot be classified into any ofthese three families A few of these methods attempt tosimultaneously align and fold homologous sequences [6]Eddy and Durbin [7] introduced stochastic context-freegrammars to align homologous sequences iteratively andfound a consensus structure for them

A pseudoknot motif is a prevalent RNA structure Pseu-doknots serve various functions in biology [1] Plausible pseu-doknotted structures have been proposed and confirmed forthe 31015840-end of several plant viral RNAs where pseudoknots areapparently used tomimic tRNA structures Pseudoknots havebeen recently confirmed in some RNAs of humans and otherspecies [6]

Current studies on RNA secondary structure predictionhave not considered pseudoknots Optimizing secondarystructures including arbitrary pseudoknots is NP-hard [8]

Most RNA folding methods that can fold pseudoknotsadopt heuristic search procedures and sacrifice optimal-ity Examples of these methods include quasi-Monte Carlosearches and genetic algorithms These methods cannotguarantee the most optimal structure and cannot determinethe accuracy of a given prediction toward optimality [9ndash13]

A different approach to pseudoknot prediction adoptsdynamic programming to predict the tractable subclass ofpseudoknots based on complex thermodynamic models inO(1198994)ndashO(1198996) time [14ndash16] making them impractical even forsequences of a few hundred bases long

Comparative approaches can also be applied to predictpseudoknots and are more reliable than thermodynamicapproaches For example comparative analysis has revealedthe existence of pseudoknots in several RNAs [17] Howevercomparative analysis has typically been conducted in an adhoc manner from an algorithmic point of view

RNAs fold during transcription from DNA into RNACurrent RNA structure prediction by calculating the globaloptimal structure does not reflect the dynamic folding mech-anism of RNA [18]

Although DP can accurately predict a minimum energystructure within a given thermodynamic model the nativefold is often in a suboptimal energy state that significantlyvaries from the predicted one [19] A case may be made thatthe natural folding process of RNA and the simulated folding

of RNA using an evolutionary algorithm which includesintermediate folds have much in common [20 21]

The current study provides a novel report that RNAfolding accords with the golden mean characteristic basedon the statistical analysis of real RNA secondary structuresThe golden mean is also called the golden section or goldenratio Adolf Zeising found the golden ratio expressed in thearrangement of branches along the stems of plants and ofveins in leaves [22] He extended his research to the skeletonsof animals and the branches of their veins and nervesto the proportions of chemical compounds and geometryof crystals and even to the use of proportion in artisticendeavors In these phenomena he found the golden ratiooperating as a universal law In 2010 the journal Sciencereported that the golden ratio is present at the atomic scalein the magnetic resonance of spins in cobalt niobate crystals[23] Several researchers have proposed connections betweenthe golden ratio and human genome DNA [24 25]

Applying this characteristic we design a golden mean(GM) algorithm by dynamically folding RNA secondarystructures according to the golden section points andby forming pseudoknots subsequently folded betweennucleotides that did not pair in previous steps

We implement the method using thermodynamic dataand test the performance on PKNOTS and TT2NE data setFor PKNOTS data set the sensitivity and PPV of the Lhw-Zhu (LZ) and PKNOTS algorithms are increased by 2 to 3via the preprocessing of theGMmethod For TT2NEdata setthe GM method indicates good performance in predictingsecondary and pseudoknotted structures The experimentalresults reflect the folding rules of RNA from a new angle thatis close to natural folding

2 Materials and Methods

21 Structure Prediction Let sequence 119904 = 11990411199042sdot sdot sdot 119904119899be

a single-stranded RNA molecule where each base is 119904119894isin

AUCG 1 le 119894 le 119899 The subsequence 119904119894119895= 119904119894119904119894+ 1 sdot sdot sdot 119904

119895is

a segment of 119904 1 le 119894 le 119895 le 119899For first approximation the secondary structure is mod-

eled as follows If 119904119894and 119904

119895are complementary bases

(AampUCampGUampG) then 119904119894and 119904119895may constitute a base pair

(119894 119895) Each base can occur in one base pair or the set of basepairs can form a matching The secondary structures are alsononcrossing

Concretely a secondary structure 119878 on 119904 is a set of basepairs 119878 = (119894 119895) where 119894 119895 isin 1 2 119899 that satisfies thefollowing conditions

No Sharp Turns The ends of each pair in 119878 are separated by atleast four intervening bases that is if (119894 119895) isin 119878 then 119894 lt 119895 minus 3

For any pair (119894 119895) in 119878 (119894 119895) isin (AU) (CG) (UG)(UA) (GC) (GU)119878 is a matching no base appears in more than one pair

Noncrossing Condition If (119894 119895) and (119896 119897) are two pairs in 119878then they are compatible that is they are juxtaposed (eg119894 lt 119895 lt 119896 lt 119897) or nested (eg 119894 lt 119896 lt 119897 lt 119895)







1015840 119895
















50

40

30

20

10

0

01 02 03 038 045 055 062 07 08 09 1


in RNA STRAND

Num

ber o

f seq

uenc

es



20

25

30

35

10

15

0

5Num

ber o

f seq

uenc

es

01 02 03 038 045 055 062 07 08 09 1








1

11 25

52

ACGGA

CCGU

U

AUA

AU

AA

A U A

UU G U

G

A A UC

AC

ACGUG

GU

GC

GUACGAUAA

CG

C


1

11

25

37

5484

ACGGAU C C G U

GU

GC

GU A

CGCAU

AUAAUAA

AU

AUUGUG

AAUCAC

ACGUG

ACG A U

A

A G U G U U UUUCCC

UCCACUUA

AA

UC

GAAGGG

1

11

25

37

54

64

84

ACGGAU C C G U

G UG C G UACGCAU

CC

CUA

GG

GAUAA

UA

AAUA

UUGU G

AA U C A C A C G UG

A C GAUA

AGUGU

UU

UU

C CACU

UAAAUCGA

1

11

32

54

84

A C G G A

UGUGUCCGU

C A C A

GU

GC

GU

AC

GC

AU

AG

UG

CC

CU

CA

CU

AG

GGAU

AA

UA

AAUA

UAA

U

C G UG

AC G

AU

A

UUU

UU

C

UA

AA

UCGA


1

11 25

37

59

84

AC

GG

A UCCGU

A C A CG U G C G U

ACGCAUGUGU

CC

CU

AG

GGA

UA

AUAA

AU

A

UUG U

G

A A UC

GU G

A C GA

UA

A

UU

UU

C C ACU

UAAAU

CGA


1

1332

54

84

GG

A

UG

UG

UC

C

CA

CA

CG

U

GU

GC

GU

AC

GA

CG

CA

U

AG

UG

CC

CU

CC

AC

U

GA

GG

GAUAA

UA

AA

UAAC

UGUA A U G

A U A

UUU

UU

UAA

AUCA





















Ratio 005 01 015 02 025 03 035 0382 040 045 05 055 06 0618 065 07 075 08 085 09 095 1Number 1 0 0 1 1 1 29 25 4 6 6 4 22 28 11 4 2 1 10 17 8 49


Ratio 005 01 015 02 025 03 035 0382 040 045 05 055 06 0618 065 07 075 08 085 09 095 1Number 1 0 0 3 0 2 14 15 3 2 14 3 10 14 6 5 0 1 6 5 6 33




Algorithm 1




184



1620and 1199042636


6880and 1199045563

form the































1

27 45

G

C

U

G

G

A

C

C

A

G

U

GC

C

A

A

GG

UA

GUUCAG

CC

U

G

G

G

AG

AA

C

C

U

U

G A AG

A

(a) The first fold

12745 74


UA

GU

UCAGCCUGGGAG

AA

C

CU

U

GA

A G A UG

UC


CA

CC

C A

(b) The second fold

1

26 45

74GCCAAGG

GUUCG A A C

GC

UG

GA U

CCAGU

GGGUGCACCCC

CUUGGC

UA

AGCCU

G G G A

CUGA AG

A

U GUC

U UC

GAAU

A


1

26

44

74GCCAAGG

GUUCCG A A

CU

GG

A UCCAG

GG

GU

GC

AC

CCC

CUUGGC

UAAGCC

UGG G A G

CUGA AG

A

U UGUC

UUC

GAAU

A


1

35 49

74GCCAAGGUAGUUCAGC

G C U G G A C U GG

AUGU

CG G G U G

CACCCCCUUGGC

CUGGG

AG

A A C

AA

CC AG

UU

U UCG

AAU

A


















4 Conclusions








Acknowledgments


References













































Volume 2014

Zoology






















Advances in

Virolog y



Volume 2014




Enzyme Research



Microbiology







1015840 119895
















50

40

30

20

10

0

01 02 03 038 045 055 062 07 08 09 1


in RNA STRAND

Num

ber o

f seq

uenc

es



20

25

30

35

10

15

0

5Num

ber o

f seq

uenc

es

01 02 03 038 045 055 062 07 08 09 1








1

11 25

52

ACGGA

CCGU

U

AUA

AU

AA

A U A

UU G U

G

A A UC

AC

ACGUG

GU

GC

GUACGAUAA

CG

C


1

11

25

37

5484

ACGGAU C C G U

GU

GC

GU A

CGCAU

AUAAUAA

AU

AUUGUG

AAUCAC

ACGUG

ACG A U

A

A G U G U U UUUCCC

UCCACUUA

AA

UC

GAAGGG

1

11

25

37

54

64

84

ACGGAU C C G U

G UG C G UACGCAU

CC

CUA

GG

GAUAA

UA

AAUA

UUGU G

AA U C A C A C G UG

A C GAUA

AGUGU

UU

UU

C CACU

UAAAUCGA

1

11

32

54

84

A C G G A

UGUGUCCGU

C A C A

GU

GC

GU

AC

GC

AU

AG

UG

CC

CU

CA

CU

AG

GGAU

AA

UA

AAUA

UAA

U

C G UG

AC G

AU

A

UUU

UU

C

UA

AA

UCGA


1

11 25

37

59

84

AC

GG

A UCCGU

A C A CG U G C G U

ACGCAUGUGU

CC

CU

AG

GGA

UA

AUAA

AU

A

UUG U

G

A A UC

GU G

A C GA

UA

A

UU

UU

C C ACU

UAAAU

CGA


1

1332

54

84

GG

A

UG

UG

UC

C

CA

CA

CG

U

GU

GC

GU

AC

GA

CG

CA

U

AG

UG

CC

CU

CC

AC

U

GA

GG

GAUAA

UA

AA

UAAC

UGUA A U G

A U A

UUU

UU

UAA

AUCA





















Ratio 005 01 015 02 025 03 035 0382 040 045 05 055 06 0618 065 07 075 08 085 09 095 1Number 1 0 0 1 1 1 29 25 4 6 6 4 22 28 11 4 2 1 10 17 8 49


Ratio 005 01 015 02 025 03 035 0382 040 045 05 055 06 0618 065 07 075 08 085 09 095 1Number 1 0 0 3 0 2 14 15 3 2 14 3 10 14 6 5 0 1 6 5 6 33




Algorithm 1




184



1620and 1199042636


6880and 1199045563

form the































1

27 45

G

C

U

G

G

A

C

C

A

G

U

GC

C

A

A

GG

UA

GUUCAG

CC

U

G

G

G

AG

AA

C

C

U

U

G A AG

A

(a) The first fold

12745 74


UA

GU

UCAGCCUGGGAG

AA

C

CU

U

GA

A G A UG

UC


CA

CC

C A

(b) The second fold

1

26 45

74GCCAAGG

GUUCG A A C

GC

UG

GA U

CCAGU

GGGUGCACCCC

CUUGGC

UA

AGCCU

G G G A

CUGA AG

A

U GUC

U UC

GAAU

A


1

26

44

74GCCAAGG

GUUCCG A A

CU

GG

A UCCAG

GG

GU

GC

AC

CCC

CUUGGC

UAAGCC

UGG G A G

CUGA AG

A

U UGUC

UUC

GAAU

A


1

35 49

74GCCAAGGUAGUUCAGC

G C U G G A C U GG

AUGU

CG G G U G

CACCCCCUUGGC

CUGGG

AG

A A C

AA

CC AG

UU

U UCG

AAU

A


















4 Conclusions








Acknowledgments


References













































Volume 2014

Zoology






















Advances in

Virolog y



Volume 2014




Enzyme Research



Microbiology


1

11 25

52

ACGGA

CCGU

U

AUA

AU

AA

A U A

UU G U

G

A A UC

AC

ACGUG

GU

GC

GUACGAUAA

CG

C


1

11

25

37

5484

ACGGAU C C G U

GU

GC

GU A

CGCAU

AUAAUAA

AU

AUUGUG

AAUCAC

ACGUG

ACG A U

A

A G U G U U UUUCCC

UCCACUUA

AA

UC

GAAGGG

1

11

25

37

54

64

84

ACGGAU C C G U

G UG C G UACGCAU

CC

CUA

GG

GAUAA

UA

AAUA

UUGU G

AA U C A C A C G UG

A C GAUA

AGUGU

UU

UU

C CACU

UAAAUCGA

1

11

32

54

84

A C G G A

UGUGUCCGU

C A C A

GU

GC

GU

AC

GC

AU

AG

UG

CC

CU

CA

CU

AG

GGAU

AA

UA

AAUA

UAA

U

C G UG

AC G

AU

A

UUU

UU

C

UA

AA

UCGA


1

11 25

37

59

84

AC

GG

A UCCGU

A C A CG U G C G U

ACGCAUGUGU

CC

CU

AG

GGA

UA

AUAA

AU

A

UUG U

G

A A UC

GU G

A C GA

UA

A

UU

UU

C C ACU

UAAAU

CGA


1

1332

54

84

GG

A

UG

UG

UC

C

CA

CA

CG

U

GU

GC

GU

AC

GA

CG

CA

U

AG

UG

CC

CU

CC

AC

U

GA

GG

GAUAA

UA

AA

UAAC

UGUA A U G

A U A

UUU

UU

UAA

AUCA





















Ratio 005 01 015 02 025 03 035 0382 040 045 05 055 06 0618 065 07 075 08 085 09 095 1Number 1 0 0 1 1 1 29 25 4 6 6 4 22 28 11 4 2 1 10 17 8 49


Ratio 005 01 015 02 025 03 035 0382 040 045 05 055 06 0618 065 07 075 08 085 09 095 1Number 1 0 0 3 0 2 14 15 3 2 14 3 10 14 6 5 0 1 6 5 6 33




Algorithm 1




184



1620and 1199042636


6880and 1199045563

form the































1

27 45

G

C

U

G

G

A

C

C

A

G

U

GC

C

A

A

GG

UA

GUUCAG

CC

U

G

G

G

AG

AA

C

C

U

U

G A AG

A

(a) The first fold

12745 74


UA

GU

UCAGCCUGGGAG

AA

C

CU

U

GA

A G A UG

UC


CA

CC

C A

(b) The second fold

1

26 45

74GCCAAGG

GUUCG A A C

GC

UG

GA U

CCAGU

GGGUGCACCCC

CUUGGC

UA

AGCCU

G G G A

CUGA AG

A

U GUC

U UC

GAAU

A


1

26

44

74GCCAAGG

GUUCCG A A

CU

GG

A UCCAG

GG

GU

GC

AC

CCC

CUUGGC

UAAGCC

UGG G A G

CUGA AG

A

U UGUC

UUC

GAAU

A


1

35 49

74GCCAAGGUAGUUCAGC

G C U G G A C U GG

AUGU

CG G G U G

CACCCCCUUGGC

CUGGG

AG

A A C

AA

CC AG

UU

U UCG

AAU

A


















4 Conclusions








Acknowledgments


References













































Volume 2014

Zoology






















Advances in

Virolog y



Volume 2014




Enzyme Research



Microbiology













Ratio 005 01 015 02 025 03 035 0382 040 045 05 055 06 0618 065 07 075 08 085 09 095 1Number 1 0 0 1 1 1 29 25 4 6 6 4 22 28 11 4 2 1 10 17 8 49


Ratio 005 01 015 02 025 03 035 0382 040 045 05 055 06 0618 065 07 075 08 085 09 095 1Number 1 0 0 3 0 2 14 15 3 2 14 3 10 14 6 5 0 1 6 5 6 33




Algorithm 1




184



1620and 1199042636


6880and 1199045563

form the































1

27 45

G

C

U

G

G

A

C

C

A

G

U

GC

C

A

A

GG

UA

GUUCAG

CC

U

G

G

G

AG

AA

C

C

U

U

G A AG

A

(a) The first fold

12745 74


UA

GU

UCAGCCUGGGAG

AA

C

CU

U

GA

A G A UG

UC


CA

CC

C A

(b) The second fold

1

26 45

74GCCAAGG

GUUCG A A C

GC

UG

GA U

CCAGU

GGGUGCACCCC

CUUGGC

UA

AGCCU

G G G A

CUGA AG

A

U GUC

U UC

GAAU

A


1

26

44

74GCCAAGG

GUUCCG A A

CU

GG

A UCCAG

GG

GU

GC

AC

CCC

CUUGGC

UAAGCC

UGG G A G

CUGA AG

A

U UGUC

UUC

GAAU

A


1

35 49

74GCCAAGGUAGUUCAGC

G C U G G A C U GG

AUGU

CG G G U G

CACCCCCUUGGC

CUGGG

AG

A A C

AA

CC AG

UU

U UCG

AAU

A


















4 Conclusions








Acknowledgments


References













































Volume 2014

Zoology






















Advances in

Virolog y



Volume 2014




Enzyme Research



Microbiology
























1

27 45

G

C

U

G

G

A

C

C

A

G

U

GC

C

A

A

GG

UA

GUUCAG

CC

U

G

G

G

AG

AA

C

C

U

U

G A AG

A

(a) The first fold

12745 74


UA

GU

UCAGCCUGGGAG

AA

C

CU

U

GA

A G A UG

UC


CA

CC

C A

(b) The second fold

1

26 45

74GCCAAGG

GUUCG A A C

GC

UG

GA U

CCAGU

GGGUGCACCCC

CUUGGC

UA

AGCCU

G G G A

CUGA AG

A

U GUC

U UC

GAAU

A


1

26

44

74GCCAAGG

GUUCCG A A

CU

GG

A UCCAG

GG

GU

GC

AC

CCC

CUUGGC

UAAGCC

UGG G A G

CUGA AG

A

U UGUC

UUC

GAAU

A


1

35 49

74GCCAAGGUAGUUCAGC

G C U G G A C U GG

AUGU

CG G G U G

CACCCCCUUGGC

CUGGG

AG

A A C

AA

CC AG

UU

U UCG

AAU

A


















4 Conclusions








Acknowledgments


References













































Volume 2014

Zoology






















Advances in

Virolog y



Volume 2014




Enzyme Research



Microbiology







4 Conclusions








Acknowledgments


References













































Volume 2014

Zoology






















Advances in

Virolog y



Volume 2014




Enzyme Research



Microbiology


































Volume 2014

Zoology






















Advances in

Virolog y



Volume 2014




Enzyme Research



Microbiology

Research Article Characteristics and Prediction of RNA …Research Article Characteristics and...

Documents

Transcript of Research Article Characteristics and Prediction of RNA …Research Article Characteristics and...