Analysis of RNA Sequence Structure Maps by Exhaustive ... · Analysis of RNA Sequence Structure...

Analysis of RNA SequenceStructure Maps by ExhaustiveEnumerationWalter GrünerRobert GiegerichDirk StorthmannChristian ReidysJacqueline Weber

SFI WORKING PAPER: 1995-10-099

SFI Working Papers contain accounts of scientific work of the author(s) and do not necessarily represent theviews of the Santa Fe Institute. We accept papers intended for publication in peer-reviewed journals or proceedings volumes, but not papers that have already appeared in print. Except for papers by our externalfaculty, papers must be based on work done at SFI, inspired by an invited visit to or collaboration at SFI, orfunded by an SFI grant.©NOTICE: This working paper is included by permission of the contributing author(s) as a means to ensuretimely distribution of the scholarly and technical work on a non-commercial basis. Copyright and all rightstherein are maintained by the author(s). It is understood that all persons copying this information willadhere to the terms and constraints invoked by each author's copyright. These works may be reposted onlywith the explicit permission of the copyright holder.www.santafe.edu

SANTA FE INSTITUTE

Analysis of RNA Sequence Structure Maps

by Exhaustive Enumeration

By

Walter Gr�unera� Robert Giegerichb� Dirk Strothmannb�

Christian Reidysa� Jacqueline Webera�

Ivo L� Hofackerc� Peter F� Stadlerc�d� and Peter Schustera�c�d��

aInstitut f�ur Molekulare Biotechnologie

Beutenbergstra�e �� D�� Jena� Germany

bTechnische Fakult�at� Univ Bielefeld

Postfach �� D�� Bielefeld� Germany

cInstitut f�ur Theoretische Chemie� Universit�at Wien

W�ahringerstra�e �� A�� Wien� Austria

dSanta Fe Institute

�� Hyde Park Rd� Santa Fe� NM �� USA

�Mailing Address�Institut f�ur Molekulare Biotechnologie

Beutenbergstra�e �� PF �� D�� Jena� GermanyPhone� �� Fax� ��

E�Mail� pks�imb�jena�de

Gr�uner et al�� Analysis of RNA Sequence�Structure Maps

Abstract

Global relations between RNA sequences and secondary structues are understood as mappingsfrom sequence space into shape space� These mappings are investigated by exhaustive foldingof all GC and AU sequences with chain lengths up to �� The technique od tries is usedfor economic data storage and fast retrieval of information� The computed structural data areevaluated through exhaustive enumeration and used as an exact reference for testing analyticalresults derived from mathematical models and sampling based of statistical methods� Severalnew concepts of RNA sequence to secondary structure mappings are investigated� among themthe structure of neutral networks �being sets of sequences folding into the same structure��percolation of sequence space by neutral networks� and the principle of shape space covering �The data of exhaustive enumeration are compared to the analytical results of a random graphmodel that reveals the generic properties of sequence to structure mappings based on some basepairing logic� The di�erences between the numerical and the analytical results are interpreted in

terms of specic biophysical properties of RNA molecules�

�� Introduction

Conventional biophysics considers sequence structure relations of biopolymers pri�

marily aiming at a solution of the folding problem� given a sequence which struc�

ture does it form under the speci�ed experimental conditions� for example the

minimum free energy structure formed at thermodynamic equilibrium or kineti�

cally determined structures corresponding to the outcome of the folding process

under certain conditions Many problems in current molecular biology and bio�

technology �� however� raise questions that cannot be answered satisfactorily by

this approach Required is instead a di�erent view that considers the set of all

�possible� sequences as an entity which is mapped onto the set of all �possible�

structures Such problems are� for example� the sensitivity of structures against

mutations in the underlying sequences �� the in�uence of nucleotide distributions

��A� �U� �G� �C� on structures �� as well as the inverse folding problem�

given a RNA secondary structure which sequences fold into it under the speci�ed

conditions ��

The evolution of RNA molecules in replication assays� viroids and RNA viruses

can be viewed as an adaptation process on a �tness landscape in the sense of

� � �


Sewall Wright�s imagination �� The dynamics of evolution is thus tightly linked

to the structure of an underlying landscape Global features of landscapes can be

described by statistical measures like numbers of optima� lengths of walks� and

correlation functions� see� for example� �� Statistical characteristics of

RNA landscapes are accessible on the level of secondary structures by mathemati�

cal analysis and computer calculations� these RNA landscapes belong to the same

class as well known optimization problems and simple spin glass models ��

The notion of a landscape has been extended to combinatory maps� thereby allo�

wing for a direct statistical investigation of the sequence structure relationships of

RNA at the level of secondary structures �� Extensive computational studies

have revealed that the frequencies of structures are highly non�uniform� that se�

quences sharing the same structure are distributed randomly over sequence space�

that there exist neutral paths in sequence space along which structures remain

une�ected by mutations� and that any desired secondary structure is formed by

sequences that can be found close to an arbitrary initial sequence These results

provide convincing evidence that RNA landscapes are as simple as they could

possibly be for evolutionary adaptation The consequences for evolutionary op�

timization� the early stages of life� and molecular biotechnology are immediate

Based on these �ndings a random�graph theory was developed �� that explains

the structure of neutral networks in terms of a single parameter� the frequency of

neutral mutations The predictions of this theory� among them connectedness and

density of the neutral networks cannot be veri�ed by a statistical approach based

on sampling tiny fractions of sequence space Comparison of the results derived

from the random graph approach with real RNA folding data allows to separate

generic properties of sequence to structure mappings from nucleotide speci�c bio�

chemical phenomena

In this contribution we report the computational techniques that are necessary

to exhaustively generate all sequences and their secondary structures for two�

letter alphabets up to a chain length of n � �� In essence� the main task is

� � �


easily stated� compute the secondary structures of all sequences and then group

together the sequences that fold into the same secondary structure� ie� produce

an explicit representation of all neutral networks Once this has been done the

analysis proceeds by determining the geometric structure of these networks

This paper is organized as follows� In section � we review the folding algorithm

and the data structures used to represent RNA structures In section � we present

data obtained directly from the exhaustive search� such as the over�all number

of di�erent minimum free energy structures� the fraction of open structures� and

the distribution of preimage sizes Section � is analyzes and discusses the internal

structures of neutral networks� in particular the size distributions of their connec�

ted components Section � gives a glimpse at the relative locations of neutral

networks belonging to di�erent structures We present strong evidence for �shape

space covering�� almost all common structures can be found within a fairly small

ball around any random sequence Section � contains a discussion of the results

and in particular a comparison with the random graph approach

� � �


�� RNA Secondary Structures

��RNA Folding

The biochemical and biophysical properties of RNA molecules are determined by

their spatial structures In case of RNA the process of folding the one�dimensional

primary structure �sequence� into the three�dimensional tertiary structure can be

decomposed into two steps�

�� Folding of the sequence into a secondary structure by formation of comple�

mentary Watson�Crick base pairs� G�C and A�U� and the weaker G�U

pairs

�� Formation of the three�dimensional tertiary structure from the planar pattern

Such a decomposition is meaningful since the intramolecular forces stabilizing the

secondary structures base pairing and base pair stacking are much stronger

than those accounting for arrangement of the secondary structure elements in

space Thus the free energy of formation for the three�dimensional structure can

be estimated by the free energy of the formation for the secondary structure The

dominant role of secondary structures is also well documented in nature since

secondary structure elements are conserved in evolution ��

Table �� Folding Algorithms for RNA Secondary Structures

Algorithm � Abbr� Remark Reference

deterministic

Minimum Free Energy � MFE fast ��

Kinetic Folding ! KIN fast ��

�� Folding ! �� fast ��

Partition Function � PF ensemble ��

Maximum Matching � MM unrealistic ��

stochastic

Simulated Annealing ! SA very slow ��

� Pseudo�knots can be included� The major problem with the prediction of pseudo�knots is�however� the lack of su�cient experimental energy parameters�

� � �


A variety of computer programs predicting RNA secondary structures have been

published A very brief overview is given in Table � Two public domain packages

for RNA folding are currently available by anonymous ftp� Zuker�s mfold ��

and the Vienna RNA Package �� All these programs make use of essentially the

same energy model for the formation of secondary structures It explicitly assumes

that there are no knots or pseudo�knots

interior base pairs

G

C

A

U

C G

A A

5

3

closing base pair

multiple loop

5

3

5

closing base pair

G

C A

U

interior base pair

5

closing base pair

G

C

A

U

C

3

5

3

closing base pair

stacking pair

interior loop

G A

UC

G

G

G

C

CA

A5

3

closing base pairinterior base pair

hairpin loop

bulge

UAinterior base

pair

3U A C G C A A U

joint free end

joints and free ends

Figure �� RNA secondary structures can be decomposed into loops and external elements� Theenergy of a particular structure is the sum of energy contributions for all its loops�

Each secondary structure is viewed as being composed of stacked base pairs� loops�

and external elements which are neither part of a stack nor of a loop For the sake

of a uniform notation two stacked base pairs can be viewed formally as a special

type of loop consisting of exactly four nucleotides Depending on the topology of

the loop one distinguishes di�erent loop types� hairpin loops have only one base

pair� stacked base pairs consist of exactly two base pairs� bulges have two base pairs

� � �


adjacent to each other and at least one unpaired base� interior loops have two base

pairs which are not adjacent to each other� and multi�loops contain at least three

base pairs� see Figure � The energy of a secondary structure is the sum of energy

contributions of all loops in the structure These contributions depend on the

loop type� the loop size� and partly on the particular sequence of nucleotides in

the loop The individual energy parameters were determined experimentally� see

eg �� While the prediction of the folding algorithm for a particular

sequence depends strongly on the details of the parameter set� such details do not

a�ect the global features of the sequence structure relations ��

The data used in this contribution have been produced with the fold�� algorithm

contained in the Vienna RNA Package �� It is based on a high performance

implementation �� of the Zuker�Sanko� algorithm �� The energy parameters

used in this package are an updated version of �� provided by Danielle Konings

��

��Representation of Secondary Structures

RNA secondary structures are commonly drawn as secondary structure graphs in

the biochemical literature Equivalently� though less intuitively� circle plots are

used sometimes� therein the sequence is arranged in a circle� and base pairs are

indicated as chords connecting the pairing nucleotides The non �pseudo�knot

condition implies that chords do not intersect in this representation A number

of equivalent representations have been developed for special applications Secon�

dary structures can be translated into rooted planar trees by mapping base pairs

to internal nodes and unpaired nucleotides to leaves �� This tree representation

is equivalent to a mountain representation �� obtained by indicating a nucleotide

which is paired with a base towards the �� end by a positive slope� a base with

pairing partner to the �� end by a negative slope� and an unpaired base by horizon�

tal line segment� Both the tree representation and the mountain representation

� � �


have been used for comparing secondary structures and for computing similarity

measures between di�erent secondary structures ��

The tree and the mountain representation are equivalent to a string representation

A base paired with a partner towards the ��end is denoted by "�� a nucleotide

pairing with a partner towards the ��end by "�� and "� is used for representing

unpaired positions Each secondary structure is then uniquely determined by a

string of length n taken from the alphabet "�� Not all strings formed from this

alphabet are valid secondary structures� biophysical constraints require that a

hairpin loop contains at least three unpaired bases and that each open bracket "��

has to be matched by a closed bracket "�� It is easy to see that the set of all valid

secondary structures of chain length n � � is in fact generated by the context free

grammar

G � S � "� j "�S j S"� j SS j "��S"��

This representation is used for I#O in the Vienna RNA Package It is also used in

this contribution Given a string like "�� its parse according to the grammar

G uniquely determines pairs of matching parentheses Similar grammars are used

for modeling tRNA structure families in �� Relating some sequence x to

some structure s� we call xi� xj to be in contact or matched ��and ��positions

when the positions i and j are a base�pair in s� ie they correspond to matching

parentheses in s

��Folding as a Combinatory Map

Natural RNA sequences are strings of length n over the alphabet fG�C�A�Ug The

canonical distance measure between two sequences of equal length n is the Ham�

ming distance �� counting the positions in which to end�to�end aligned sequences

di�er

� � �


The relation between sequences and structures is mediated by a folding algorithm�

see above� in our case the routine fold�� of the Vienna RNA Package In the

reminder of this paper we will write f�x� instead of fold�x� The biological

question of sequence�structure relations translates into the mathematical question

of determining properties of the mapping

f � Qn� � Yn �

where Qn� is the generalized hypercube �Hamming graph� of dimension n over an

alphabet of size � and Yn is the shape space of all secondary structures of length n

The shape space can be viewed as metric space by either using the trivial metric

or one of the distance measures based on the tree representation or mountain

representation of secondary structures

The term combinatory map was introduced for mappings from a discrete con�gu�

ration space into some metric space as a generalization of the notion of a landscape

� � It is clear that the folding map f is not necessarily injective� since there are

less than �n secondary structures as compared to �n sequences From the structure

statistics given in �� immediately follows that f is not onto in general� structures�

as derived from the grammar G� typically contain many isolated �non�stacked�

base pairs Consequently� the preimage f��s�� ie� the set of all sequences ac�

tually folding into s� will be in general a fairly large set The structure statistics

�� indicates� however� that only a vanishingly small fraction of all secondary struc�

tures obtained from folding contains isolated base pairs Consequently� the image

of the sequence space� f�Qn�� will in general be a proper subset of Yn

The obvious questions to ask about f are thus the following

� How large is the image of sequence space f�Qn��$

� How large are the preimages of given structures$

� What is the distribution of preimage sizes$

� How many sequences do not form secondary structures� ie� how large is the

preimage of the open structure$

� �


� How is f��s� embedded in sequence space$

Many of these questions have been at least partially answered in previous papers

describing non�exhaustive computer simulations �� as mentioned in

the introduction Based on these results a random graph model �� was concei�

ved that allows to construct the preimages of a �given� secondary structures s as

functions of a single parameter� the �average� fraction of neutral neighbors of the

sequences folding into s �see the section on neutral networks below�

The random graph model reveals the generic properties of sequence to structure

mappings based on base pairing It is based merely on the existence of some

de�nitions of legal pairings that need not be the complementary pairs �preferential

AA orGG pairs as observed in homo�DNA �� would be equally acceptable� The

exhaustive enumeration data presented here can be used to check the predictions

of the random graph model and they may help to detect and interpret systematic

deviations of data obtained by RNA folding from this idealized reference

��Compatibility of Sequences with Structures

The mere notion of base pairing implies an a priori relation between sequences

and structures which depends only on the legal base pairs in a given alphabet� but

which is completely independent of the particular values of the energy parameters

For instance� the legal base pairs in the natural AUGC alphabet are AU� UA�

GC� CG�GU� and UG Note that we neglect non�standard pairings such as AA�

GA� AG� or UU which have been observed in natural RNA structures ��

��

De�nition� A sequence x is compatible with the secondary structure s if �xi� xj�

is a legal base pair for all matched "�� and "��positions i and j �� Given a

secondary structure s we shall denote the set of all compatible sequences by C�s�

� �


Of course� the mere fact that x � C�s� does by no means imply that f�x� � s�

ie� that x really folds into the secondary structure s Trivially� we have however

f��s� � C�s�� ie� all sequences folding into a given secondary structure have

to be compatible with it The notion of compatible sequences is essential for any

deeper understanding of sequence structure relationships in RNA

��Neutral Networks

��Rearranging the set of compatible sequences

Since our main interest is the representation of f��s� in sequence space let us

�rst consider the geometry of the sets C�s� of compatible sequences in the sequence

space Qn� The structure of C�s� is complicated by the di�erence between paired

and unpaired positions C�s� does not in general give rise to a connected subgraph

of sequence space� but decomposes into hyperplanes which are characterized by a

particular choice of the base pairs at paired positions �Neighboring� hyperplanes

have either Hamming distance dH � �� for instance if GC is replaced by GU� or

Hamming distance dH � � if GC is replaced by any other legal pairing �AU� UA�

CG� UG� Note that GC�GU�AU gives rise to three hyperplanes which are

connected with each other%

Allowing for both point mutations and base pair exchanges rede�nes the neigh�

borhood relations in C�s� by introducing new edges We shall denote this graph

by C�s� It is clearly connected In the following we will examine its structure in

detail Given a seconary structure s� a sequence x � C�s� has any one of the �

letters at each position i for which i � "�� while positions j and k corresponding

to matching brackets "�� and "�� are positions that are occupied by any one of the

� legal base pairs Thus a sequence x � C�s� can be represented by the nucleotides

in its unpaired positions� and the encoding base pairs in terms of letters from an

� ��


�� structure

CGCCGGCGGCGCCCGGGGCC AUGGGUCUCCGACAGUCCGG sequence

C��GGCCGGC A��UCCAGUG base pairs reduced

�� unpaired bases reduced

�� reduced sequence

G�� C� A�� U��

GC�� CG� AU�� UA�� GU� UG��

Figure �� The e�ect of procedure reduce�

alphabet of size � at the positions of the open brackets All letters at the closed

brackets can then be deleted An example is given in Figure �

Later in this paper we will make use of the procedure reduce performing this

contraction of the sequence�

reduce�s��

reduce�s�� x� � �� reduce�s� x� �

reduce�s�� x� � �� reduce�s� x� and

reduce�s�� x� � reduce�s� x� �

Given the reduced sequence and the structure� one can of course reconstruct the

original sequence In fact� it is easy to write a function expand such that

expand�s� reduce�s� x�� x

for all s � Y and all x � C�s�

Reduced strings can be rearranged further� we �rst write the u unpaired positions

and then the p base pairs �n � u! �p� Thus we can interpret reduce�s� x� as an

element of the direct product space Qu� Q

p� � ie� C�s� is isomorphic to the product

� ��


of two Hamming graphs with possible distinct alphabets� which correspond to the

unpaired and paired positions� respectively As an immediate consequence of these

considerations we note the following

Lemma� Let x� y � C�s� The graphical distance of x and y in the graph C�s�

coincides with the Hamming distance of their reduced representations�

dC�x� y� � dH� reduce�s� x�� reduce�s� y� � �

In particular� there is an edge in the graph C�s� if and only if

dH� reduce�s� x�� reduce�s� y� � � � �

Using reduced sequences provides two advantages�

�� The storage requirements are reduced We can represent the preimage f��s�

now as the pair �s�X�� where X is the list of all reduced sequences folding into

s Note that the length of a reduced sequence is

jreduce�s� x�j � u! p � n� p�

In the case of the GC alphabet this reduces the memory requirements by

about ��

�� Comparison of two sequences is more e&cient� again because the representa�

tion is shorter

�� The distance between two sequences can be computed very e&ciently since it

coincides with their Hamming distance by the above lemma

� ��


��De�nition of neutral networks

Mutations which do not a�ect the �tness of an organism are called neutral in

biology By the same token� the term neutral is used in the context of RNA

to mean mutations which do not alter the �secondary� structure Hence f��s�

contains all the sequences which can be considered as neutral mutants of each

other

De�nition� The set f��s� considered as an induced subgraph of C�s� is called

the neutral network� N �s�� of the secondary structure s

The motivation for the term �network� will become clear later on De�ning N �s�

as induced subgraph of C�s�� rather than as induced subgraph of the sequence space

Qn� itself� avoids the peculiarities introduced by the logic of base pairing On the

other hand� the neighborhood relation no longer coincides with the action of mu�

tation Hence we have traded technical tractability for biophysical interpretation

The most important characteristic of a neutral network N �s� is its connectivity

In order to retain as much biologically relevant information as possible we consider

the unpaired and the paired part of the sequence separately and de�ne �u�s� and

�p�s� as the average fraction of neutral mutations in Qu� and Q

p� � respectively

��Neutral networks of RNA secondary structures

Let us brie�y summarize the properties of the neutral networks of RNA minimum

free energy structures

� The distribution of preimage sizes follows roughly a so called generalized Zipf�s

law� ie� their rank�order statistics follows a distribution function of the form

��r� � A�� ! rB��

where r is the rank�order of a structure� ��r� its frequency� A gives the ab�

undance of the most frequent structures� B � measures the number of

� ��


�frequent� structures� and � � determines the shape of the power�law tail

��

� A neutral path is a path fx�� x�� x�g in N �s� such that dC�x�� x��

ie� a neutral path if obtained by selecting a start sequence x� folding into

s and then successively choosing neutral neighbors in such a way that the

�graphical� distance to the starting sequence is increased with each accepted

step The length � of a neutral path is therefore a lower bound on the diameter

of the neutral network N �s� Computer studies based on RNA folding have

shown that the average values of � are much larger than the distance between

randomly chosen sequences for all common structures A precise de�nition

of a common structure will be given below� see sect �� In this sense� the

neutral networks of common sequences reach through all of sequence space�

or more precisely� the set of compatible sequences ��

� Inverse folding provides a means of estimating the distance from a random

sequence to the nearest sequence folding into a desired target structure s

One �nds that sequence space is covered with fairly small balls each of which

contains almost all common structures The radius R of these balls is only

slightly larger than the average distance from a random sequence to the set

of compatible sequences for any common structure s This property has been

termed shape space covering in ��

� The connectivities �p and �u become constant for large n when averaged over

large samples of randomly chosen sequences �� ie� RNA sequence�structure

maps are characterized by a very high degree of neutrality

��Random graph theory of neutral networks

The �ndings from the computational studies as listed above prompted us to search

for a generic statistical model �with as few parameters as possible� that could

explain the data As a �rst step we have recently proposed a random graph model

explaining structures and properties of individual neutral networks ��

� ��


As already stated C�s� is the direct product of the hypercubes Qu�Q

p� and proceed

by constructing random subgraphs in each of the graphs Qu��Q

p� separately This

is done by selecting each vertex of Qu�� Q

p� with independent probabilities �u and

�p Here �vertex in Qu�� means the �unpaired� part of the sequence x and �vertex

in Qp�� means the �paired� part of the sequence x �that is compatible wih s�

Thereby one obtains randomly induced subgraphs 'u Qu��'p Qp

� and the

neutral network N �s� is given by

N �s� def

��'u 'p �

The random graph approach does not deal with speci�c biochemical or biophy�

sical features of the folding process by doing two random selections� one for the

unpaired part of a sequence and the other for the paired part A sequence folds

into s with probability �u �p However� the basic parameters have a biochemi�

cal interpretation namely to be for the �unpaired� part the expected fraction of

neutral sequences in Hamming distance one and for the �paired� part the fraction

of neutral neighbors for simultaneous base pair exchanges In other words the

parameters re�ect the stability of the structure s under point mutations in the

�unpaired region� and base pair exchanges �ie� mutations that preserve compa�

tibility� in the �paired regions�

Table �� Asymptotic values for the fractions of neutral neighbors �u and �p

� Model I Model II Alphabet RNA Structures

��d ��c ��d � ��c unpaired paired

� �� GC �� AU ��

� �� GCXK �� GCAU ��

� �� GCAU ��

� ��


The theory of randomly induced subgraphs of sequence spaces predicts that� in

the limit of long sequences� there exists a critical value �� such that a neutral

network is a dense and connected subgraph of C�s� if � �� A subgraph G of a

�nite graph H �G H� is dense in H if and only if each vertex of H is either in

G or has at least a neighbor in G Neutral networks show a typical percolation

phenomenon in sequence space Moreover for all �u� �p � there exists in the

limit of long sequences a so called giant component i e most of the sequences are

pairwise connected in C�s� Conversely� the network is �in its projections Qu��Q

p��

neither dense nor connected if its ��value is below the critical value A later and

re�ned version takes into account the di�erent connectivities for unpaired and

paired positions Table � compiles the critical connectivities for the two variants

of the random graph model as well as for sequences from di�erent alphabets The

predictions of the random graph model concerning the connectivity or percolation

problem and the density of subgraphs were checked by exhaustive ennumeration

The data shown in table � indicate in full agreement with previous studies ��

that neutral networks of structures GC�sequences are more likely to fall below the

percolation threshold than those derived from AU� or AUGC�sequences

� ��


�� Exhaustive Enumeration of Secondary Structures

��Folded Secondary Structures

��An upper bound to the number of folded structures

Of course� the number of folded secondary structures cannot exceed the number Sn

of possible secondary structures Sn can be calculated from the simple recursion

Sn � Sn�� !n��Xk�m

Sk Sn�k�� n � m! � S� � S� � �� Sm��

wherem is the minimum number of unpaired digits and which follows directly from

the grammar G� see �� In the biophysically relevant case m � � the asymptotics

of Sn are given by�

S��n �� n�� n

As mentioned above� isolated base pairs are extremely rare in folded secondary

structures Hence� a better estimate can be obtained by counting the number

of secondary structures S��n which do not contain isolated base pairs� ie� all

base pairs are contained in stems of length at least two A recursion and an

asymptotic expression for this series was recently derived �� It turns out that

S��n is signi�cantly smaller than �n�

S��n �� n�� n

Most structures counted this way still exhibit much shorter helices than average

folded structures We expect therefore that S��n still overestimates the number of

folded secondary structures

� ��


��A lower bound to the number of folded structures

A non�trivial lower bound can be obtained from estimating the size of the set

of sequences compatible to a given structure s Indeed� we know that �� of

all sequences fold into structures with at least p�n base pairs� where p� is some

constant with � p� � independent of n� see �� Therefore� �� of the �n

sequences fold into structures which have compatible sequences of size at most

jC�s�j � ��p��p�n � �n �

��

��

�p�n

where � � �� is the number of di�erent legal base pairs in the given alphabet

Table ��Lower Bounds on the Number of Folded Structures

Alphabet � � p� r � ��p�

GC � � �� AU � � �� AUGC � � �� AUGC� � � �� GCXKa � � �� ABCDEFa � � ��

� GU pairs suppressed�a These articial alphabets contain two �three� complementary base pairs with identical energyparameters�

Even if all sequences in C�s� would fold into the same structure� and if all compa�

tible sequences were disjoint� there must be at least

Sl�b�n �

�

�

��

�

�p�n

di�erent secondary structures Numerical estimates for p� can be obtained from

the data for the average number of base pairs in folded secondary structures � ��

�It was in fact shown that the opposite is true �� The compatible sets of any two secondarystructures have a non�empty intersection�

� � �


see table � The number of base pairs is concentrated around p�n for large n

Even if mean and median would di�er signi�cantly in this distribution� it would

only lead to a �minor� correction of the pre�factor �� without a�ecting the expo�

nential part of the bound Sl�b�n It is interesting to note that r � ��

��p� does

not strongly depend on the alphabet Just as the upper bound discussed in the

previous subsection� this lower bound is far from being sharp as well

��Exhaustive structure generation

Because of the intractably larger numbers of sequences over alphabets with more

than � � � letters� we restrict our numerical investigations to two letter alphabets

We choose two examples GC and AU In the latter case the data are expected

to re�ect the short�sequence e�ects much stronger because the energy parameters

of AU base pairs are such that sequences of tractable length are very likely to be

unfolded

The sequences are generated in collections of �ve million sequences and their re�

spective complements each The sequences are folded using fold�� and then

grouped according to their structures After a section is completed� all sequences

x folding into the secondary structures are appended to the UNIX �le FILE�s�

Sequences are stored in ��bit machine words� with � and � replacing G and C�

respectively This method of storage� which limits the sequence length to �� does

not restrict the approach in praxis� since a length of n � �� is already above

the limits of both the accessible CPU resources and accessible storage capacities

In fact� longest RNA molecules we have investigated have chain length n � ��

This computation required about �� days of CPU time on an IBM RISC ��

workstation with �� MB RAM� and more than �� GB of disc space

The bounds on the number of secondary structures were important for organizing

the disc storage of the data In the worst case� the upper bound given above leads

us to expect about a million di�erent secondary structures for n � �� This is by

� � �


far more than the number of �les that can be contained in a single directory of

the UNIX �le systems on our workstations In order to overcome this restriction�

the �les are arranged in a hierarchical directory structure limiting the number of

�les per subdirectory to about �� For n � �� this directory structure alone

occupies about �� MB of disc space

��Numerical Estimates From Exhaustive Enumeration

��The number of minimum free energy structures

The upper and lower bounds on the number of secondary structures� which we

have discussed in the previous sections have been based on fairly crude estimates

Much more accurate estimates for the actual number of di�erent structures that

are realized by a particular folding algorithm can be obtained by extrapolation

from exhaustive counts for short chains

Linear regressions of a logSMFEn versus n plot of the data shown in �gure � yields

the following estimates

SMFEn �� n for GC�

SMFEn �� n for AU�

The estimate is quite good for the GC alphabet� while for the AU alphabet �nite

size e�ects seems to dominate This is due to the weaker base pairs� which leave

a large fraction of the sequences unfolded in their minimum free energy structure

The exact numbers of di�erent structures can be found in table �

The abundance of open structures �of the type �� is therefore a good indicator

for the in�uence of �nite size e�ects For GC we �nd a distinct exponential

decrease in the fraction of sequences that do not form base pairs� see �gure �

� ��


0 5 10 15 20 25 30chain length n

100

101

102

103

104

105

106

|S|

0 5 10 15 20 25 30chain length n

10-5

10-4

10-3

10-2

10-1

100

Fra

ctio

n of

Ope

n S

truc

ture

s

Figure �� Numbers of structures and abundance of the open structure obtained by exhaustiveenumeration computed for GC� �� and AU� � alphabets� Numbers of structures jSjare presented as functions of the chain length �l�h�s�� The fractions of sequences thatfold in to the open structure are shown on the r�h�s�

In the AU case almost half of the sequences with a chain length of n � �� are

shorter do not form base pairs Consequently we cannot expect reliable estimates

for the asymptotics of SMFEn in this case We �nd indeed that number SMFE

n

is much smaller for both alphabets then the combinatorial estimate discussed in

sect ��

��Common structures

De�nition� A structure s is said to be common if its preimage f��s� is not

smaller than the average size of a neutral network �� that is in mathematical

notation� if

jf��s�j �Xs��S

�

jSjjf��s��j �

�n

jSj�

� ��


Table ��Common Structures and their Preimages

GC AU

n jSj rc nc jSj rc nc� � � ��

� All sequence fold into the open structure�

Furthermore� let rc denote the rank of the rarest common structure� ie� the

structure with rank rc is common� but the structure with rank rc ! � is not

common

The number of sequences that fold into common structures� nc is listed in Table �

Two quantities are of particular interest� the fraction rcjSj of common structu�

res within the set S of all structures that are obtained by folding� �gure ��lhs��

� ��


and the fraction nc�n of sequences that fold into common structures� see ��

gure ��rhs�

0 5 10 15 20 25 30chain length n

0.01

0.10

1.00

ρ

0 5 10 15 20 25 30chain length n

0.70

0.75

0.80

0.85

0.90

0.95

1.00

ν

Figure �� Fraction of common structures and fraction of sequences folding into common struc�tures computed for GC� �� and AU� � alphabets� The chain length dependenceof rc�jSj � � �l�h�s�� seems to decrease with increasing chain length whereas thechain length dependence of nc��n � � �r�h�s�� strongly indicates convergence towardslimn��

The fraction of common structures among all minimum free energy structures

decreases with chain length The data for the GC alphabet as consistent with

an asymptotically exponential decrease of rcjSj The data for the AU alphabet

exhibit a minimum at n � �� A trend for long sequence� however� cannot be read

o� the data In analogy to the results for GC sequences and based on the Zipf�s

law type distribution of preimage sizes we expect� that rcjSj approaches � as n

becomes large

The fraction of sequences that fold into common structures slowly increases in

the case of GC sequences( for n � �� we have nc�� ie� less than ��

� ��


of all sequences fold into non�common structures For AU we �nd a minimum

at n � �� with about �� of the sequences folding into common structures� and

a steep increase of this fraction for longer chains The data are consistent with

the conjecture that asymptotically almost all sequences will fold into common

structures

��The Distribution of Preimage Sizes

The distribution of preimage sizes has been studied previously for coarse grained

structures on sequences up to length �� by folding samples of several million

random sequences �� These studies� which access the distribution of preimage

sizes in the realm of common structures and the onset of the tail have suggested the

distribution follows a generalized Zipf�s law �� which can be parameterized

as ��

��r� � A�� ! rB�� with B � and � ��

see section ��

The exhaustive folding of the entire sequence space allows us to compare this model

distribution with the actual rank�order statistics of RNA secondary structures We

�nd qualitatively similar distributions with few frequent structures and a long tail

of very rare structures Quantitatively� the �t to the model function is rather poor

mostly because frequencies should be nearly constant at r � �� ie the model

function exhibits an even sharper transition from common to rare structures see

�gure � This may be partly a �nite size e�ect� since the transition clearly becomes

sharper with increasing chain length Already at chain length �� more than ��

of GC sequences fold into structures with r rc � �� ie the most frequent

�� of structures

� ��


100

101

102

103

104

105

Rank

10-9

10-7

10-5

10-3

10-1

Fre

quen

cy

Figure �� Distribution of preimage sizes for GC sequences length �� The dashed line is a tto the model above using A � �� B � �� The dot�dashed line is theintegral over the distribution� the vertical line marks rc � ��

�� The Sequence of Components

��Algorithmic Considerations

In modeling the sequence�structure map of RNA secondary structures on the com�

puter� we distinguish three levels of abstraction�

�� The problem level� sequences� structures� neutral networks and related noti�

ons introduced in the previous sections

�� The algorithmic level� lists� graphs and other data structures representing

the concepts of the problem level for algorithmic analysis

� ��


�� The storage level� machine words and UNIX �les for storing the data struc�

tures of the algorithmic level

The correspondence between the problem level and the algorithmic level is the

content of the formal �mathematical� representation of the underlying biophysics

The correspondence between algorithmic and storage level is established by a pair

of inverse functions that write to and read from background storage The enormous

volume of the data in this investigation mandates a careful choice of algorithms

and data structures

By de�nition� a neutral network N is a graph with node set fx j f�x� � sg and

an edge between a prir of nodes x and y if they di�er by a single character in

an unpaired position� or by a single switch of complementary characters in pai�

red positions The edges of N cannot be represented explicitly� since the space

requirements would be excessive A simple routine neighbour�x� y� can decide

whether �x� y� is an edge according to the above criterion� while N is just repre�

sented by its node set as a list of sequences Correspondingly� N is stored as a

UNIX �le of consecutive machine words Furthermore� the exhaustive generation

of the sequences and folding them into structures does not produce immediate

neighborhood information To �nd the neighbours of x in N �s� takes O� jN �s�j �

steps� and so the computational e�ort for the analysis of connected components of

N �s� is quadratic in jN �s�j Some networks contain millions of nodes and hence

this approach is infeasible

We now turn to the problem of �nding the neighbours of x � N �s�� which is

the basic primitive for determining connected components in a graph This is a

well�studied problem in computer science However� textbook algorithms usually

assume an explicit representation of edges� such that an edge �x� y� e�ectively

determines the neighbour y of x

� ��


��Tries

With N being a list of nodes� our problem is rather a string processing problem

and we can make use of standard string processing techniques ��

De�nition� A trie� is a rooted tree with edges labeled by characters� such that

all outgoing edges of a node carry di�erent labels

A Btrie of depth n is a trie over a binary alphabet where all paths to a leaf have

the same length n Edge labels are optional by the convention that a left branch

indicates �� a right branch indicates �

1

0

0

0

1

1 1

1

1

0

0

0

0

e

th

e

e

h

r

r

i

er s

i

Figure �� Two examples of tries� A trie for fhere�his�their�there g �l�h�s� and a trie for f �� g �r�h�s��

A trie represents a set of strings by its paths read from the root towards the leafs

Figure � shows two tries A Btrie of depth n can represent up to �n sequences of

length n Yet independent of its size� it is decidable in constant time �n steps at

most� whether some sequence is stored in a trie Therefore tries are ubiquitous

data structures in text processing Putting together these ideas� a neutral network

N �s� is now represented in the algorithmic model as a pair �s� t�� where t is the

Btrie representing freduce�s� x� jx � N �s�g

A Btrie representing m sequences of length n can be constructed in O�mn� steps

Although this construction is optimal� it is too expensive to be carried out each

��Trie� is an articial word between �tree� and �retrieval�� It is usually pronounced to rhymewith �pie��

� ��


time the data are read from the disk Moreover� tries provide a kind of pre�x

sharing for the strings they represent For example� the trie in Figure � �rhs�

represents � strings of length �� but has only �� nodes Thus further savings in

time and space arise if we compactly represent networks as tries on background

storage

The function writeBtrie recursively traverses a Btrie in preorder and emits

� for a node with two subtrees�

�� for a node with a left subtree only�

�� for a node with a right subtree only

Leafs produce no output� making use of the fact that all sequences in the Btrie

are of a known �reduced� length n� �being identical with the depth of the trie��

as indicated by the corresponding structure This establishes a linear encoding

of a Btrie t in a pre�x code� and we can easily implement the inverse function

readBtrie such that

readBtrie�n�� writeBtrie�t�� t �

For example� let t be the Btrie of Figure � �rhs� We have

writeBtrie�t� � �� and readBtrie�� t�

Table � summarizes our algorithmic and storage model of sequence structure maps

With these techniques� the total space required for storing all neutral networks is

reduced from �� GB to � MB� an overall reduction of � �

� � �


Table � Summary of Data Representations

Problem level Algorithmic level Storage levelsequence x x � f�� g�� machine word � �� bitsstructure s s � Language�G� ASCII stringcombinatory map f fold�� Algorithm C�Codeneutral network �s� t�� t Btrie representing Unix �le containing

N freduce�s� x� jx � Ng �s�writeBtrie�t��edge �x� y� in N implicit if

dH�reduce�s� x�� reduce�s� y�� Sequence of �s� t�� t�� tu� t�� with Unix �le containingcomponents ti representing Ci as Btrie �s� writeBtrie�t��

C�� C�� Cr t� Btrie of pseudo component

C� writeBtrie�tn��writeBtrie�t��

��Connected component analysis

Given two sequences x and y in N �s�� it is of special interest whether there exists

a path from x to y Equivalently we may ask whether x and y belong to the same

connected component of N �s� Computing a complete decomposition of N �s�

into a sequence of components� stored as a list of Btries� makes available an entire

solution for the whole network and further on a model describing size and quantity

of components in neutral networks

A simple decomposition algorithm starts with the extraction of an initial sequence

x from N �s� Finding and removing all neighbors of x from N �s� is the the

essential part of the computation This step is detailed below The neighbors

of x are temporarily stored in a list L while the initial sequence x can be saved

in a Btrie C representing the arising component Consecutively we take the �rst

element l� from L� compute and remove its neighbors from the decreasing N �s�

and append them to L Delete l� from L� then repeat the procedure

� � �


remove neighbors and append them to L

neighborsfind

C1C2C3

l1 l l l l l2 3 4 5 6

save in Btrie C3l5

L

(s)N

Figure � A snapshot from component analysis� Components C� and C� are completed� while

C� is under construction� C�� C�� L andN �s� are disjoint and their union constitutesthe original neutral network of s� Once the components are stored on disk they aredeleted from N �s��

Finding L empty stops the procedure with the resulting connected component

stored in the Btrie C A complete decomposition of N �s� is achieved by rest�

arting the process with new initial sequences and extracting the corresponding

components until N �s� is empty We get a sequence C�� C�� Ck of components

as a list of Btries that are stored in the usual way Space optimization without

extra work is attained by storing all components consisting of only one sequence

in one pseudo�component C� Finding and removing all sequences neighboring to

a sequence x is done by the function findrem speci�ed by the equations ��

below If findrem�d� x� t� � �t�� x�� xr�� then dH�x� xi� � d� and t� is the Btrie

obtained from t by removing x�� xr

A sequence x has at most n neighbors� and the e�ort of �nding and removing

them is O�n�� For jN �s�j � m� the complete algorithm is O�m � n�� in the worst

case For dense networks with m � �n� this approaches O�m � �log m�� making

connectivity analysis feasible

� ��


�� findrem�� Leaf� � �Empty� �� ( neighbor found and removed

�� findrem�n�w� Empty� � �Empty� � ��( no neighbors in empty Btrie

�� findrem�� aw� Node�l� r��

��red� Node�l�� r��

�map��xs� !! map�� ys� �� where

�l�� xs� � findrem�a� w� l� and

�r�� ys� � findrem�� a� w� r�

�� findrem�� w� Node�l� r�� red�Node�l�� r�� map�� xs�� where

�l�� xs� � findrem�� w� l�

�� findrem�� w� Node�l� r�� red�Node�l� r�� map�� ys� �� where

�r�� ys� � findrem�� w� r�

�� red�Node�l� r��

�Empty if l � r � Empty

Node�l� r� otherwise�� map�a� �w�� wn�� aw�� awn� for n � �

Figure � Specication of the function findrem�

��A remark on programming methodology

Implementing connectivity analysis on the computer required close cooperation

between biologists and computer scientists Such interdisciplinary work has its

own challenges In the early state� it is essential that the mutual understanding

is validated and exercised by a quick implementation of algorithmic ideas and an

exploration of alternatives In our case Miranda�� a purely functional language was

chosen to prototype the algorithmic model Due to the high level of abstraction� a

functional program for the above analysis is very short �only � pages� and clearly

re�ects the design decisions Only after our ideas had been tested and their bene�ts

on space and time requirements had been demonstrated� the implementation in C

was developed

�Miranda is a trademark of Research Software Ltd�

� ��


��A Sample Session GC��

The raw data� represented by an exhaustive list of all sequences contained in dif�

ferent �les according to their structure are generated using the command

ShapeSpace �l �� g Here �l is an option for the chain length and �g means

generation of the raw data This command creates a hierarchy of directories con�

taining one �le for each structure The entries are then sorted and the index �le

GC ��ind is created in which each line contains the rank of a structure� the size

of its preimage� the dot�bracket representation of the structure and the location

of the actual data in the directory tree

The computation of the sequence of components is performed using the command

ShapeSpace �l �� d It requires that the index �le has been created �rst The

data are written into a hierarchy of directories such that each directory contains

no more than �� entries For instance� the data belonging to the structure of

rank �� in GC�� will appear in the directory GC �� with �lename ��

For long chains this presents a formidable task The computational bottle necks

are the following�

�� CPU resources are crucial for the folding and for the sequence of components

decomposition

�� Memory is the limiting resource while the raw data are generated

�� Access to disk space becomes a problem in the case of longer sequences as

partial results have to be moved to and from scratch disks

The �� sequence of the sequence space GC�� fold into �� di�erent secondary

structures The most frequent structure is the open structure �� which

is obtained from about �� of all sequences The neutral network of the open

structure is exceptional in the case of short chains because it is connected For

longer chains it decomposes into �at least� two patches� one of which is centered

around poly�G and the other one centered poly�C

� ��


Table �� Decomposition of the Sequence Space GC��

Rank Size Sequence �u �p Sequence of

Components

� ��

��

Common structures have at least d��e � �� realizations in this example� therefore rc � ��

The most frequent structures have connected preimages consisting of only one

�single� component The neutral networks of the less frequent common structures�

� ��


however� decompose into two components The reason for this will be discussed

in the following section

��Connectivity in GC��

An analysis of the sequence of components of GC�� can be used to test the

random graph theory discussed in section �� It predicts that a neutral network

corresponding to a structure with �u �� and �p �� should form a single

connected component� at least in the limit of long sequences Since a chain length

of n � �� is far away from this limit we cannot expect to �nd a perfect picture

The data show two types of deviations from the predicted behavior�

�� Many neutral networks consist of one �or a few� large components and a few

isolated points �or very small components� The overwhelming fraction of the

sequences belongs to the larger components We can attribute the existence of the

isolated points and tiny components to �nite size e�ects

�� The splitting of neutral networks in a small number of large components of

comparable size� however� is not consistent with the prediction of the random

graph model In order to explain this behavior we need to consider structural

features in some more detail A few typical examples can be found in table �

Structures with �u and �p above the threshold values show one� two or four �almost

equally sized� components when they have none� one or two structural elements

that allow to form additional base pairs ��gure �� These elements are� for ex�

ample� stacking regions with two dangling ends� hairpin loops with �ve or more

members or su&ciently large bulges� internal loops or multi�loops Structures with

these elements are very likely to form the additional base pair whenever comple�

mentary bases are in the appropriate positions This is more likely to be avoided

�and the structure in question is more likely to be formed as minimum free energy

� ��


0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0λu

0.0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1.0λp

Figure �� Types of neutral networks in GC�� The color code indicates the �type� of neutralnetworks� depending on the values of �u and �p� grey� networks consisting of morethan � large components� and very small networks� corresponding to less than ��sequences� A component is considered as large in this plot if its size is at least �� ofthe size of the largest component� Red� one �single� component� apart from �possibly�a few tiny components� green� essentially two components� blue� three components�yellow� four components� If more than one data point falls within a square then thecolor that appears most often is shown�

structure from the corresponding compatible sequence� when there is an excess of

G or C� �)G or �)C respectively� in the base composition of the sequence �the

G�C ratio being signi�cantly di�erent from one� One structural element that

allows to form an additional base pair thus implies a bias towards more G or more

C in the average composition of the sequences just as it is found with the two

components in a two�component neutral network ��gure �� lhs� Inspection of

table � shows that precisely the same types of structural elements are responsible

for the appearence of two equally sized componenets also in the case of these very

short sequences

� ��


Table � Selected Sequences of Components in GC��

Rank Structure �u �p Sequence of Componentsy

� ��

�� j f�� j � �� in �� components�


� �� j f�� j � �� in �� components�


� A graph of the distribution of component sizes can be found in ��y Very small components are not shown in detail here� A number in square brackets gives thetotal number of sequences in them�Data are taken from reference ��

0 10 20 30C Content

0.00

0.02

0.04

0.06

0.08

0.10

0.12

Fre

quen

cy

component 1component 2

0 10 20 30C Content

0.00

0.02

0.04

0.06

0.08

Freq

uenc

y

component 1component 2component 3component 4

Figure �� Large neutral networks can decompose into a small number of components withalmost equal sizes which di�er by their G�C ratio� A neutral network with two com�ponenets as exhibited by the structure with rank �see table � is shown on the l�h�s�On the r�h�s� we present a neutral network with four components as found with thestructure of rank ��

Two independent structural elements are superimposed in the relative G�C con�

� ��


tent and thus we have

�)G* �)G � �)G* �)C � �)C* �)G and �)C* �)C

The �rst and the last combination show a net G or C bias whereas compensa�

tion brings the other two cases back into the middle of sequence space Exactly

this distribution of the four components is found by analyzing the data of the

reference sample GC�� Three components commonly represent a special case of

four components where the two central components happen to be connected

For structures with �u and �p way below the critical values we �nd many com�

ponents and a characteristically decreasing size distribution of these components�

see �gure � as predicted by random graph theory A more detailed analysis of the

relationships between the parameters �p and �u and the particular features of the

corresponding secondary structure graphs will be published elsewhere

� ��


�� Shape Space Covering

Extensive computational studies on RNA sequence structure maps �� pro�

vide strong evidence for the existence of a neighborhood �represented as a high�

dimensional ball in sequence space� with a radius much smaller than the chain

length around every random sequence such that this neighborhood contains se�

quences whose structures include almost all common shapes Full enumeration of

a sequence space allows us of course to check this conjecture� and provided it is

true� to estimate the �covering radius�

To this end we have exhaustively generated all sequences within a ball of radius

R in sequence space The covering fraction ��R� r� is de�ned as the number of all

structures with rank not exceeding r encountered within this ball divided by r�

the rank of the rarest structure under consideration

Since the shape space covering conjecture claims that all common structures can

be found within a small radius R�� the covering fraction ��R� rc� is of particular

interest The data shown in the lhs of �gure �� are obtained by averaging

��R� rc� over �� balls with randomly chosen midpoints

The ��R� r� plots can be �tted very well by an error function

��R� r� ��

�erf �a�R�R��

The parameters R� and a correspond to the threshold value and the slope at the

in�ection point� which measures the sharpness of the transition The choice of the

error�function in �gure �� is arbitrary In order to check the robustness of the

data with respect to the choice of the �tting curve we have used

+��R� �exp� a�R�R��

� ! exp� a�R� R��

as an alternative The data in table show a very good agreement of the R� values

obtained from � and +�� respectively

� � �


0 1 2 3 4 5 6 7 8 9 10Radius R

0.0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1.0

Cov

erin

g F

ract

ion

1 10 100 1000 10000 100000Rank r

0.0

1.0

2.0

3.0

4.0

5.0

6.0

7.0

8.0

Cha

ract

eris

tic R

adiu

s

R*

Figure �� Shape space covering ofGC sequences� The covering fraction �R rc� for sequenceswith di�erent chain length from n � �� left most curve� to n � �� right most curve� isshown on the r�h�s� On the l�h�s� we present the characteristic radius R� as a functionof the maximum rank up to which the structures are considered� The plot shows resultsfor chain lengths from n � �� lowest curve� through n � �� top most curve��

Table �� Shape Space Covering Radii for GC Sequences

n rc ,R �� ,R � +�� ,R� +�� ,R��

The characteristic radius R� will depend on the maximum rank r up to which

� � �


structures are considered It turns out� however� that only the most frequent

structures have a drastically reduced value of the characteristic radius R�� see the

rhs of �gure �� Since the shape space covering conjecture claims that almost

all of the rc common structures can be found within a fairly small radius we refer

to the characteristic value belonging to rc as the covering radius ,R

Table presents the covering radii ,R for di�erent chain lengths A linear regression

yields

�� n � �� from �� n � �� from +��

The slopes of both � and +� vary only slowly with n The data are consistent with

a logarithmic model

�� log�n� ! �� from �� log�n� ! �� from +��

Thus the transition becomes sharper with increasing chain length� at least if one

considers the scaled quantity Rn In other words� there is a constant �� such

that� in the limit of large chain lengths� we have complete covering if the scaled

radius Rn exceeds �� while only a negligible fraction of the structures is covered

by balls of size Rn �� The constant �� depends of course on the alphabet in

question For GC the data reported above are consistent with ��GC � ��

Alternatively� the covering radius can be estimated by measuring the minimum

distance that is necessary to �nd a given structure from a chosen starting sequence�

and averaging over the starting sequences and target structures weighted by their

preimage sizes An upper bound for this mean covering radius can be obtained

from computer experiment even for larger chain lengths ��

A lower bound for the covering radius can be estimated as the mean number of

base pairs separating an arbitrary sequences from one that is compatible with some

given structure� for a typical structure s with b base pairs ��p�b point mutations

are needed on average to go from an arbitrary sequence x to a compatible sequence

� ��


Table �� Lower and upper bounds for shape space covering radius

n AUGC GC AU

��

Data taken from ��

0 20 40 60 80 100 120Chain lenght n

0

5

10

15

20

25

30

Cov

erin

g ra

dius

from enumeration from inverse folding lower bound

Figure �� Shape space covering radii estimated by di�erent methods for GC sequence spaces�� represents the data from table �� triangles refer to data taken from table ��

x� � C�s�� where p � �� is the probability that two randomly bases can form a

base pair

The covering radii obtained from the exhaustive enumerations �Table � and the

upper bounds �� Table � are in excellent agreement Figure �� shows the chain

length dependence of the covering radii obtained by the di�erent methods The

slight di�erence between the computed slopes for the upper bound� ��

� ��


Table �� Alphabet dependence of ��

Estimate AUGC GC AU

Lower Bound �� Upper Bound �� Enumeration � ��

� Exhaustive computations are not feasible for four�letter alphabets

as compared to �� from the enumeration results may be due to �nite size e�ects

� ��


�� Conclusions

Exhaustive folding of all sequences of RNA molecules of given chain length n is

usually prohibitive because of the hyperastronomically large numbers of sequences

The only known exceptions are the secondary structures formed by short sequences

�n � �� derived from binary alphabets �GC andAU� In these cases the numbers

of sequences do not appreciably exceed �� and there are less than a quarter of

a million di�erent structures Creation of reference samples by exhaustive folding

and retrieval of the desired information� nevertheless� requires special methodo�

logies even in these examples with short chain lengths The technique of tries is

applied successfully and several issues could be readily addressed Examples of

such issues are the elaboration of a clear de�nition of common for RNA secondary

structures� the veri�cation of shape sapce covering and the sequence of components

in neutral networks of RNA secondary structures The shape space covering ra�

dius obtained by exhaustive enumeration agrees very well with a previously derived

upper bound from structure statistics

Exhaustive enumeration is shown to be a highly useful tool for the creation of exact

samples that can be used to verify results obtained from model assumptions and to

prove conjectures derived from statistical approaches It has to be stressed� howe�

ver� that it is limited to short chain lengths and to two letter alphabets Otherwise�

the numbers of sequences certainly are prohibitive for two reasons� computer time

and storage capacities On the other hand� the numbers of �minimum free energy�

RNA secondary structures is certainly not larger in case they are derived from

four letter alphabets We can expect therefore that statistical methods will be

applicable directly to derive results analogous to those of exhaustive enumeration

for GCAU sequences of small chain lengths Investigations along these lines are

under way

� ��


Acknowledgements

The work in Vienna has been supported �nancially by the Austrian Fonds zur

F�orderung der wissenschaftlichen Forschung �Projects No ��PHY and No

��MAT Support of research in Jena by the Commission of the European

Communities is gratefully acknowledged �CEC Contract Study PSS�� Our

research was also continuously supported by the Santa Fe Institute

� ��


References

�� Peter Schuster How to search for RNA structures theoretical concepts in

evolutionary biotechnology Journal of Biotechnology� ��

�� W Fontana� D A M Konings� P F Stadler� and P Schuster Statistics of

RNA secondary structures Biopolymers� ��

�� W Fontana� T Griesmacher� W Schnabl� PF Stadler� and P Schuster Sta�

tistics of landscapes based on free energies� replication and degredation rate

constants of RNA secondary structures Monatshefte der Chemie� ��

� � � �

�� Ivo L Hofacker� Walter Fontana� Peter F Stadler� Sebastian Bonhoe�er� Man�

fred Tacker� and Peter Schuster Fast folding and comparison of RNA secon�

dary structures Monatshefte f� Chemie� ��

�� Peter Schuster� Walter Fontana� Peter F Stadler� and Ivo L Hofacker From

sequences to shapes and back� A case study in RNA secondary structures

Proc�Roy�Soc��London�B� ��

�� Sewall Wright The roles of mutation� inbreeding� crossbreeeding and selection

in evolution In D F Jones� editor� Int� Proceedings of the Sixth International

Congress on Genetics� volume �� pages ��

�� M Eigen� J McCaskill� and P Schuster The molecular Quasispecies Adv�

Chem� Phys��

�� Edward D Weinberger Correlated and uncorrelated �tness landscapes and

how to tell the di�erence Biol�Cybern��

� � Walter Fontana� Peter F Stadler� Erich G Bornberg�Bauer� Thomas Gries�

macher� Ivo L Hofacker� Manfred Tacker� Pedro Tarazona� Edward D Wein�

berger� and Peter Schuster RNA folding and combinatory landscapes Phys�

Rev� E� �� March � �

� ��


�� Peter Schuster and Peter F Stadler Landscapes� Complex optimization pro�

blems and biopolymer structures Computers Chem��

�� Christian Reidys� Peter Schuster� and Peter F Stadler Generic properties

of combinatory maps neutral networks of rna secondary structures Bull�

Math� Biol�� submitted( SFI�preprint ��

�� D Sanko�� A�M Morin� and RJCedergren The evolution of �S RNA se�

condary structures Can�J�Biochem��

�� T R Cech Conserved sequences and structures of groupI introns � building

an active site for RNA catalysis � a review Gene� ��

�� Danielle A M Konings and Pauline Hogeweg Pattern analysis of RNA

secondary structure similarity and consensus of minimal�energy folding

J�Mol�Biol��

�� S�Y Le and M Zuker Common structures of the �� non�coding RNA in

enteroviruses and rhinoviruses thermodynamical stability and statistical si�

gni�cance J�Mol�Biol��

�� M Zuker and D Sanko� RNA secondary structures and their prediction

Bulletin of Mathematical Biology� ��

�� M Zuker The use of dynamic programming algorithms in RNA secondary

structure prediction In Michael S Waterman� editor� Mathematical Methods

for DNA Sequences� pages �� CRC Press� �

�� H M Martinez An RNA folding rule Nucl�Acid�Res��

�� Manfred Tacker Robust Properties of RNA Secondary Structures PhD thesis�

University of Vienna� � �

�� John S McCaskill The equilibrium partition function and base pair binding

probabilities for RNA secondary structure Biopolymers� � ��

�� Ruth Nussinov� George Piecznik� Jerrold R Griggs� and Daniel J Kleitman

Algorithms for loop matching SIAM J� Appl� Math��

� ��


�� A A Mironov� L P Dyakonova� and A E Kister A kinetic approach to the

prediction of RNA secondary structures Journal of Biomolecular Structure

and Dynamics� ��

�� A A Mironov and A E Kister RNA secondary structure formation during

transcription J� of Biomolecular Structure and Dynamics� ��

�� M Zuker mfold��

ftp��snark�wustl�edu�pub�mfold�sgi��tar�Z �Public Domain Soft�

ware�

�� I L Hofacker� W Fontana� P F Stadler� L S Bonhoe�er� M Tacker� and

P Schuster Vienna RNA Package

ftp��ftp�itc�univie�ac�at�pub�RNA�ViennaRNA�� Public Domain

Software�

�� W Salser Globin messenger RNA sequences � analysis of base�pairing and

evolutionary implications Cold Spring Harbour Symp� Quant� Biol��

� ��

�� Susan M Freier� Ryszard Kierzek� John A Jaeger� Naoki Sugimoto� Marvin H

Caruthers� Thomas Neilson� and Douglas H Turner Improved free�energy

parameters for predictions of RNA duplex stability Proc� Natl� Acad� Sci�

USA� �� Dec � �

�� John A Jaeger� Douglas H Turner� and Michael Zuker Improved predictions

of secondary structures for RNA Proc�Natl�Acad�Sci� USA Biochemistry�

�� October �

�� Manfred Tacker� Peter F Stadler� Erich G Bornberg�Bauer� Ivo L Hofacker�

and Peter Schuster Robust properties of RNA secondary structure folding

algorithms In preparation� � �

�� Walter Fontana and Peter Schuster A computer model of evolutionary opti�

mization Biophys�Chem��

� ��


�� Danielle A M Konings Thermodynamic parameters for rna folding Private

Communication� � �

�� Pauline Hogeweg and B Hesper Energy directed folding of RNA sequences

Nucleic acids research� ��

�� Bruce A Shapiro and K Zhang Comparing multiple RNA secondary struc�

tures using tree comparisons CABIOS� ��

�� Yasubumi Sakakibara� Michael Brown� Rebecca C Underwood� I Saira Mian�

and David Haussler Stochastic context�free grammars for modeling RNA

Report� UC Santa Cruz� � �

�� Y Sakakibara� M Brown� R Hughey� IS Mian� K Sj�olander� RC Under�

wood� and D Haussler The Application of Stochastic Context�Free Gram�

mars to Folding� Aligning and Modeling Homologous RNA Sequences Report�

UC Santa Cruz� � �

�� R W Hamming Error detecting and error correcting codes Bell Syst� Tech�

J��

�� Albert Eschenmoser Hexose nucleic acids Pure and Applied Chemistry�

��

�� J Santa Lucia Jr� R Kierzek� and DH Turner E�ects of GA mismatches

on the structure and thermodynamics of RNA internal loops Biochemistry�

� ��

�� I Tinoco Jr� M Chastain� and X Chen Non�watson�crick base pairs and

rna structure Clinical chemistry� ��

�� Heinz W Pley� Kevin M Flaherty� and David B McKay Three�dimensional

structures of a hammerhead ribozyme Nature� ��

�� Heinz W Pley� Kevin M Flaherty� and David B McKay Model for an RNA

tertiary interaction from the structure of an intermolecular complex between

a GAAA tetraloop and an RNA helix Nature� ��

� � �


�� M S Waterman Secondary structure of single � stranded nucleic acids

Studies on foundations and combinatorics Advances in mathematics supple�

mentary studies Academic Press N�Y��

�� IL Hofacker� P Schuster� and PF Stadler Combinatorics of secondary

structures Submitted to SIAM J� Disc� Math��

�� GK Zipf Human Behaviour and the Principle of Least Eort Addison�

Wesley� Reading�Mass��

�� YS Chen Zipf�s law in text modeling Int� J� General Systems� ��

�� Benoit B Mandelbrot The Fractal Geometry of Nature� Freeman * Co�

New York� � �

�� AV Aho� JE Hopcroft� and JD Ullman Data Structures and Algorithms

Addison�Wesley� Reading� MA� � �

�� D E Knuth The Art of Computer Programming Volume �� Sorting and

Searching Addison�Wesley� Reading� MA� � ��

�� Walter Gr�uner Evolutionary Optimization on RNA Folding Landscapes PhD

thesis� Inst of Theoretical Chemistry� Uni Vienna� Austria� June � �

�� Ivo L Hofacker A Statistical Characterisation of the Sequence to Structure

Mapping in RNA PhD thesis� University of Vienna� � �

� � �


Table of Contents

� Introduction �

� RNA Secondary Structures �

�� RNA Folding �

�� Representation of Secondary Structures �

�� Folding as a Combinatory Map �

�� Compatibility of Sequences with Structures

�� Neutral Networks ��

�� Rearranging the set of compatible sequences ��

�� De�nition of neutral networks ��

�� Neutral networks of RNA secondary structures ��

�� Random graph theory of neutral networks ��

� Exhaustive Enumeration of Secondary Structures ��

�� Folded Secondary Structures ��

�� An upper bound to the number of folded structures ��

�� A lower bound to the number of folded structures ��

�� Exhaustive structure generation �

�� Numerical Estimates From Exhaustive Enumeration ��

�� The number of minimum free energy structures ��

�� Common structures ��

�� The Distribution of Preimage Sizes ��

� The Sequence of Components ��

�� Algorithmic Considerations ��

�� Tries ��

�� Connected component analysis �

�� A remark on programming methodology ��

�� A Sample Session� GC��

� i �


�� Connectivity in GC��

� Shape Space Covering �

� Conclusions ��

Acknowledgements ��

References ��

� ii �

Analysis of RNA Sequence Structure Maps by Exhaustive ... · Analysis of RNA Sequence Structure...

Documents

Transcript of Analysis of RNA Sequence Structure Maps by Exhaustive ... · Analysis of RNA Sequence Structure...