Analysis of RNA Sequence Structure Maps by Exhaustive Enumeration
Walter Grüner, Robert Giegerich, Dirk Strothmann, Christian Reidys, Jacqueline Weber
SFI WORKING PAPER: 1995-10-099
SFI Working Papers contain accounts of scientific work of the author(s) and do not necessarily represent the views of the Santa Fe Institute. We accept papers intended for publication in peer-reviewed journals or proceedings volumes, but not papers that have already appeared in print. Except for papers by our external faculty, papers must be based on work done at SFI, inspired by an invited visit to or collaboration at SFI, or funded by an SFI grant.
©NOTICE: This working paper is included by permission of the contributing author(s) as a means to ensure timely distribution of the scholarly and technical work on a non-commercial basis. Copyright and all rights therein are maintained by the author(s). It is understood that all persons copying this information will adhere to the terms and constraints invoked by each author's copyright. These works may be reposted only with the explicit permission of the copyright holder.
www.santafe.edu
SANTA FE INSTITUTE
Analysis of RNA Sequence Structure Maps
by Exhaustive Enumeration
By
Walter Grüner (a), Robert Giegerich (b), Dirk Strothmann (b),
Christian Reidys (a), Jacqueline Weber (a),
Ivo L. Hofacker (c), Peter F. Stadler (c,d), and Peter Schuster (a,c,d,*)
(a) Institut für Molekulare Biotechnologie
Beutenbergstraße ��� D����� Jena, Germany
(b) Technische Fakultät, Univ. Bielefeld
Postfach ������� D������ Bielefeld, Germany
(c) Institut für Theoretische Chemie, Universität Wien
Währingerstraße ��� A��� � Wien, Austria
(d) Santa Fe Institute
�� Hyde Park Rd., Santa Fe, NM ����� USA
(*) Mailing Address: Institut für Molekulare Biotechnologie
Beutenbergstraße ��� PF ��� ��� D����� Jena, Germany
Phone: ��� ������ �� ���� Fax: ��� ������ �� ����
E-Mail: pks@imb-jena.de
Grüner et al.: Analysis of RNA Sequence-Structure Maps
Abstract
Global relations between RNA sequences and secondary structures are understood as mappings from sequence space into shape space. These mappings are investigated by exhaustive folding of all GC and AU sequences with chain lengths up to ��. The technique of tries is used for economic data storage and fast retrieval of information. The computed structural data are evaluated through exhaustive enumeration and used as an exact reference for testing analytical results derived from mathematical models and sampling based on statistical methods. Several new concepts of RNA sequence to secondary structure mappings are investigated, among them the structure of neutral networks (being sets of sequences folding into the same structure), percolation of sequence space by neutral networks, and the principle of shape space covering. The data of exhaustive enumeration are compared to the analytical results of a random graph model that reveals the generic properties of sequence to structure mappings based on some base pairing logic. The differences between the numerical and the analytical results are interpreted in terms of specific biophysical properties of RNA molecules.
1. Introduction
Conventional biophysics considers sequence structure relations of biopolymers primarily aiming at a solution of the folding problem: given a sequence, which structure does it form under the specified experimental conditions, for example the minimum free energy structure formed at thermodynamic equilibrium or kinetically determined structures corresponding to the outcome of the folding process under certain conditions? Many problems in current molecular biology and biotechnology [�], however, raise questions that cannot be answered satisfactorily by this approach. Required is instead a different view that considers the set of all (possible) sequences as an entity which is mapped onto the set of all (possible) structures. Such problems are, for example, the sensitivity of structures against mutations in the underlying sequences [�], the influence of nucleotide distributions ([A], [U], [G], [C]) on structures [�, �], as well as the inverse folding problem: given an RNA secondary structure, which sequences fold into it under the specified conditions [�, �]?
The evolution of RNA molecules in replication assays, viroids and RNA viruses can be viewed as an adaptation process on a fitness landscape in the sense of Sewall Wright's imagination [�]. The dynamics of evolution is thus tightly linked to the structure of an underlying landscape. Global features of landscapes can be described by statistical measures like numbers of optima, lengths of walks, and correlation functions, see, for example, [�, �, ��]. Statistical characteristics of RNA landscapes are accessible on the level of secondary structures by mathematical analysis and computer calculations; these RNA landscapes belong to the same class as well known optimization problems and simple spin glass models [�, ��].
The notion of a landscape has been extended to combinatory maps, thereby allowing for a direct statistical investigation of the sequence structure relationships of RNA at the level of secondary structures [�, �]. Extensive computational studies have revealed that the frequencies of structures are highly non-uniform, that sequences sharing the same structure are distributed randomly over sequence space, that there exist neutral paths in sequence space along which structures remain unaffected by mutations, and that any desired secondary structure is formed by sequences that can be found close to an arbitrary initial sequence. These results provide convincing evidence that RNA landscapes are as simple as they could possibly be for evolutionary adaptation. The consequences for evolutionary optimization, the early stages of life, and molecular biotechnology are immediate.
Based on these findings a random-graph theory was developed [��] that explains the structure of neutral networks in terms of a single parameter, the frequency of neutral mutations. The predictions of this theory, among them connectedness and density of the neutral networks, cannot be verified by a statistical approach based on sampling tiny fractions of sequence space. Comparison of the results derived from the random graph approach with real RNA folding data makes it possible to separate generic properties of sequence to structure mappings from nucleotide-specific biochemical phenomena.
In this contribution we report the computational techniques that are necessary to exhaustively generate all sequences and their secondary structures for two-letter alphabets up to a chain length of n = ��. In essence, the main task is easily stated: compute the secondary structures of all sequences and then group together the sequences that fold into the same secondary structure, i.e., produce an explicit representation of all neutral networks. Once this has been done the analysis proceeds by determining the geometric structure of these networks.
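The task stated above (fold every sequence, then group sequences by structure) can be sketched in a few lines of Python. This is only an illustration under loud assumptions: `toy_fold` is a hypothetical stand-in that pairs complementary ends and has nothing to do with the energy model or with the fold() routine used for the data in this paper, and the names `neutral_sets` and `toy_fold` are not taken from the original work.

```python
from collections import defaultdict
from itertools import product
from typing import Callable, Dict, List


def toy_fold(seq: str) -> str:
    """Hypothetical stand-in for a folding routine (NOT an energy model).

    Pairs the two ends of the sequence inwards while they are complementary
    (G-C or C-G) and at least three unpaired bases remain inside the pair.
    """
    pairs = {("G", "C"), ("C", "G")}
    i, j = 0, len(seq) - 1
    structure = ["."] * len(seq)
    while j - i - 1 >= 3 and (seq[i], seq[j]) in pairs:
        structure[i], structure[j] = "(", ")"
        i, j = i + 1, j - 1
    return "".join(structure)


def neutral_sets(n: int, alphabet: str,
                 fold: Callable[[str], str]) -> Dict[str, List[str]]:
    """Fold every sequence of length n and group the sequences by structure."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for letters in product(alphabet, repeat=n):
        seq = "".join(letters)
        groups[fold(seq)].append(seq)
    return dict(groups)


if __name__ == "__main__":
    groups = neutral_sets(6, "GC", toy_fold)
    # Print structures ordered by preimage size, largest first.
    for structure, seqs in sorted(groups.items(), key=lambda kv: -len(kv[1])):
        print(structure, len(seqs))
```

Any real folding routine with the same signature (sequence in, dot-bracket string out) could be passed in place of `toy_fold`; the grouping step itself does not depend on how structures are computed.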
This paper is organized as follows: In section 2 we review the folding algorithm and the data structures used to represent RNA structures. In section 3 we present data obtained directly from the exhaustive search, such as the overall number of different minimum free energy structures, the fraction of open structures, and the distribution of preimage sizes. Section 4 analyzes and discusses the internal structures of neutral networks, in particular the size distributions of their connected components. Section 5 gives a glimpse at the relative locations of neutral networks belonging to different structures. We present strong evidence for "shape space covering": almost all common structures can be found within a fairly small ball around any random sequence. Section 6 contains a discussion of the results and in particular a comparison with the random graph approach.
2. RNA Secondary Structures
2.1. RNA Folding
The biochemical and biophysical properties of RNA molecules are determined by their spatial structures. In the case of RNA the process of folding the one-dimensional primary structure (sequence) into the three-dimensional tertiary structure can be decomposed into two steps:
(1) Folding of the sequence into a secondary structure by formation of complementary Watson-Crick base pairs, G-C and A-U, and the weaker G-U pairs.
(2) Formation of the three-dimensional tertiary structure from the planar pattern.
Such a decomposition is meaningful since the intramolecular forces stabilizing the secondary structures (base pairing and base pair stacking) are much stronger than those accounting for the arrangement of the secondary structure elements in space. Thus the free energy of formation for the three-dimensional structure can be estimated by the free energy of formation for the secondary structure. The dominant role of secondary structures is also well documented in nature since secondary structure elements are conserved in evolution [��, ��, ��, ��].
Table 1: Folding Algorithms for RNA Secondary Structures
Algorithm               *    Abbr.   Remark        Reference
deterministic
  Minimum Free Energy   -    MFE     fast          [�, ��]
  Kinetic Folding       +    KIN     fast          [�]
  ����� Folding         +    ���     fast          [�]
  Partition Function    -    PF      ensemble      [�]
  Maximum Matching      -    MM      unrealistic   [�]
stochastic
  Simulated Annealing   +    SA      very slow     [�, ��]
* Pseudo-knots can be included. The major problem with the prediction of pseudo-knots is, however, the lack of sufficient experimental energy parameters.
A variety of computer programs predicting RNA secondary structures have been published. A very brief overview is given in Table 1. Two public domain packages for RNA folding are currently available by anonymous ftp: Zuker's mfold [��] and the Vienna RNA Package [��]. All these programs make use of essentially the same energy model for the formation of secondary structures. It explicitly assumes that there are no knots or pseudo-knots.
[Figure 1: schematic drawing of the loop types of an RNA secondary structure. Labeled elements: hairpin loop, stacking pair, bulge, interior loop, multiple loop, joints and free ends; the closing base pair and the interior base pairs of each loop are marked.]
Figure 1: RNA secondary structures can be decomposed into loops and external elements. The energy of a particular structure is the sum of energy contributions for all its loops.
Each secondary structure is viewed as being composed of stacked base pairs, loops, and external elements which are neither part of a stack nor of a loop. For the sake of a uniform notation two stacked base pairs can be viewed formally as a special type of loop consisting of exactly four nucleotides. Depending on the topology of the loop one distinguishes different loop types: hairpin loops have only one base pair; stacked base pairs consist of exactly two base pairs; bulges have two base pairs adjacent to each other and at least one unpaired base; interior loops have two base pairs which are not adjacent to each other; and multi-loops contain at least three base pairs, see Figure 1. The energy of a secondary structure is the sum of energy contributions of all loops in the structure. These contributions depend on the loop type, the loop size, and partly on the particular sequence of nucleotides in the loop. The individual energy parameters were determined experimentally, see e.g. [��, ��, �]. While the prediction of the folding algorithm for a particular sequence depends strongly on the details of the parameter set, such details do not affect the global features of the sequence structure relations [��].
The data used in this contribution have been produced with the fold() algorithm contained in the Vienna RNA Package [��]. It is based on a high performance implementation [��] of the Zuker-Sankoff algorithm [��]. The energy parameters used in this package are an updated version of [�] provided by Danielle Konings [��].
2.2. Representation of Secondary Structures
RNA secondary structures are commonly drawn as secondary structure graphs in the biochemical literature. Equivalently, though less intuitively, circle plots are used sometimes: therein the sequence is arranged in a circle, and base pairs are indicated as chords connecting the pairing nucleotides. The no-(pseudo-)knot condition implies that chords do not intersect in this representation. A number of equivalent representations have been developed for special applications. Secondary structures can be translated into rooted planar trees by mapping base pairs to internal nodes and unpaired nucleotides to leaves [�]. (This tree representation is equivalent to a mountain representation [��] obtained by indicating a nucleotide which is paired with a base towards the 3' end by a positive slope, a base with pairing partner towards the 5' end by a negative slope, and an unpaired base by a horizontal line segment.) Both the tree representation and the mountain representation have been used for comparing secondary structures and for computing similarity measures between different secondary structures [��, ��].
The tree and the mountain representation are equivalent to a string representation. A base paired with a partner towards the 3'-end is denoted by "(", a nucleotide pairing with a partner towards the 5'-end by ")", and "." is used for representing unpaired positions. Each secondary structure is then uniquely determined by a string of length n taken from the alphabet {"(", ")", "."}. Not all strings formed from this alphabet are valid secondary structures: biophysical constraints require that a hairpin loop contains at least three unpaired bases and that each open bracket "(" has to be matched by a closed bracket ")". It is easy to see that the set of all valid secondary structures of chain length n ≥ 1 is in fact generated by the context free grammar
\[ G:\quad S \;\to\; \texttt{"."} \;\mid\; \texttt{"."}\,S \;\mid\; S\,\texttt{"."} \;\mid\; S\,S \;\mid\; \texttt{"("}\,S\,\texttt{")"} . \]
This representation is used for I/O in the Vienna RNA Package. It is also used in this contribution. Given a string like "������ its parse according to the grammar G uniquely determines pairs of matching parentheses. Similar grammars are used for modeling tRNA structure families in [��, ��]. Relating some sequence x to some structure s, we call $x_i$, $x_j$ to be in contact or matched "("- and ")"-positions when the positions i and j are a base pair in s, i.e., they correspond to matching parentheses in s.
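How a bracket string determines its matching parentheses can be illustrated with a small stack-based parser. The following is a minimal sketch, not code from the Vienna RNA Package; the function name `pair_table` and the 0-based indexing are choices made here for illustration. The check on the number of enclosed positions reflects the constraint of at least three unpaired bases in a hairpin loop.

```python
from typing import List, Tuple


def pair_table(structure: str, min_loop: int = 3) -> List[Tuple[int, int]]:
    """Return the base pairs (i, j), 0-based with i < j, of a bracket string.

    Raises ValueError for unknown symbols, unbalanced brackets, or pairs
    enclosing fewer than `min_loop` positions (a too-short hairpin loop).
    """
    stack: List[int] = []
    pairs: List[Tuple[int, int]] = []
    for j, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(j)
        elif symbol == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at position {j}")
            i = stack.pop()  # the most recent open bracket is the partner
            if j - i - 1 < min_loop:
                raise ValueError(
                    f"pair ({i}, {j}) encloses fewer than {min_loop} positions")
            pairs.append((i, j))
        elif symbol != ".":
            raise ValueError(f"unexpected symbol {symbol!r}")
    if stack:
        raise ValueError(f"unmatched '(' at position {stack[-1]}")
    return sorted(pairs)
```

Because every non-hairpin pair encloses at least one further pair, checking the enclosed length of each pair is enough to enforce the hairpin constraint.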
2.3. Folding as a Combinatory Map
Natural RNA sequences are strings of length n over the alphabet {G, C, A, U}. The canonical distance measure between two sequences of equal length n is the Hamming distance [��], counting the positions in which two end-to-end aligned sequences differ.
The relation between sequences and structures is mediated by a folding algorithm, see above, in our case the routine fold() of the Vienna RNA Package. In the remainder of this paper we will write f(x) instead of fold(x). The biological question of sequence-structure relations translates into the mathematical question of determining properties of the mapping
\[ f : Q_\alpha^n \to Y_n , \]
where $Q_\alpha^n$ is the generalized hypercube (Hamming graph) of dimension n over an alphabet of size $\alpha$ and $Y_n$ is the shape space of all secondary structures of length n. The shape space can be viewed as a metric space by either using the trivial metric or one of the distance measures based on the tree representation or mountain representation of secondary structures.
The term combinatory map was introduced for mappings from a discrete configuration space into some metric space as a generalization of the notion of a landscape [�]. It is clear that the folding map f is not necessarily injective, since there are fewer than �^n secondary structures as compared to �^n sequences. From the structure statistics given in [�] it follows immediately that f is not onto in general: structures, as derived from the grammar G, typically contain many isolated (non-stacked) base pairs. Consequently, the preimage $f^{-1}(s)$, i.e., the set of all sequences actually folding into s, will be in general a fairly large set. The structure statistics [�] indicates, however, that only a vanishingly small fraction of all secondary structures obtained from folding contains isolated base pairs. Consequently, the image of the sequence space, $f(Q_\alpha^n)$, will in general be a proper subset of $Y_n$.
The obvious questions to ask about f are thus the following:
• How large is the image of sequence space $f(Q_\alpha^n)$?
• How large are the preimages of given structures?
• What is the distribution of preimage sizes?
• How many sequences do not form secondary structures, i.e., how large is the preimage of the open structure?
• How is $f^{-1}(s)$ embedded in sequence space?
Many of these questions have been at least partially answered in previous papers describing non-exhaustive computer simulations [�, �, �, �, ��, ��], as mentioned in the introduction. Based on these results a random graph model [��] was conceived that allows one to construct the preimage of a (given) secondary structure s as a function of a single parameter, the (average) fraction of neutral neighbors of the sequences folding into s (see the section on neutral networks below).
The random graph model reveals the generic properties of sequence to structure mappings based on base pairing. It is based merely on the existence of some definition of legal pairings that need not be the complementary pairs (preferential AA or GG pairs as observed in homo-DNA [��] would be equally acceptable). The exhaustive enumeration data presented here can be used to check the predictions of the random graph model, and they may help to detect and interpret systematic deviations of data obtained by RNA folding from this idealized reference.
2.4. Compatibility of Sequences with Structures
The mere notion of base pairing implies an a priori relation between sequences and structures which depends only on the legal base pairs in a given alphabet, but which is completely independent of the particular values of the energy parameters. For instance, the legal base pairs in the natural AUGC alphabet are AU, UA, GC, CG, GU, and UG. Note that we neglect non-standard pairings such as AA, GA, AG, or UU, which have been observed in natural RNA structures [�, ��, ��, ��].
Definition. A sequence x is compatible with the secondary structure s if $(x_i, x_j)$ is a legal base pair for all matched "("- and ")"-positions i and j [��]. Given a secondary structure s we shall denote the set of all compatible sequences by C(s).
Of course, the mere fact that x ∈ C(s) does by no means imply that f(x) = s, i.e., that x really folds into the secondary structure s. Trivially, however, we have $f^{-1}(s) \subseteq C(s)$, i.e., all sequences folding into a given secondary structure have to be compatible with it. The notion of compatible sequences is essential for any deeper understanding of sequence structure relationships in RNA.
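The compatibility test in the definition above is purely combinatorial and needs no energy parameters. A minimal sketch in Python follows; the names `is_compatible` and `NATURAL_PAIRS` are hypothetical, the structure is assumed to be a well-formed bracket string, and the default pairing rules are the six legal pairs of the natural alphabet listed in the text.

```python
from typing import FrozenSet, List, Tuple

# The six legal base pairs of the natural AUGC alphabet named in the text.
NATURAL_PAIRS: FrozenSet[Tuple[str, str]] = frozenset({
    ("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G"),
})


def is_compatible(sequence: str, structure: str,
                  legal_pairs: FrozenSet[Tuple[str, str]] = NATURAL_PAIRS) -> bool:
    """True if every matched bracket pair of `structure` is a legal base pair.

    Assumes `structure` is a well-formed bracket string of the same length.
    """
    if len(sequence) != len(structure):
        raise ValueError("sequence and structure must have equal length")
    stack: List[int] = []
    for j, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(j)
        elif symbol == ")":
            i = stack.pop()
            if (sequence[i], sequence[j]) not in legal_pairs:
                return False
    return True
```

Passing a restricted set such as `frozenset({("G", "C"), ("C", "G")})` gives the corresponding test for the GC alphabet. Compatibility is only a necessary condition: a compatible sequence need not fold into the structure.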
2.5. Neutral Networks
2.5.1. Rearranging the set of compatible sequences
Since our main interest is the representation of $f^{-1}(s)$ in sequence space, let us first consider the geometry of the sets C(s) of compatible sequences in the sequence space $Q_\alpha^n$. The structure of C(s) is complicated by the difference between paired and unpaired positions. C(s) does not in general give rise to a connected subgraph of sequence space, but decomposes into hyperplanes which are characterized by a particular choice of the base pairs at paired positions. "Neighboring" hyperplanes have either Hamming distance $d_H = 1$, for instance if GC is replaced by GU, or Hamming distance $d_H = 2$ if GC is replaced by any other legal pairing (AU, UA, CG, UG). Note that GC-GU-AU gives rise to three hyperplanes which are connected with each other.
Allowing for both point mutations and base pair exchanges redefines the neighborhood relations in C(s) by introducing new edges. We shall denote this graph by $\mathcal{C}(s)$. It is clearly connected. In the following we will examine its structure in detail. Given a secondary structure s, a sequence x ∈ C(s) has any one of the $\alpha$ letters at each position i that is marked "." in s, while positions j and k corresponding to matching brackets "(" and ")" are occupied by any one of the $\beta$ legal base pairs. Thus a sequence x ∈ C(s) can be represented by the nucleotides in its unpaired positions and by encoding the base pairs in terms of letters from an alphabet of size $\beta$ at the positions of the open brackets. All letters at the closed brackets can then be deleted. An example is given in Figure 2.
�������������������� �������������������� structure
CGCCGGCGGCGCCCGGGGCC AUGGGUCUCCGACAGUCCGG sequence
C���GGCCGGC A�����UCCAGUG base pairs reduced
������� ����������� unpaired bases reduced
������� ����������� reduced sequence
G�� C� A�� U��
GC�� CG� AU�� UA�� GU� UG��
Figure 2: The effect of procedure reduce.
Later in this paper we will make use of the procedure reduce performing this contraction of the sequence:
reduce�s�� �� � � � �
reduce�s�� ��x� � ��� reduce�s� x� �
reduce�s�� ��x� � ��� reduce�s� x� and
reduce�s�� ��x� � reduce�s� x� �
Given the reduced sequence and the structure, one can of course reconstruct the original sequence. In fact, it is easy to write a function expand such that
expand(s, reduce(s, x)) = x
for all s ∈ Y and all x ∈ C(s).
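The contraction and its inverse can be sketched as follows. This is an illustration only and not the encoding used in the original implementation: here a reduced sequence is a list of tokens, where an unpaired position contributes its nucleotide and an open bracket contributes the two-letter base pair, whereas the paper encodes base pairs as single letters of a pair alphabet. The names `reduce_seq` and `expand_seq` are hypothetical, and the input is assumed to be a well-formed structure with a compatible sequence.

```python
from typing import Dict, List


def reduce_seq(structure: str, sequence: str) -> List[str]:
    """Contract a compatible sequence.

    Unpaired positions keep their letter, each open bracket carries the
    whole base pair, and the letters at closed brackets are dropped.
    """
    stack: List[int] = []
    partner: Dict[int, int] = {}
    for j, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(j)
        elif symbol == ")":
            partner[stack.pop()] = j
    reduced: List[str] = []
    for i, symbol in enumerate(structure):
        if symbol == ".":
            reduced.append(sequence[i])
        elif symbol == "(":
            reduced.append(sequence[i] + sequence[partner[i]])
    return reduced


def expand_seq(structure: str, reduced: List[str]) -> str:
    """Inverse of reduce_seq for the same structure."""
    tokens = iter(reduced)
    sequence = [""] * len(structure)
    closing: List[str] = []  # letters waiting for their closed bracket
    for i, symbol in enumerate(structure):
        if symbol == ".":
            sequence[i] = next(tokens)
        elif symbol == "(":
            pair = next(tokens)
            sequence[i] = pair[0]
            closing.append(pair[1])
        else:
            sequence[i] = closing.pop()
    return "".join(sequence)
```

For a structure with u unpaired positions and p base pairs the token list has u + p entries, in line with the length of a reduced sequence discussed in the text.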
Reduced strings can be rearranged further: we first write the u unpaired positions and then the p base pairs (n = u + 2p). Thus we can interpret reduce(s, x) as an element of the direct product space $Q_\alpha^u \times Q_\beta^p$, i.e., $\mathcal{C}(s)$ is isomorphic to the product of two Hamming graphs with possibly distinct alphabets, which correspond to the unpaired and paired positions, respectively. As an immediate consequence of these considerations we note the following
Lemma. Let x, y ∈ C(s). The graphical distance of x and y in the graph $\mathcal{C}(s)$ coincides with the Hamming distance of their reduced representations:
\[ d_{\mathcal{C}}(x, y) = d_H\bigl(\mathrm{reduce}(s, x),\, \mathrm{reduce}(s, y)\bigr) . \]
In particular, there is an edge between x and y in the graph $\mathcal{C}(s)$ if and only if
\[ d_H\bigl(\mathrm{reduce}(s, x),\, \mathrm{reduce}(s, y)\bigr) = 1 . \]
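The lemma can be made concrete without building reduced strings explicitly: the distance in the graph of compatible sequences counts the unpaired positions and the base pairs at which two compatible sequences differ. The sketch below assumes a well-formed bracket string and two sequences compatible with it; the function name `graph_distance` is a hypothetical choice.

```python
from typing import List


def graph_distance(structure: str, x: str, y: str) -> int:
    """Distance between two compatible sequences x and y.

    A differing unpaired position counts once, and a differing base pair
    counts once even if both of its nucleotides differ.
    """
    stack: List[int] = []
    distance = 0
    for j, symbol in enumerate(structure):
        if symbol == ".":
            distance += x[j] != y[j]
        elif symbol == "(":
            stack.append(j)
        else:
            i = stack.pop()
            distance += (x[i], x[j]) != (y[i], y[j])
    return distance
```

For the structure "((...))." the sequences GCAAAGCU and GUAAAACU differ in two nucleotides but in only one base pair, so they are neighbors in this graph although their Hamming distance in sequence space is two.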
Using reduced sequences provides several advantages:
(1) The storage requirements are reduced. We can represent the preimage $f^{-1}(s)$ now as the pair (s, X), where X is the list of all reduced sequences folding into s. Note that the length of a reduced sequence is
\[ |\mathrm{reduce}(s, x)| = u + p = n - p . \]
In the case of the GC alphabet this reduces the memory requirements by about ���
(2) Comparison of two sequences is more efficient, again because the representation is shorter.
(3) The distance between two sequences can be computed very efficiently since it coincides with their Hamming distance by the above lemma.
2.5.2. Definition of neutral networks
Mutations which do not affect the fitness of an organism are called neutral in biology. By the same token, the term neutral is used in the context of RNA to mean mutations which do not alter the (secondary) structure. Hence $f^{-1}(s)$ contains all the sequences which can be considered as neutral mutants of each other.
Definition. The set $f^{-1}(s)$ considered as an induced subgraph of $\mathcal{C}(s)$ is called the neutral network, N(s), of the secondary structure s.
The motivation for the term "network" will become clear later on. Defining N(s) as an induced subgraph of $\mathcal{C}(s)$, rather than as an induced subgraph of the sequence space $Q_\alpha^n$ itself, avoids the peculiarities introduced by the logic of base pairing. On the other hand, the neighborhood relation no longer coincides with the action of mutation. Hence we have traded some biophysical interpretability for technical tractability.
The most important characteristic of a neutral network N(s) is its connectivity. In order to retain as much biologically relevant information as possible we consider the unpaired and the paired part of the sequence separately and define $\lambda_u(s)$ and $\lambda_p(s)$ as the average fraction of neutral mutations in $Q_\alpha^u$ and $Q_\beta^p$, respectively.
2.5.3. Neutral networks of RNA secondary structures
Let us briefly summarize the properties of the neutral networks of RNA minimum free energy structures:
• The distribution of preimage sizes follows roughly a so-called generalized Zipf's law, i.e., their rank-order statistics follows a distribution function of the form
\[ \varphi(r) = A \left(1 + \frac{r}{B}\right)^{-\gamma} , \]
where r is the rank-order of a structure, $\varphi(r)$ its frequency, A gives the abundance of the most frequent structures, B > 0 measures the number of "frequent" structures, and $\gamma > 0$ determines the shape of the power-law tail [�].
• A neutral path is a path $\{x_0, x_1, \dots, x_\ell\}$ in N(s) such that $d_{\mathcal{C}}(x_0, x_i) = i$, i.e., a neutral path is obtained by selecting a start sequence $x_0$ folding into s and then successively choosing neutral neighbors in such a way that the (graphical) distance to the starting sequence is increased with each accepted step. The length $\ell$ of a neutral path is therefore a lower bound on the diameter of the neutral network N(s). Computer studies based on RNA folding have shown that the average values of $\ell$ are much larger than the distance between randomly chosen sequences for all common structures. A precise definition of a common structure will be given below, see sect. ��. In this sense, the neutral networks of common structures reach through all of sequence space, or more precisely, the set of compatible sequences [�, ��].
• Inverse folding provides a means of estimating the distance from a random sequence to the nearest sequence folding into a desired target structure s. One finds that sequence space is covered with fairly small balls each of which contains almost all common structures. The radius R of these balls is only slightly larger than the average distance from a random sequence to the set of compatible sequences for any common structure s. This property has been termed shape space covering in [�].
• The connectivities $\lambda_p$ and $\lambda_u$ become constant for large n when averaged over large samples of randomly chosen sequences [��], i.e., RNA sequence-structure maps are characterized by a very high degree of neutrality.
2.5.4. Random graph theory of neutral networks
The findings from the computational studies as listed above prompted us to search for a generic statistical model (with as few parameters as possible) that could explain the data. As a first step we have recently proposed a random graph model explaining structures and properties of individual neutral networks [��].
As already stated, $\mathcal{C}(s)$ is the direct product of the hypercubes $Q_\alpha^u \times Q_\beta^p$, and we proceed by constructing random subgraphs in each of the graphs $Q_\alpha^u$, $Q_\beta^p$ separately. This is done by selecting each vertex of $Q_\alpha^u$, $Q_\beta^p$ with independent probabilities $\lambda_u$ and $\lambda_p$. Here "vertex in $Q_\alpha^u$" means the "unpaired" part of the sequence x and "vertex in $Q_\beta^p$" means the "paired" part of the sequence x (that is compatible with s). Thereby one obtains randomly induced subgraphs $\Gamma_u \subset Q_\alpha^u$, $\Gamma_p \subset Q_\beta^p$, and the neutral network N(s) is given by
\[ N(s) \;\stackrel{\mathrm{def}}{=}\; \Gamma_u \times \Gamma_p . \]
The random graph approach does not deal with specific biochemical or biophysical features of the folding process; it performs two random selections, one for the unpaired part of a sequence and the other for the paired part. A sequence folds into s with probability $\lambda_u \lambda_p$. However, the basic parameters have a biochemical interpretation: for the "unpaired" part it is the expected fraction of neutral sequences in Hamming distance one, and for the "paired" part it is the fraction of neutral neighbors for simultaneous base pair exchanges. In other words, the parameters reflect the stability of the structure s under point mutations in the "unpaired region" and base pair exchanges (i.e., mutations that preserve compatibility) in the "paired regions".
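The construction of a randomly induced subgraph can be tried out on small hypercubes. The sketch below handles a single factor only (one generalized hypercube with one selection probability), whereas the model described above draws the unpaired and the paired part separately and takes their product. All function names are hypothetical, vertices are tuples of integer letter codes, and the convention that an empty vertex set counts as not connected is a choice made here.

```python
import random
from itertools import product
from typing import Iterator, Set, Tuple

Vertex = Tuple[int, ...]


def neighbors(v: Vertex, alpha: int) -> Iterator[Vertex]:
    """All vertices at Hamming distance one from v."""
    for i, a in enumerate(v):
        for b in range(alpha):
            if b != a:
                yield v[:i] + (b,) + v[i + 1:]


def random_induced_subgraph(n: int, alpha: int, lam: float,
                            rng: random.Random) -> Set[Vertex]:
    """Select each vertex of the hypercube independently with probability lam."""
    return {v for v in product(range(alpha), repeat=n) if rng.random() < lam}


def is_connected(vertices: Set[Vertex], alpha: int) -> bool:
    """True if the induced subgraph on `vertices` is non-empty and connected."""
    if not vertices:
        return False
    start = next(iter(vertices))
    seen, stack = {start}, [start]
    while stack:
        v = stack.pop()
        for w in neighbors(v, alpha):
            if w in vertices and w not in seen:
                seen.add(w)
                stack.append(w)
    return len(seen) == len(vertices)


def is_dense(vertices: Set[Vertex], n: int, alpha: int) -> bool:
    """True if every hypercube vertex is selected or has a selected neighbor."""
    return all(v in vertices or any(w in vertices for w in neighbors(v, alpha))
               for v in product(range(alpha), repeat=n))
```

Repeating the draw for several values of `lam` and recording how often the result is connected and dense gives a rough empirical picture of the threshold behavior; the number of vertices grows exponentially with n, so this is only practical for short chains.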
Table 2: Asymptotic values for the fractions of neutral neighbors $\lambda_u$ and $\lambda_p$.
� Model I Model II Alphabet RNA Structures
��d ��c ��d � ��c unpaired paired
� ��� ��� � ��� GC ����� �����AU ����� ��� �
� ������ ������ ������ GCXK ���� ���� GCAU ��� �
� ������ ������ ������ GCAU �����
The theory of randomly induced subgraphs of sequence spaces predicts that, in the limit of long sequences, there exists a critical value $\lambda^*$ such that a neutral network is a dense and connected subgraph of $\mathcal{C}(s)$ if $\lambda > \lambda^*$. A subgraph G of a finite graph H is dense in H if and only if each vertex of H is either in G or has at least one neighbor in G. Neutral networks show a typical percolation phenomenon in sequence space. Moreover, for all $\lambda_u, \lambda_p > 0$ there exists in the limit of long sequences a so-called giant component, i.e., most of the sequences are pairwise connected in $\mathcal{C}(s)$. Conversely, the network is (in its projections $Q_\alpha^u$, $Q_\beta^p$) neither dense nor connected if its $\lambda$-value is below the critical value. A later and refined version takes into account the different connectivities for unpaired and paired positions. Table 2 compiles the critical connectivities for the two variants of the random graph model as well as for sequences from different alphabets. The predictions of the random graph model concerning the connectivity or percolation problem and the density of subgraphs were checked by exhaustive enumeration. The data shown in Table 2 indicate, in full agreement with previous studies [�], that neutral networks of structures of GC sequences are more likely to fall below the percolation threshold than those derived from AU or AUGC sequences.
3. Exhaustive Enumeration of Secondary Structures
3.1. Folded Secondary Structures
3.1.1. An upper bound to the number of folded structures
Of course, the number of folded secondary structures cannot exceed the number $S_n$ of possible secondary structures. $S_n$ can be calculated from the simple recursion
\[ S_n = S_{n-1} + \sum_{k=m}^{n-2} S_k \, S_{n-k-2}, \qquad n \ge m+2, \qquad S_0 = S_1 = \dots = S_{m+1} = 1, \]
where m is the minimum number of unpaired digits in a hairpin loop and which follows directly from the grammar G, see [��]. In the biophysically relevant case m = 3 the asymptotics of $S_n$ are given by:
\[ S_n \sim 1.4848 \; n^{-3/2} \, (1.8488)^n \]
As mentioned above, isolated base pairs are extremely rare in folded secondary structures. Hence, a better estimate can be obtained by counting the number of secondary structures $S_n^{(�)}$ which do not contain isolated base pairs, i.e., all base pairs are contained in stems of length at least two. A recursion and an asymptotic expression for this series were recently derived [��]. It turns out that $S_n^{(�)}$ is significantly smaller than �^n:
S���n ���� n���� �����n
Most structures counted this way still exhibit much shorter helices than average folded structures. We expect therefore that $S_n^{(�)}$ still overestimates the number of folded secondary structures.
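The recursion for the number of possible secondary structures is easy to evaluate numerically. The sketch below assumes the form S_n = S_{n-1} + sum over k from m to n-2 of S_k S_{n-k-2}, with S_0 = ... = S_{m+1} = 1, where the first term accounts for the last base being unpaired and the sum for the last base being paired with a partner that encloses k positions. The function name `count_structures` is a hypothetical choice.

```python
from typing import List


def count_structures(n_max: int, m: int = 3) -> List[int]:
    """Return [S_0, ..., S_{n_max}] for minimum hairpin loop size m.

    The counts include the open (unpaired) structure.
    """
    S = [1] * (n_max + 1)  # S_0 = ... = S_{m+1} = 1
    for n in range(m + 2, n_max + 1):
        S[n] = S[n - 1] + sum(S[k] * S[n - k - 2] for k in range(m, n - 1))
    return S


if __name__ == "__main__":
    print(count_structures(10))
```

For chain length 6 and m = 3 the recursion gives four structures, which can be checked by hand: the open chain, (....), (...). and .(...).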
������A lower bound to the number of folded structures
A non�trivial lower bound can be obtained from estimating the size of the set
of sequences compatible to a given structure s Indeed� we know that ��� of
all sequences fold into structures with at least p�n base pairs� where p� is some
constant with � p� � independent of n� see ��� Therefore� ��� of the �n
sequences fold into structures which have compatible sequences of size at most
jC�s�j � ����p��p�n � �n �
��
��
�p�n
where � � �� is the number of di�erent legal base pairs in the given alphabet
Table �: Lower Bounds on the Number of Folded Structures
Alphabet   �  �  p�     r � �����p�
GC         �  �  �����  �����
AU         �  �  �����  ����
AUGC       �  �  ��� �  ����
AUGC*      �  �  �����  �����
GCXK a     �  �  �����  �����
ABCDEF a   �  �  ����   ��� �
* GU pairs suppressed.
a These artificial alphabets contain two (three) complementary base pairs with identical energy parameters.
Even if all sequences in C(s) folded into the same structure, and even if the sets
of compatible sequences were pairwise disjoint, there must be at least
Sl�b�n �
�
�
���
�
�p�n
different secondary structures. (It was in fact shown that the opposite of disjointness
is true ����: the compatible sets of any two secondary structures have a non-empty
intersection.) Numerical estimates for p� can be obtained from
the data for the average number of base pairs in folded secondary structures � ��,
see table �. The number of base pairs is concentrated around p�n for large n.
Even if mean and median differed significantly in this distribution, this would
only lead to a (minor) correction of the pre-factor ��, without affecting the
exponential part of the bound Sl�b�n. It is interesting to note that r � ��
���p� does
not strongly depend on the alphabet. Just like the upper bound discussed in the
previous subsection, this lower bound is far from sharp.
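The counting argument behind this lower bound can be illustrated with a short sketch. It assumes structures in dot-bracket notation; the names `alphabet_size` (number of letters), `pair_count` (number of ordered legal base pairs), `p_star` and `fraction` are ours, and the fraction of sequences used in the argument is left as an explicit parameter rather than fixed.

```python
def compatible_count(structure, alphabet_size, pair_count):
    """Number of sequences compatible with a dot-bracket structure:
    every unpaired position is free, every base pair can be realised
    by any of the pair_count ordered legal pairs."""
    n_pairs = structure.count("(")
    n_unpaired = structure.count(".")
    if structure.count(")") != n_pairs or n_unpaired + 2 * n_pairs != len(structure):
        raise ValueError("not a balanced dot-bracket string")
    return alphabet_size ** n_unpaired * pair_count ** n_pairs


def structure_lower_bound(n, alphabet_size, pair_count, p_star, fraction):
    """If `fraction` of all sequences fold into structures with at least
    p_star * n base pairs, each such structure has at most
    alphabet_size**n * (pair_count / alphabet_size**2)**(p_star * n)
    compatible sequences, so at least this many structures must exist."""
    return fraction * (alphabet_size ** 2 / pair_count) ** (p_star * n)


if __name__ == "__main__":
    # GC alphabet: 2 letters, ordered legal pairs GC and CG.
    print(compatible_count("((...))", 2, 2))
    print(structure_lower_bound(20, 2, 2, p_star=0.25, fraction=0.5))
```

The values passed in the usage lines are illustrative only and are not taken from the table.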
������Exhaustive structure generation
Because of the intractably large numbers of sequences over alphabets with more
than � � � letters, we restrict our numerical investigations to two-letter alphabets.
We choose two examples, GC and AU. In the latter case the data are expected
to reflect short-sequence effects much more strongly, because the energy parameters
of AU base pairs are such that sequences of tractable length are very likely to be
unfolded.
The sequences are generated in collections of five million sequences and their
respective complements each. The sequences are folded using fold() and then
grouped according to their structures. After a section is completed, all sequences
x folding into the secondary structure s are appended to the UNIX file FILE(s).
Sequences are stored in ���bit machine words, with � and � replacing G and C,
respectively. This method of storage, which limits the sequence length to ��, does
not restrict the approach in practice, since a length of n � �� is already above
the limits of both the accessible CPU resources and the accessible storage capacities.
In fact, the longest RNA molecules we have investigated have chain length n � ��.
This computation required about ��� days of CPU time on an IBM RISC ����
workstation with ��� MB RAM, and more than �� GB of disc space.
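The storage of sequences as machine words can be sketched as follows. The assignment G to 0 and C to 1, the most-significant-bit-first order, and the function names are assumptions made for illustration; the sketch only shows that a two-letter sequence and its complement map to integers that differ by a bitwise inversion.

```python
BITS = {"G": 0, "C": 1}    # assumed assignment of letters to bits
LETTERS = "GC"             # inverse mapping: index = bit value


def pack(seq):
    """Encode a GC sequence as an integer, first letter in the highest bit."""
    word = 0
    for ch in seq:
        word = (word << 1) | BITS[ch]
    return word


def unpack(word, n):
    """Decode an integer back into a GC sequence of known length n."""
    return "".join(LETTERS[(word >> (n - 1 - i)) & 1] for i in range(n))


def complement(word, n):
    """Complementary sequence (G <-> C): invert the lowest n bits."""
    return word ^ ((1 << n) - 1)


if __name__ == "__main__":
    w = pack("GCCG")
    print(w, unpack(w, 4), unpack(complement(w, 4), 4))
```

Since the chain length is fixed for a whole sequence space, it does not need to be stored with each word.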
The bounds on the number of secondary structures were important for organizing
the disc storage of the data. In the worst case, the upper bound given above leads
us to expect about a million different secondary structures for n � ��. This is by
far more than the number of files that can be contained in a single directory of
the UNIX file systems on our workstations. In order to overcome this restriction,
the files are arranged in a hierarchical directory structure limiting the number of
files per subdirectory to about ����. For n � ��, this directory structure alone
occupies about ��� MB of disc space.
����Numerical Estimates From Exhaustive Enumeration
������The number of minimum free energy structures
The upper and lower bounds on the number of secondary structures, which we
have discussed in the previous sections, are based on fairly crude estimates.
Much more accurate estimates for the actual number of different structures that
are realized by a particular folding algorithm can be obtained by extrapolation
from exhaustive counts for short chains.
A linear regression of log S_n^MFE versus n for the data shown in figure � yields
the following estimates:
SMFEn ������ ����� � ������� �������n for GC�
SMFEn ����� � ������ ���� ���� �n for AU�
The estimate is quite good for the GC alphabet, while for the AU alphabet finite
size effects seem to dominate. This is due to the weaker base pairs, which leave
a large fraction of the sequences unfolded in their minimum free energy structure.
The exact numbers of different structures can be found in table �.
The abundance of open structures (of the type ........) is therefore a good indicator
for the influence of finite size effects. For GC we find a distinct exponential
decrease in the fraction of sequences that do not form base pairs, see figure �.
[Figure �: two panels. Left: number of structures |S| (logarithmic scale) versus chain length n. Right: fraction of open structures (logarithmic scale) versus chain length n.]
Figure �: Numbers of structures and abundance of the open structure obtained by exhaustive enumeration, computed for the GC (�) and AU (�) alphabets. Numbers of structures |S| are presented as functions of the chain length (l.h.s.). The fractions of sequences that fold into the open structure are shown on the r.h.s.
In the AU case almost half of the sequences with a chain length of n � �� or
shorter do not form base pairs. Consequently we cannot expect reliable estimates
for the asymptotics of S_n^MFE in this case. We find indeed that the number S_n^MFE
is much smaller for both alphabets than the combinatorial estimate discussed in
sect. ���.
������Common structures
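Definition: A structure s is said to be common if its preimage f^{-1}(s) is not
smaller than the average size of a neutral network ����, that is, in mathematical
notation, if
\[ |f^{-1}(s)| \;\ge\; \frac{1}{|S|} \sum_{s' \in S} |f^{-1}(s')| , \]
where the right-hand side equals the total number of sequences of chain length n
divided by |S|.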
Table ��Common Structures and their Preimages
GC AU
n jSj rc nc jSj rc nc� � � �� � � � � � ��� � � � � � ��� � � ��� �� � �� � � ��� �� �� ��� � � ��� �� �� �� ��� � � ��� � �� �� ��� � � ��� �� �� ��� �� � � ��� ��� �� � � ��� � � ��� ����� � � �� � � � � � � ��� ��� ��� � ���� ��� � ���� ��� �� �� ���� ��� �� � ��� ���� �� �� ���� �� �� � ���� ���� �� ��� �� ��� ��� �� � ��� ����� �� ��� ��� �� ��� ��� �� � �� ���� � ��� �� �� ��� �� ���� ��� � � �� ��� ���� �� �� �� ��� �� ��� �� � � �� ���� ����� ��� �� �� �� ��� ��� � � ��� � ��� ��� ���� �� � � �� �� ��� ���� ��� � � ��� ��� ��� �� ��� ��� ��� � � ��� �� �� � � ����� � � � �� ��� ���� ���� �� � � �� �� �� � �� ���� ��� ��� � � �� ���� �� ��� ��� � �� � � ��� � � ��� ��� �� ��� �� � ���� ��� ���� � � ��� ���� � �
� All sequences fold into the open structure.
Furthermore, let r_c denote the rank of the rarest common structure, i.e., the
structure with rank r_c is common, but the structure with rank r_c + 1 is not
common.
The number of sequences that fold into common structures, n_c, is listed in Table �.
Two quantities are of particular interest: the fraction r_c/|S| of common structures
within the set S of all structures that are obtained by folding, figure � (l.h.s.),
and the fraction of sequences that fold into common structures, i.e., n_c divided
by the total number of sequences, see figure � (r.h.s.).
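Given the preimage sizes of all structures obtained by folding, both quantities follow directly from the definition of a common structure. The sketch below uses hypothetical names and a made-up list of sizes purely for illustration; it returns the rank of the rarest common structure and the number of sequences folding into common structures.

```python
def common_structures(preimage_sizes):
    """preimage_sizes: one entry |f^-1(s)| per structure realised by folding.
    Returns (r_c, n_c): the rank of the rarest common structure and the
    number of sequences that fold into common structures."""
    sizes = sorted(preimage_sizes, reverse=True)   # rank 1 = most frequent
    total = sum(sizes)                             # total number of sequences
    # A structure is common if its preimage is not smaller than the average
    # total / |S|; the comparison is done in integers to avoid rounding.
    common = [size for size in sizes if size * len(sizes) >= total]
    return len(common), sum(common)


if __name__ == "__main__":
    toy_sizes = [10, 3, 2, 1]          # illustrative values, not data
    r_c, n_c = common_structures(toy_sizes)
    print(r_c / len(toy_sizes), n_c / sum(toy_sizes))
```

Because the sizes are sorted in decreasing order, the common structures always occupy ranks 1 to r_c, so their number equals r_c.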
[Figure �: two panels. Left: ρ (logarithmic scale) versus chain length n. Right: ν versus chain length n.]
Figure �: Fraction of common structures and fraction of sequences folding into common structures, computed for the GC (�) and AU (�) alphabets. The fraction of common structures, ρ = r_c/|S| (l.h.s.), seems to decrease with increasing chain length, whereas the fraction ν of sequences folding into common structures (r.h.s.) strongly indicates convergence towards lim_{n→∞} ν = 1.
The fraction of common structures among all minimum free energy structures
decreases with chain length. The data for the GC alphabet are consistent with
an asymptotically exponential decrease of r_c/|S|. The data for the AU alphabet
exhibit a minimum at n � ��. A trend for long sequences, however, cannot be read
off the data. In analogy to the results for GC sequences, and based on the Zipf's
law type distribution of preimage sizes, we expect that r_c/|S| approaches 0 as n
becomes large.
The fraction of sequences that fold into common structures slowly increases in
the case of GC sequences; for n � �� we have nc��� � �� ���, i.e., less than ��
of all sequences fold into non-common structures. For AU we find a minimum
at n � �� with about ��� of the sequences folding into common structures, and
a steep increase of this fraction for longer chains. The data are consistent with
the conjecture that asymptotically almost all sequences will fold into common
structures.
����The Distribution of Preimage Sizes
The distribution of preimage sizes has been studied previously for coarse grained
structures on sequences up to length ��� by folding samples of several million
random sequences ��� ���. These studies, which access the distribution of preimage
sizes in the realm of common structures and the onset of the tail, have suggested that the
distribution follows a generalized Zipf's law ���� ���, which can be parameterized
as ����
��r� � A�� ! rB��� with B � and � ��
see section ���
The exhaustive folding of the entire sequence space allows us to compare this model
distribution with the actual rank-order statistics of RNA secondary structures. We
find qualitatively similar distributions with few frequent structures and a long tail
of very rare structures. Quantitatively, the fit to the model function is rather poor,
mostly because frequencies should be nearly constant at r � ��, i.e., the model
function exhibits an even sharper transition from common to rare structures, see
figure �. This may be partly a finite size effect, since the transition clearly becomes
sharper with increasing chain length. Already at chain length �� more than ��
of GC sequences fold into structures with r rc � ���� �, i.e., the most frequent
��� of structures.
[Figure �: log-log plot of frequency versus rank.]
Figure �: Distribution of preimage sizes for GC sequences of length ��. The dashed line is a fit to the model above using A � �������, B � ����, � � ����. The dot-dashed line is the integral over the distribution; the vertical line marks rc � �� ���.
�� The Sequence of Components
����Algorithmic Considerations
In modeling the sequence-structure map of RNA secondary structures on the
computer, we distinguish three levels of abstraction:
(1) The problem level: sequences, structures, neutral networks and related
notions introduced in the previous sections.
(2) The algorithmic level: lists, graphs and other data structures representing
the concepts of the problem level for algorithmic analysis.
(3) The storage level: machine words and UNIX files for storing the data
structures of the algorithmic level.
The correspondence between the problem level and the algorithmic level is the
content of the formal (mathematical) representation of the underlying biophysics.
The correspondence between algorithmic and storage level is established by a pair
of inverse functions that write to and read from background storage. The enormous
volume of the data in this investigation mandates a careful choice of algorithms
and data structures.
By definition, a neutral network N is a graph with node set {x | f(x) = s} and
an edge between a pair of nodes x and y if they differ by a single character in
an unpaired position, or by a single switch of complementary characters in
paired positions. The edges of N cannot be represented explicitly, since the space
requirements would be excessive. A simple routine neighbour(x, y) can decide
whether (x, y) is an edge according to the above criterion, while N is just
represented by its node set as a list of sequences. Correspondingly, N is stored as a
UNIX file of consecutive machine words. Furthermore, the exhaustive generation
of the sequences and folding them into structures does not produce immediate
neighborhood information. Finding the neighbours of x in N(s) by scanning this list
takes O(|N(s)|) steps, and so the computational effort for the analysis of connected
components of N(s) is quadratic in |N(s)|. Some networks contain millions of nodes
and hence this approach is infeasible.
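The edge criterion can be written down in a few lines. The following sketch is not the authors' routine: it takes the structure as an additional argument in dot-bracket notation, assumes Watson-Crick complementarity only, and assumes that the caller has already verified that both sequences belong to the network.

```python
COMPLEMENT = {"G": "C", "C": "G", "A": "U", "U": "A"}  # assumption: no GU pairs


def pair_table(structure):
    """Map every paired position of a dot-bracket string to its partner."""
    partner, stack = {}, []
    for i, ch in enumerate(structure):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            j = stack.pop()
            partner[i], partner[j] = j, i
    return partner


def neighbour(structure, x, y):
    """True if x and y differ by one character at an unpaired position, or by
    one switch of complementary characters at the two positions of a base pair."""
    partner = pair_table(structure)
    diff = [i for i in range(len(x)) if x[i] != y[i]]
    if len(diff) == 1:
        return diff[0] not in partner
    if len(diff) == 2:
        i, j = diff
        return (partner.get(i) == j
                and COMPLEMENT[x[i]] == x[j]
                and COMPLEMENT[y[i]] == y[j])
    return False


if __name__ == "__main__":
    s = "((...))"
    print(neighbour(s, "GGGGGCC", "GGGCGCC"))   # point mutation in the loop
    print(neighbour(s, "GGGGGCC", "CGGGGCG"))   # switch of the outer base pair
```

Each call costs O(n) steps, which is why testing all pairs of a network with this routine leads to the quadratic effort mentioned above.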
We now turn to the problem of finding the neighbours of x ∈ N(s), which is
the basic primitive for determining connected components in a graph. This is a
well-studied problem in computer science. However, textbook algorithms usually
assume an explicit representation of edges, such that an edge (x, y) effectively
determines the neighbour y of x.
������Tries
With N being a list of nodes, our problem is rather a string processing problem
and we can make use of standard string processing techniques ���� ��.
Definition: A trie� is a rooted tree with edges labeled by characters, such that
all outgoing edges of a node carry different labels.
A Btrie of depth n is a trie over a binary alphabet where all paths to a leaf have
the same length n. Edge labels are optional by the convention that a left branch
indicates �, a right branch indicates �.
[Figure �: two tries drawn as trees with labeled edges, one over letters and one over the binary alphabet.]
Figure �: Two examples of tries. A trie for {here, his, their, there} (l.h.s.) and a trie for { ���������� ����� ���� } (r.h.s.).
A trie represents a set of strings by its paths read from the root towards the leaves.
Figure � shows two tries. A Btrie of depth n can represent up to 2^n sequences of
length n. Yet, independent of the number of stored sequences, it can be decided in
n steps at most whether some sequence is stored in a trie. Therefore tries are ubiquitous
data structures in text processing. Putting together these ideas, a neutral network
N(s) is now represented in the algorithmic model as a pair (s, t), where t is the
Btrie representing {reduce(s, x) | x ∈ N(s)}.
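A Btrie and its membership test can be sketched as follows. The representation is an assumption made for illustration: an inner node is a two-element list indexed by the bit value, `None` stands for an empty subtree, and the marker `LEAF` ends a stored string; all stored strings are assumed to have the same length.

```python
LEAF = "leaf"   # marks the end of a stored string


def insert(trie, bits):
    """Return a trie that additionally contains the 0/1 string `bits`.
    `trie` may be None (empty); existing nodes are not modified."""
    if not bits:
        return LEAF
    node = list(trie) if trie is not None else [None, None]
    b = int(bits[0])
    node[b] = insert(node[b], bits[1:])
    return node


def contains(trie, bits):
    """Follow the path spelled by `bits`; at most len(bits) steps."""
    for ch in bits:
        if trie is None or trie == LEAF:
            return False
        trie = trie[int(ch)]
    return trie == LEAF


def build(strings):
    """Construct a Btrie from m strings of length n in O(m n) steps."""
    trie = None
    for s in strings:
        trie = insert(trie, s)
    return trie


if __name__ == "__main__":
    t = build(["0110", "0111", "1000"])
    print(contains(t, "0110"), contains(t, "0101"))
```

The strings "0110" and "0111" share the path for their common prefix "011", which is the prefix sharing that makes the representation compact.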
A Btrie representing m sequences of length n can be constructed in O(mn) steps.
Although this construction is optimal, it is too expensive to be carried out each
time the data are read from the disk. Moreover, tries provide a kind of prefix
sharing for the strings they represent. For example, the trie in Figure � (r.h.s.)
represents � strings of length �, but has only �� nodes. Thus further savings in
time and space arise if we compactly represent networks as tries on background
storage.
� "Trie" is an artificial word between "tree" and "retrieval". It is usually pronounced to rhyme with "pie".
The function writeBtrie recursively traverses a Btrie in preorder and emits
� for a node with two subtrees,
�� for a node with a left subtree only,
�� for a node with a right subtree only.
Leaves produce no output, making use of the fact that all sequences in the Btrie
are of a known (reduced) length n' (being identical with the depth of the trie),
as indicated by the corresponding structure. This establishes a linear encoding
of a Btrie t in a prefix code, and we can easily implement the inverse function
readBtrie such that
readBtrie(n', writeBtrie(t)) = t.
For example, let t be the Btrie of Figure � (r.h.s.). We have
writeBtrie(t) = ����������������� and readBtrie(��, �����������������) = t.
Table � summarizes our algorithmic and storage model of sequence structure maps.
With these techniques, the total space required for storing all neutral networks is
reduced from �� GB to � MB, an overall reduction of � �.
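The encoding and its inverse can be sketched as below. The code words are an assumption chosen for illustration ("1" for a node with two subtrees, "00" for a left subtree only, "01" for a right subtree only); any prefix-free choice works the same way. Tries are represented as two-element lists with `None` for an empty subtree and the marker `LEAF`, which is also our convention rather than the paper's.

```python
LEAF = "leaf"   # marks the end of a stored string


def write_btrie(trie):
    """Preorder encoding of a non-empty Btrie; leaves emit nothing."""
    if trie == LEAF:
        return ""
    left, right = trie
    if left is not None and right is not None:
        return "1" + write_btrie(left) + write_btrie(right)
    if left is not None:
        return "00" + write_btrie(left)
    return "01" + write_btrie(right)


def read_btrie(depth, code):
    """Inverse of write_btrie; the depth tells the parser where leaves are."""
    def parse(d, pos):
        if d == 0:
            return LEAF, pos
        if code[pos] == "1":                       # two subtrees
            left, pos = parse(d - 1, pos + 1)
            right, pos = parse(d - 1, pos)
            return [left, right], pos
        if code[pos + 1] == "0":                   # "00": left subtree only
            left, pos = parse(d - 1, pos + 2)
            return [left, None], pos
        right, pos = parse(d - 1, pos + 2)         # "01": right subtree only
        return [None, right], pos

    trie, end = parse(depth, 0)
    if end != len(code):
        raise ValueError("trailing symbols in encoding")
    return trie


if __name__ == "__main__":
    t = [[None, LEAF], [LEAF, LEAF]]   # Btrie for {01, 10, 11}
    code = write_btrie(t)
    print(code, read_btrie(2, code) == t)
```

Under these assumed code words a node costs one or two symbols, so densely branching tries are encoded in close to one symbol per node.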
Table �: Summary of Data Representations
Problem level | Algorithmic level | Storage level
sequence x | x ∈ {�, �}�� | machine word (�� bits)
structure s | s ∈ Language(G) | ASCII string
combinatory map f | fold() algorithm | C code
neutral network N | (s, t), t Btrie representing {reduce(s, x) : x ∈ N} | Unix file containing (s, writeBtrie(t))
edge (x, y) in N | implicit if dH(reduce(s, x), reduce(s, y)) � � |
sequence of components C�, C�, ..., Cr | (s, t�, t�, ..., tu, t�) with ti representing Ci as Btrie, t� Btrie of pseudo component C� | Unix file containing (s, writeBtrie(t�), ..., writeBtrie(tn), writeBtrie(t�))
������Connected component analysis
Given two sequences x and y in N(s), it is of special interest whether there exists
a path from x to y. Equivalently we may ask whether x and y belong to the same
connected component of N(s). Computing a complete decomposition of N(s)
into a sequence of components, stored as a list of Btries, provides a complete
solution for the whole network and, furthermore, a model describing the size and
number of components in neutral networks.
A simple decomposition algorithm starts with the extraction of an initial sequence
x from N(s). Finding and removing all neighbors of x from N(s) is the
essential part of the computation. This step is detailed below. The neighbors
of x are temporarily stored in a list L, while the initial sequence x is saved
in a Btrie C representing the arising component. Subsequently we take the first
element l1 from L, compute and remove its neighbors from the shrinking N(s),
and append them to L. We delete l1 from L, saving it in C, and then repeat the procedure.
[Figure �: schematic showing the network N(s), the list L with elements l1 to l6, and the components C1, C2, C3; labels: "find neighbors", "remove neighbors and append them to L", "save in Btrie C3".]
Figure �: A snapshot from component analysis. Components C1 and C2 are completed, while C3 is under construction. C1, ..., C3, L and N(s) are disjoint and their union constitutes the original neutral network of s. Once the components are stored on disk they are deleted from N(s).
Finding L empty stops the procedure with the resulting connected component
stored in the Btrie C. A complete decomposition of N(s) is achieved by
restarting the process with new initial sequences and extracting the corresponding
components until N(s) is empty. We get a sequence C1, C2, ..., Ck of components
as a list of Btries that are stored in the usual way. Space optimization without
extra work is attained by storing all components consisting of only one sequence
in one pseudo-component C�. Finding and removing all sequences neighboring
a sequence x is done by the function findrem specified by the equations �������
below. If findrem(d, x, t) = (t', [x1, ..., xr]), then dH(x, xi) = d, and t' is the Btrie
obtained from t by removing x1, ..., xr.
A sequence x has at most n neighbors, and the effort of finding and removing
them is O(n�). For |N(s)| = m, the complete algorithm is O(m � n�) in the worst
case. For dense networks with m � �n, this approaches O(m � (log m)�), making
connectivity analysis feasible.
��� findrem��� � �� Leaf� � �Empty� �� ���( neighbor found and removed
��� findrem�n�w� Empty� � �Empty� � ��( no neighbors in empty Btrie
��� findrem��� aw� Node�l� r��
��red� Node�l�� r��
�map��xs� !! map��� ys� �� where
�l�� xs� � findrem�a� w� l� and
�r�� ys� � findrem��� a� w� r�
��� findrem��� �w� Node�l� r�� � �red�Node�l�� r��� map��� xs��� where
�l�� xs� � findrem��� w� l�
��� findrem��� �w� Node�l� r�� � � red�Node�l� r���� map��� ys� �� where
�r�� ys� � findrem��� w� r�
��� red�Node�l� r�� �
�Empty if l � r � Empty
Node�l� r� otherwise��� map�a� �w�� � � � � wn�� � �aw�� � � � � awn� for n � �
Figure �: Specification of the function findrem.
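The decomposition procedure and the role of findrem can be illustrated by the following sketch. It is not a transcription of the specification: tries are two-element lists indexed by the bit value, `None` is the empty trie, `LEAF` ends a string, and the helper names `insert`, `pop_any` and `components` are ours. The input strings stand for the reduced binary sequences of one neutral network.

```python
from collections import deque

LEAF = "leaf"


def insert(trie, bits):
    if not bits:
        return LEAF
    node = list(trie) if trie is not None else [None, None]
    node[int(bits[0])] = insert(node[int(bits[0])], bits[1:])
    return node


def findrem(d, x, trie):
    """Remove every stored string at Hamming distance exactly d (0 or 1)
    from x. Returns (remaining_trie, removed_strings)."""
    if trie is None:
        return None, []
    if trie == LEAF:                          # x is exhausted at a leaf
        return (None, [""]) if d == 0 else (LEAF, [])
    a = int(x[0])
    children, found = [None, None], []
    for b in (0, 1):
        rest = d - (a != b)                   # distance left for the suffix
        if rest < 0:
            children[b] = trie[b]             # branch cannot contain a hit
            continue
        children[b], hits = findrem(rest, x[1:], trie[b])
        found += [str(b) + h for h in hits]
    if children == [None, None]:              # prune nodes that became empty
        return None, found
    return children, found


def pop_any(trie):
    """Extract one stored string (the leftmost path) from a non-empty trie."""
    bits, node = "", trie
    while node != LEAF:
        b = 0 if node[0] is not None else 1
        bits += str(b)
        node = node[b]
    trie, _ = findrem(0, bits, trie)
    return bits, trie


def components(strings):
    """Decompose a set of equal-length 0/1 strings into connected components
    of the Hamming-distance-1 graph."""
    trie = None
    for s in strings:
        trie = insert(trie, s)
    result = []
    while trie is not None:
        start, trie = pop_any(trie)
        component, queue = [start], deque([start])
        while queue:                          # the list L of the text
            x = queue.popleft()
            trie, hits = findrem(1, x, trie)
            component += hits
            queue.extend(hits)
        result.append(sorted(component))
    return result


if __name__ == "__main__":
    print(components(["000", "001", "011", "110"]))
```

Each sequence is removed from the trie as soon as it is found, so it is visited once, and the search for its neighbors only descends into branches that can still contain a string at the required distance.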
������A remark on programming methodology
Implementing connectivity analysis on the computer required close cooperation
between biologists and computer scientists Such interdisciplinary work has its
own challenges In the early state� it is essential that the mutual understanding
is validated and exercised by a quick implementation of algorithmic ideas and an
exploration of alternatives In our case Miranda�� a purely functional language was
chosen to prototype the algorithmic model Due to the high level of abstraction� a
functional program for the above analysis is very short �only � pages� and clearly
re�ects the design decisions Only after our ideas had been tested and their bene�ts
on space and time requirements had been demonstrated� the implementation in C
was developed
�Miranda is a trademark of Research Software Ltd�
����A Sample Session GC��
The raw data, represented by an exhaustive list of all sequences contained in
different files according to their structure, are generated using the command
ShapeSpace -l �� -g. Here -l is an option for the chain length and -g means
generation of the raw data. This command creates a hierarchy of directories
containing one file for each structure. The entries are then sorted and the index file
GC ���ind is created, in which each line contains the rank of a structure, the size
of its preimage, the dot-bracket representation of the structure and the location
of the actual data in the directory tree.
The computation of the sequence of components is performed using the command
ShapeSpace -l �� -d. It requires that the index file has been created first. The
data are written into a hierarchy of directories such that each directory contains
no more than ���� entries. For instance, the data belonging to the structure of
rank ����� in GC�� will appear in the directory GC ����� with filename ���.
For long chains this presents a formidable task. The computational bottlenecks
are the following:
(1) CPU resources are crucial for the folding and for the sequence of components
decomposition.
(2) Memory is the limiting resource while the raw data are generated.
(3) Access to disk space becomes a problem in the case of longer sequences, as
partial results have to be moved to and from scratch disks.
The �� � sequences of the sequence space GC�� fold into �� different secondary
structures. The most frequent structure is the open structure ��������������, which
is obtained from about �� of all sequences. The neutral network of the open
structure is exceptional in the case of short chains because it is connected. For
longer chains it decomposes into (at least) two patches, one of which is centered
around poly-G and the other one around poly-C.
Table �� Decomposition of the Sequence Space GC��
Rank Size Sequence �u �p Sequence of
Components
� ��� ������������ ����� ����� ���� ��� ������������ ����� ����� ���� �� ������������ ����� ����� �� � �� ������������ ����� ����� ��� �� ������������ �� � �� �� �� � ��� ������������ �� � �� �� ���� ��� ������������ ����� ���� ��� ��� ������������ �� �� ��� ��� �� ������������ ����� ���� � ��� ��� ������������ ����� ��� � ��� ��� ������������ ����� ����� ���� ��� ������������ ���� ����� ���� ��� ������������ ����� ���� �� ��
�� ��� ������������ ����� ����� �� � �� � ������������ ���� ���� �� ���� �� ������������ ��� � ���� ���� �� ������������ ����� ����� �� ��� �� ������������ ���� ����� �� ��� �� ������������ ���� ����� � ���� � ������������ ����� ���� �� ��� �� ������������ ���� ����� ���� �� ������������ ���� ����� ���� �� ������������ ����� ����� ���� �� ������������ ����� ����� ��� �� ������������ ���� ����� ���� �� ������������ ���� ����� �� ��� �� ������������ ����� ����� � � �� �� ������������ ���� ����� �� �� ������������ ���� ����� ���� � ������������ ����� ����� ��� � ������������ ����� ����� �
Common structures have at least d�������e � ��� realizations in this example� therefore rc � ���
The most frequent structures have connected preimages consisting of only one
(single) component. The neutral networks of the less frequent common structures,
however, decompose into two components. The reason for this will be discussed
in the following section.
����Connectivity in GC��
An analysis of the sequence of components of GC�� can be used to test the
random graph theory discussed in section ���. It predicts that a neutral network
corresponding to a structure with λu and λp above their threshold values should form
a single connected component, at least in the limit of long sequences. Since a chain
length of n � �� is far away from this limit, we cannot expect to find a perfect picture.
The data show two types of deviations from the predicted behavior:
(1) Many neutral networks consist of one (or a few) large components and a few
isolated points (or very small components). The overwhelming fraction of the
sequences belongs to the larger components. We can attribute the existence of the
isolated points and tiny components to finite size effects.
(2) The splitting of neutral networks into a small number of large components of
comparable size, however, is not consistent with the prediction of the random
graph model. In order to explain this behavior we need to consider structural
features in some more detail. A few typical examples can be found in table �.
Structures with λu and λp above the threshold values show one, two or four (almost
equally sized) components when they have none, one or two structural elements
that allow the formation of additional base pairs (figure ��). These elements are, for
example, stacking regions with two dangling ends, hairpin loops with five or more
members, or sufficiently large bulges, internal loops or multi-loops. Structures with
these elements are very likely to form the additional base pair whenever
complementary bases are in the appropriate positions. This is more likely to be avoided
(and the structure in question is more likely to be formed as minimum free energy
[Figure �: types of neutral networks plotted in the (λu, λp) plane; both axes run from 0.0 to 1.0.]
Figure �: Types of neutral networks in GC��. The color code indicates the "type" of neutral networks, depending on the values of λu and λp. Grey: networks consisting of more than � large components, and very small networks, corresponding to less than ���� sequences. A component is considered as large in this plot if its size is at least ��� of the size of the largest component. Red: one (single) component, apart from (possibly) a few tiny components. Green: essentially two components. Blue: three components. Yellow: four components. If more than one data point falls within a square, then the color that appears most often is shown.
structure from the corresponding compatible sequence) when there is an excess of
G or C, �)G or �)C respectively, in the base composition of the sequence (the
G/C ratio being significantly different from one). One structural element that
allows the formation of an additional base pair thus implies a bias towards more G
or more C in the average composition of the sequences, just as it is found with the two
components in a two-component neutral network (figure ��, l.h.s.). Inspection of
table � shows that precisely the same types of structural elements are responsible
for the appearance of two equally sized components also in the case of these very
short sequences.
Table � Selected Sequences of Components in GC��
Rank Structure �u �p Sequence of Componentsy
� ������������������������������ ����� ����� �������� ������������������������������ ����� �� � �������� ������������������������������ ����� �� �� ������� �� ������������������������������ ����� �� �� �� ��� �������� ������������������������������ ����� �� �� ������ �������� ������������������������������ ��� � �� �� ��� �� ��� ��� � ��
�� � ������������������������������ ����� ��� � ����� �� � ������������������������������ ���� ����� ����� ����� �� �� ����� � ���� � ������������������������������ ����� ����� �� �� �������� ������������������������������ ����� ����� ����� ���� � ��� ��� � ��� ��� �������� ������������������������������ ����� ����� j f�� j � �� � � in ��� components�
���� ������������������������������ ����� ����� j f�� j � ����� in ��� components�
� �� ������������������������������ ����� ��� � j f�� j � ����� in ��� components�
����� ������������������������������ ����� ����� j f�� j � �� � in ��� components�
� A graph of the distribution of component sizes can be found in ����.
† Very small components are not shown in detail here. A number in square brackets gives the total number of sequences in them.
Data are taken from reference ����.
[Figure ��: two histograms of frequency versus C content (0 to 30). Left: components 1 and 2. Right: components 1 to 4.]
Figure ��: Large neutral networks can decompose into a small number of components with almost equal sizes which differ by their G/C ratio. A neutral network with two components, as exhibited by the structure with rank � (see table �), is shown on the l.h.s. On the r.h.s. we present a neutral network with four components, as found with the structure of rank ���.
Two independent structural elements are superimposed in the relative G/C content,
and thus we have
�)G* �)G � �)G* �)C � �)C* �)G and �)C* �)C
The first and the last combination show a net G or C bias, whereas compensation
brings the other two cases back into the middle of sequence space. Exactly
this distribution of the four components is found by analyzing the data of the
reference sample GC��. Three components commonly represent a special case of
four components where the two central components happen to be connected.
For structures with λu and λp well below the critical values we find many
components and a characteristically decreasing size distribution of these components,
see figure �, as predicted by random graph theory. A more detailed analysis of the
relationships between the parameters λp and λu and the particular features of the
corresponding secondary structure graphs will be published elsewhere.
�� Shape Space Covering
Extensive computational studies on RNA sequence structure maps ��� �� �� provide
strong evidence for the existence of a neighborhood (represented as a
high-dimensional ball in sequence space) with a radius much smaller than the chain
length around every random sequence, such that this neighborhood contains
sequences whose structures include almost all common shapes. Full enumeration of
a sequence space allows us of course to check this conjecture and, provided it is
true, to estimate the "covering radius".
To this end we have exhaustively generated all sequences within a ball of radius
R in sequence space. The covering fraction �(R, r) is defined as the number of all
structures with rank not exceeding r encountered within this ball, divided by r,
the rank of the rarest structure under consideration.
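The definition of the covering fraction translates into a short enumeration. In the sketch below sequences are binary strings, and `rank_of` is a hypothetical callable that returns the rank of the structure a sequence folds into (in practice it would be backed by the folding results); the toy rank function in the usage lines is made up and serves only to exercise the code.

```python
from itertools import combinations


def hamming_ball(center, radius):
    """Yield every binary string within Hamming distance `radius` of center."""
    n = len(center)
    for d in range(radius + 1):
        for positions in combinations(range(n), d):
            seq = list(center)
            for i in positions:
                seq[i] = "1" if seq[i] == "0" else "0"   # flip the letter
            yield "".join(seq)


def covering_fraction(center, radius, rank_of, r):
    """Number of structures with rank <= r found inside the ball, divided by r."""
    seen = {rank_of(seq) for seq in hamming_ball(center, radius)}
    return sum(1 for rank in seen if rank <= r) / r


if __name__ == "__main__":
    # Toy stand-in for folding: the rank is 1 + number of '1' letters.
    toy_rank = lambda seq: 1 + seq.count("1")
    for R in range(5):
        print(R, covering_fraction("0000", R, toy_rank, r=4))
```

The ball of radius R around a sequence of length n over a two-letter alphabet contains the sum of the binomial coefficients C(n, d) for d from 0 to R, which limits the radii that can be enumerated exhaustively.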
Since the shape space covering conjecture claims that all common structures can
be found within a small radius R*, the covering fraction �(R, rc) is of particular
interest. The data shown in the l.h.s. of figure �� are obtained by averaging
�(R, rc) over ��� balls with randomly chosen midpoints.
The �(R, r) plots can be fitted very well by an error function
��R� r� ��
�erf �a�R�R���
The parameters R* and a correspond to the threshold value and the slope at the
inflection point, which measures the sharpness of the transition. The choice of the
error function in figure �� is arbitrary. In order to check the robustness of the
data with respect to the choice of the fitting curve we have used
+��R� �exp� a�R�R�� �
� ! exp� a�R� R�� �
as an alternative. The data in the table of covering radii show a very good agreement
of the R* values obtained from � and +�, respectively.
[Figure ��: two panels. Left: covering fraction (0.0 to 1.0) versus radius R (0 to 10). Right: characteristic radius R* (0.0 to 8.0) versus rank r (logarithmic scale, 1 to 100000).]
Figure ��: Shape space covering of GC sequences. The covering fraction �(R, rc) for sequences with different chain lengths, from n � �� (leftmost curve) to n � �� (rightmost curve), is shown on the l.h.s. On the r.h.s. we present the characteristic radius R* as a function of the maximum rank up to which the structures are considered. The plot shows results for chain lengths from n � �� (lowest curve) through n � �� (topmost curve).
Table �� Shape Space Covering Radii for GC Sequences
n rc ,R ��� ,R � +�� ��� ,R� +��� ,R��� �� ��� � ����� ����� ������� ��� ����� ���� ��� � ������ ��� �� � �� � ��� � ��� �� ���� ���� ����� ���� ����� ��� ���� ����� ���� ������ ��� ����� ����� ����� ������ ���� �� � �� � ����� ���� �� ���� ����� ���� ���� ������ �� ��� � ��� � ����� ������ ����� ���� ����� ���� ������� ���� ���� ���� ����� �����
The characteristic radius $R^*$ will depend on the maximum rank $r$ up to which
structures are considered. It turns out, however, that only the most frequent
structures have a drastically reduced value of the characteristic radius $R^*$; see the
r.h.s. of figure ��. Since the shape space covering conjecture claims that almost
all of the $r_c$ common structures can be found within a fairly small radius, we refer
to the characteristic value belonging to $r_c$ as the covering radius $\bar{R}$.
Table presents the covering radii $\bar{R}$ for different chain lengths. A linear regression
yields
������� �������n � ���� ����� � � �� from ������ � �������n � ����� ����� � � �� � from +��
The slopes of both $\varphi$ and $\tilde{\varphi}$ vary only slowly with n. The data are consistent with
a logarithmic model
������ ������ log�n� ! ������ ������ from ������� ������ log�n� ! ����� ���� � from +��
Thus the transition becomes sharper with increasing chain length, at least if one
considers the scaled quantity $R/n$. In other words, there is a constant $\rho^*$ such
that, in the limit of large chain lengths, we have complete covering if the scaled
radius $R/n$ exceeds $\rho^*$, while only a negligible fraction of the structures is covered
by balls of size $R/n < \rho^*$. The constant $\rho^*$ depends, of course, on the alphabet in
question. For GC the data reported above are consistent with $\rho^*_{GC}$ � ����� �
Alternatively, the covering radius can be estimated by measuring the minimum
distance that is necessary to find a given structure from a chosen starting sequence,
and averaging over the starting sequences and target structures, weighted by their
preimage sizes. An upper bound for this mean covering radius can be obtained
from computer experiments even for larger chain lengths ��� ���.
Table �� Lower and upper bounds for shape space covering radius
n AUGC GC AU
�� ��� ��� ����� ��� �� �� ���� ��� ����� ���� ���� ���� ���� �� ������� ���� ���� ��� ��� ���� ����
Data taken from ���
Figure ��: Shape space covering radii estimated by different methods for GC sequence spaces. � represents the data from table �� triangles refer to data taken from table �� (Plot: covering radius versus chain length n; legend: from enumeration, from inverse folding, lower bound.)
A lower bound for the covering radius can be estimated as the mean number of
point mutations separating an arbitrary sequence from one that is compatible with some
given structure: for a typical structure $s$ with $b$ base pairs, $(1 - p)\,b$ point mutations
are needed on average to go from an arbitrary sequence $x$ to a compatible sequence
$x' \in C(s)$, where p = ��� is the probability that two randomly chosen bases can form a
base pair.
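The lower-bound estimate, (1 - p) b point mutations for a structure with b base pairs, is easy to evaluate once the pairing rules are fixed. The sketch below is an illustration only: the pairing rules (GC pairs for the binary alphabet; Watson-Crick plus GU pairs for AUGC) and the value b = 10 are assumptions chosen for this example, and the function names are hypothetical.

```python
from fractions import Fraction
from itertools import product

# Pairing rules assumed for this sketch (not quoted from the paper).
GC_PAIRS = {("G", "C"), ("C", "G")}
AUGC_PAIRS = GC_PAIRS | {("A", "U"), ("U", "A"), ("G", "U"), ("U", "G")}


def pairing_probability(alphabet, pairs):
    """Probability that two independently drawn bases can form a base pair."""
    allowed = sum(1 for xy in product(alphabet, repeat=2) if xy in pairs)
    return Fraction(allowed, len(alphabet) ** 2)


def lower_bound(b, p):
    """Mean number of point mutations needed to make b base pairs compatible."""
    return (1 - p) * b


if __name__ == "__main__":
    b = 10  # hypothetical number of base pairs
    for name, alphabet, pairs in [("GC", "GC", GC_PAIRS), ("AUGC", "AUGC", AUGC_PAIRS)]:
        p = pairing_probability(alphabet, pairs)
        print(name, "p =", p, "lower bound =", lower_bound(b, p))
```

Each base pair of the target structure that the arbitrary sequence cannot form costs one mutation, which is why the bound is linear in b.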
The covering radii obtained from the exhaustive enumerations �Table � and the
upper bounds ���� �Table � are in excellent agreement. Figure �� shows the chain
length dependence of the covering radii obtained by the different methods. The
slight difference between the computed slopes for the upper bound, ���� ������
as compared to ����� from the enumeration results, may be due to finite size effects.
Table ��� Alphabet dependence of $\rho^*$
Estimate AUGC GC AU
Lower Bound ���� ���� ����
Upper Bound ����� ����� ���� ����� ��� �����
Enumeration � ����� �����
� Exhaustive computations are not feasible for four-letter alphabets.
6. Conclusions
Exhaustive folding of all sequences of RNA molecules of given chain length n is
usually prohibitive because of the hyperastronomically large numbers of sequences.
The only known exceptions are the secondary structures formed by short sequences
(n � ��) derived from binary alphabets (GC and AU). In these cases the numbers
of sequences do not appreciably exceed ��� and there are fewer than a quarter of
a million different structures. Creation of reference samples by exhaustive folding
and retrieval of the desired information, nevertheless, requires special methodologies
even in these examples with short chain lengths. The technique of tries is
applied successfully and several issues could be readily addressed. Examples of
such issues are the elaboration of a clear definition of "common" for RNA secondary
structures, the verification of shape space covering, and the sequence of components
in neutral networks of RNA secondary structures. The shape space covering radius
obtained by exhaustive enumeration agrees very well with a previously derived
upper bound from structure statistics.
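To make the role of tries concrete, the following Python sketch stores structures in dot-bracket notation in a trie and counts how many sequences fold into each of them, which is the bookkeeping needed to obtain preimage sizes. It is a generic illustration of the data structure, not a reproduction of the implementation used for the enumerations; the class and method names are hypothetical.

```python
class TrieNode:
    """One node per prefix; `count` is non-zero only where a key ends."""

    __slots__ = ("children", "count")

    def __init__(self):
        self.children = {}
        self.count = 0


class StructureTrie:
    """Trie keyed by structure strings, with one counter per stored structure."""

    def __init__(self):
        self.root = TrieNode()
        self.size = 0  # number of distinct structures stored

    def add(self, structure):
        """Record one more sequence folding into `structure`."""
        node = self.root
        for ch in structure:
            nxt = node.children.get(ch)
            if nxt is None:
                nxt = TrieNode()
                node.children[ch] = nxt
            node = nxt
        if node.count == 0:
            self.size += 1
        node.count += 1

    def preimage_size(self, structure):
        """Number of recorded sequences folding into `structure` (0 if absent)."""
        node = self.root
        for ch in structure:
            node = node.children.get(ch)
            if node is None:
                return 0
        return node.count


if __name__ == "__main__":
    trie = StructureTrie()
    for s in ["((..))", "((..))", "......"]:
        trie.add(s)
    print(trie.size, trie.preimage_size("((..))"))
```

Lookup and insertion take time proportional to the length of the structure string, independent of how many structures are stored, and structures with a common prefix share storage.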
Exhaustive enumeration is shown to be a highly useful tool for the creation of exact
samples that can be used to verify results obtained from model assumptions and to
prove conjectures derived from statistical approaches. It has to be stressed, however,
that it is limited to short chain lengths and to two-letter alphabets. Otherwise,
the numbers of sequences certainly are prohibitive for two reasons: computer time
and storage capacities. On the other hand, the number of (minimum free energy)
RNA secondary structures is certainly not larger in case they are derived from
four-letter alphabets. We can expect therefore that statistical methods will be
applicable directly to derive results analogous to those of exhaustive enumeration
for GCAU sequences of small chain lengths. Investigations along these lines are
under way.
Acknowledgements
The work in Vienna has been supported financially by the Austrian Fonds zur
Förderung der wissenschaftlichen Forschung (Projects No. ���PHY and No.
�����MAT). Support of research in Jena by the Commission of the European
Communities is gratefully acknowledged (CEC Contract Study PSS����). Our
research was also continuously supported by the Santa Fe Institute.
References
[1] Peter Schuster. How to search for RNA structures. Theoretical concepts in evolutionary biotechnology. Journal of Biotechnology, ����� ����� � �
[2] W. Fontana, D. A. M. Konings, P. F. Stadler, and P. Schuster. Statistics of RNA secondary structures. Biopolymers, ����� ������ � �
[3] W. Fontana, T. Griesmacher, W. Schnabl, P. F. Stadler, and P. Schuster. Statistics of landscapes based on free energies, replication and degradation rate constants of RNA secondary structures. Monatshefte der Chemie, ����� �� � � � �
[4] Ivo L. Hofacker, Walter Fontana, Peter F. Stadler, Sebastian Bonhoeffer, Manfred Tacker, and Peter Schuster. Fast folding and comparison of RNA secondary structures. Monatshefte f. Chemie, ������������� � �
[5] Peter Schuster, Walter Fontana, Peter F. Stadler, and Ivo L. Hofacker. From sequences to shapes and back: A case study in RNA secondary structures. Proc. Roy. Soc. (London) B, ������ ���� � �
[6] Sewall Wright. The roles of mutation, inbreeding, crossbreeding and selection in evolution. In D. F. Jones, editor, Int. Proceedings of the Sixth International Congress on Genetics, volume �, pages �������� � ��
[7] M. Eigen, J. McCaskill, and P. Schuster. The molecular quasispecies. Adv. Chem. Phys., ����� � ���� �
[8] Edward D. Weinberger. Correlated and uncorrelated fitness landscapes and how to tell the difference. Biol. Cybern., ����������� � �
[9] Walter Fontana, Peter F. Stadler, Erich G. Bornberg-Bauer, Thomas Griesmacher, Ivo L. Hofacker, Manfred Tacker, Pedro Tarazona, Edward D. Weinberger, and Peter Schuster. RNA folding and combinatory landscapes. Phys. Rev. E, ��������� � �� � March � �
[10] Peter Schuster and Peter F. Stadler. Landscapes: Complex optimization problems and biopolymer structures. Computers Chem., ��� ������ � �
[11] Christian Reidys, Peter Schuster, and Peter F. Stadler. Generic properties of combinatory maps. Neutral networks of RNA secondary structures. Bull. Math. Biol., � � submitted; SFI-preprint �������
[12] D. Sankoff, A.-M. Morin, and R. J. Cedergren. The evolution of �S RNA secondary structures. Can. J. Biochem., ����������� � �
[13] T. R. Cech. Conserved sequences and structures of group I introns - building an active site for RNA catalysis - a review. Gene, ����� ����� �
[14] Danielle A. M. Konings and Pauline Hogeweg. Pattern analysis of RNA secondary structure. Similarity and consensus of minimal-energy folding. J. Mol. Biol., ����� ������ �
[15] S.-Y. Le and M. Zuker. Common structures of the �� non-coding RNA in enteroviruses and rhinoviruses. Thermodynamical stability and statistical significance. J. Mol. Biol., ������ ����� � �
[16] M. Zuker and D. Sankoff. RNA secondary structures and their prediction. Bulletin of Mathematical Biology, ������� ������ � �
[17] M. Zuker. The use of dynamic programming algorithms in RNA secondary structure prediction. In Michael S. Waterman, editor, Mathematical Methods for DNA Sequences, pages �� ��� CRC Press, �
[18] H. M. Martinez. An RNA folding rule. Nucl. Acid. Res., ����������� � �
[19] Manfred Tacker. Robust Properties of RNA Secondary Structures. PhD thesis, University of Vienna, � �
[20] John S. McCaskill. The equilibrium partition function and base pair binding probabilities for RNA secondary structure. Biopolymers, � ��������� � � �
[21] Ruth Nussinov, George Pieczenik, Jerrold R. Griggs, and Daniel J. Kleitman. Algorithms for loop matching. SIAM J. Appl. Math., ���������� � �
[22] A. A. Mironov, L. P. Dyakonova, and A. E. Kister. A kinetic approach to the prediction of RNA secondary structures. Journal of Biomolecular Structure and Dynamics, ����� ��� � �
[23] A. A. Mironov and A. E. Kister. RNA secondary structure formation during transcription. J. of Biomolecular Structure and Dynamics, ���� � � �
[24] M. Zuker. mfold����. ftp���snark�wustl�edu�pub�mfold�sgi�����tar�Z (Public Domain Software).
[25] I. L. Hofacker, W. Fontana, P. F. Stadler, L. S. Bonhoeffer, M. Tacker, and P. Schuster. Vienna RNA Package. ftp���ftp�itc�univie�ac�at�pub�RNA�ViennaRNA���� (Public Domain Software).
[26] W. Salser. Globin messenger RNA sequences - analysis of base-pairing and evolutionary implications. Cold Spring Harbor Symp. Quant. Biol., ��� �� � ��
[27] Susan M. Freier, Ryszard Kierzek, John A. Jaeger, Naoki Sugimoto, Marvin H. Caruthers, Thomas Neilson, and Douglas H. Turner. Improved free-energy parameters for predictions of RNA duplex stability. Proc. Natl. Acad. Sci. USA, �� ���� ���� Dec � �
[28] John A. Jaeger, Douglas H. Turner, and Michael Zuker. Improved predictions of secondary structures for RNA. Proc. Natl. Acad. Sci. USA, Biochemistry, ������������ October �
[29] Manfred Tacker, Peter F. Stadler, Erich G. Bornberg-Bauer, Ivo L. Hofacker, and Peter Schuster. Robust properties of RNA secondary structure folding algorithms. In preparation, � �
[30] Walter Fontana and Peter Schuster. A computer model of evolutionary optimization. Biophys. Chem., ����������� � �
[31] Danielle A. M. Konings. Thermodynamic parameters for RNA folding. Private communication, � �
[32] Pauline Hogeweg and B. Hesper. Energy directed folding of RNA sequences. Nucleic Acids Research, ��������� � �
[33] Bruce A. Shapiro and K. Zhang. Comparing multiple RNA secondary structures using tree comparisons. CABIOS, ���� ���� � �
[34] Yasubumi Sakakibara, Michael Brown, Rebecca C. Underwood, I. Saira Mian, and David Haussler. Stochastic context-free grammars for modeling RNA. Report, UC Santa Cruz, � �
[35] Y. Sakakibara, M. Brown, R. Hughey, I. S. Mian, K. Sjölander, R. C. Underwood, and D. Haussler. The Application of Stochastic Context-Free Grammars to Folding, Aligning and Modeling Homologous RNA Sequences. Report, UC Santa Cruz, � �
[36] R. W. Hamming. Error detecting and error correcting codes. Bell Syst. Tech. J., � ��������� � ��
[37] Albert Eschenmoser. Hexose nucleic acids. Pure and Applied Chemistry, ������ ���� � �
[38] J. Santa Lucia Jr., R. Kierzek, and D. H. Turner. Effects of GA mismatches on the structure and thermodynamics of RNA internal loops. Biochemistry, � ����� � � �
[39] I. Tinoco Jr., M. Chastain, and X. Chen. Non-Watson-Crick base pairs and RNA structure. Clinical Chemistry, ������� � �
[40] Heinz W. Pley, Kevin M. Flaherty, and David B. McKay. Three-dimensional structures of a hammerhead ribozyme. Nature, ��������� � �
[41] Heinz W. Pley, Kevin M. Flaherty, and David B. McKay. Model for an RNA tertiary interaction from the structure of an intermolecular complex between a GAAA tetraloop and an RNA helix. Nature, ������������ � �
[42] M. S. Waterman. Secondary structure of single-stranded nucleic acids. Studies on foundations and combinatorics. Advances in Mathematics Supplementary Studies, Academic Press N.Y., ����� � ���� � �
[43] I. L. Hofacker, P. Schuster, and P. F. Stadler. Combinatorics of secondary structures. Submitted to SIAM J. Disc. Math., � �
[44] G. K. Zipf. Human Behaviour and the Principle of Least Effort. Addison-Wesley, Reading, Mass., � �
[45] Y. S. Chen. Zipf's law in text modeling. Int. J. General Systems, ������� �
[46] Benoit B. Mandelbrot. The Fractal Geometry of Nature. Freeman & Co., New York, � �
[47] A. V. Aho, J. E. Hopcroft, and J. D. Ullman. Data Structures and Algorithms. Addison-Wesley, Reading, MA, � �
[48] D. E. Knuth. The Art of Computer Programming, Volume �: Sorting and Searching. Addison-Wesley, Reading, MA, � ��
[49] Walter Grüner. Evolutionary Optimization on RNA Folding Landscapes. PhD thesis, Inst. of Theoretical Chemistry, Uni. Vienna, Austria, June � �
[50] Ivo L. Hofacker. A Statistical Characterisation of the Sequence to Structure Mapping in RNA. PhD thesis, University of Vienna, � �
Table of Contents
1 Introduction
2 RNA Secondary Structures
2.1 RNA Folding
2.2 Representation of Secondary Structures
2.3 Folding as a Combinatory Map
2.4 Compatibility of Sequences with Structures
2.5 Neutral Networks
2.5.1 Rearranging the set of compatible sequences
2.5.2 Definition of neutral networks
2.5.3 Neutral networks of RNA secondary structures
2.5.4 Random graph theory of neutral networks
3 Exhaustive Enumeration of Secondary Structures
3.1 Folded Secondary Structures
3.1.1 An upper bound to the number of folded structures
3.1.2 A lower bound to the number of folded structures
3.1.3 Exhaustive structure generation
3.2 Numerical Estimates From Exhaustive Enumeration
3.2.1 The number of minimum free energy structures
3.2.2 Common structures
3.3 The Distribution of Preimage Sizes
4 The Sequence of Components
4.1 Algorithmic Considerations
4.1.1 Tries
4.1.2 Connected component analysis
4.1.3 A remark on programming methodology
4.2 A Sample Session: GC��
4.3 Connectivity in GC��
5 Shape Space Covering
6 Conclusions
Acknowledgements
References