Evolution and Dynamics in Information Networks
complex.upf.es/~sergi/thesis_svalverde.pdf


UNIVERSITAT POLITÈCNICA DE CATALUNYA

DEPARTAMENT DE FÍSICA APLICADA

PH.D. THESIS

Evolution and Dynamics in Information Networks

Sergi Valverde

SUPERVISOR

Ricard V. Solé

November 30, 2005


Abstract

This thesis explores the generic features (dynamical and topological) of complex networks where information and/or transport processes play a key role. We have analysed their emergent properties, such as the existence of self-similar fluctuations near critical points or the emergence of scale-free structures in interaction networks. We have suggested the existence of some basic laws of evolution of complexity reflected in both the architecture and the observed dynamics.


Acknowledgments

This thesis is dedicated to my family: the new one and the one I have always had. A special place in these acknowledgments belongs to my parents, Blai and Estefania, and my sisters, Monica and Ester. Without them, the place and time where my hopes and dreams grew would not have existed. For many years my father listened patiently and without complaint to my strangest ideas and projects, even when the idea was impossible to understand (something that, for that matter, happens more often than is desirable). My other great ally has been my mother, who let me fill my room with equally incomprehensible graphs and diagrams. Strangely enough, my mother recognizes each of these figures with precision, which practically makes her a co-discoverer of the findings presented in this thesis. I thank Monica for long discussions about life and its ups and downs, sitting on the sofa at home. I must confess that her advice has inspired me on more than one occasion. I have watched with great pride as Ester grew up, the most lively, international and cheerful person ever known.

Besides a great deal of effort and large doses of enthusiasm, a series of fortuitous events made this thesis possible. Possibly the most relevant accident took place in September 1991, the day I attended my first class at the Faculty of Computer Science of Barcelona. That day, a very young professor of first-year physics impressed me with his wisdom and generous enthusiasm. I saw in that professor everything that a true scientist means to me. Curiously, years later that young man became my thesis supervisor, best man at my wedding and adoptive brother (not necessarily in that order). Ricard Solé is to be thanked or blamed for this thesis, which could have been read in a parallel world where the author of the thesis would be a pale programmer in a grey, metallic building. In our world, Ricard rescued the author from the clutches of programming, accompanied him through difficult and dark moments, and introduced him to research.

Nevertheless, my experience as a programmer has served me well in my research. It all began with my lifelong friends Javi, Ignasi, Raul and David. They shared with me the 8-bit era and the immortal MSX, always programming microcomputers and planning impossible projects. I am very grateful to my colleagues from university and from work at El Periódico de Catalunya, Paco, Salva and Jorge, for sharing with me discussions and critical readings on object-oriented design and programming (I wish we could meet more often). Later I spent some years fulfilling one of my dreams, the creation of video games. During this time I collected the data for our first work on software networks. I apologize to my colleagues and friends at Ubi Soft, A. Martens, A. Rodriguez, J. Andreu, Javi Capel, J. Paredes, J. Renato, Joan, Luís, Pac, Toni, Victor and Xavi, for patiently putting up with my philosophical discussions about software development. They also contributed to my work with their observations.

These theories were born to answer real-world questions and later developed and matured in the (certainly enviable) environment of the Complex Systems Lab. Here one can find excellent scientists and research colleagues (in no special order): Monty, Fede, Javier, Josep, Bernat, Andreea, Carlos, Harold, Pau, Martí, David, J. Gamarra, and J. F. Sebastian. The discussions I have had with them have helped me to improve and to consider other points of view to which I often pay less attention than I should. I hope that many good collaborations will arise between the members of this group and the author of this thesis.

But this thesis would not have been completed without the love and support of my wife. They say there is a midnight sun. I believe I saw it three years ago, during a night of revolt in Barcelona. I want your light to stay with me always, Susanna, until the end. Our great project is only just beginning: our child, our star.


Contents

1 Introduction: evolution and design
1.1 Introduction
1.2 Selection, constraints and landscapes
1.3 Natural and artificial designs

2 Complex networks: linked
2.1 Introduction
2.2 Topological patterns
2.3 Regular versus scale-free graphs
2.4 Evolution of software networks

3 Information dynamics
3.1 Introduction
3.2 Self-organized Internet traffic
3.3 Path horizon and network topology

4 Articles
4.1 Selection, Tinkering and Emergence in Complex Networks
4.2 Information Theory of Complex Networks: On Evolution and Architectural Constraints
4.3 Information Transfer and Phase Transitions in a Model of Internet Traffic
4.4 Self-organized Critical Traffic in Parallel Computer Networks
4.5 Internet’s Critical Path Horizon
4.6 Scale-Free Networks from Optimal Design
4.7 Network Motifs in Computational Graphs: A Case Study in Software Architecture
4.8 Logarithmic Growth Dynamics in Software Networks

5 Summary of main results

6 Glossary


Chapter 1

Introduction: evolution and design

1.1 Introduction

Efficient exchange of energy, matter and information is an important feature of many natural and artificial systems (Inose [1972]). The functioning of our modern society is supported by several networks, one for each of these three aspects: the power grid (energy), transport networks (matter) and communication networks (information). When looking at biological systems, similar trends are seen in their organization. Cell dynamics and stability are also based on three key types of networks. In this case transport is dominated by molecular processes taking place at the surface of the cell (due to transmembrane proteins selectively choosing which molecules enter or leave the cell). Energy is transported and processed by sets of connected proteins anchored to external and internal membranes (able to take light or chemical energy and transform it into appropriate chemical carriers). And finally, information is present at multiple levels, from cascades of signals sending messages from the cell surface to the nucleus to the information stored in the DNA.

Information is the main player in both the man-made and the natural worlds. The reasons for this relevance of information seem different, though. Biological systems not only take energy and information from an external world: they adapt to it by responding with well-defined computations. The ultimate goal of such structures is to survive and eventually reproduce. There is no equivalent of adaptation or reproduction (not to be mistaken for replication) in the physical world, since “fitness” has little relevance to it. And yet, as will be shown in this dissertation, man-made systems often share common traits with biological designs. Beyond their enormous differences, new patterns found in the topological organization of complex systems seem to indicate that generic features might pervade their global organization, beyond their characteristic features and even functional meaning. Statistical physics is here one of our most powerful guides: by looking at the generic, local features of a complex system, we hope to be able to find what types of global organization can emerge (Rodriguez-Iturbe and Rinaldo [1997]).

The work described here is a first contribution to a larger vision (Solé et al. [2002a]).


Figure 1.1: Tree-like systems are generated by different types of evolutionary rules, often linked to optimization processes. Here we show (a) a model of a tree-branched network associated to perfusion of a gas and (b) a river network resulting from an optimization process (Rodriguez-Iturbe and Rinaldo [1997]).

We want to understand the origins of complexity in both natural and artificial systems. The new, emergent field of complex networks offers an extraordinary opportunity to formulate and perhaps answer old, fundamental questions on how complexity emerges.

By providing a quantitative basis to the global and local organization of complex systems, networks become the appropriate framework in which key questions become well-defined. Our final goal is to understand the contributions of selection (natural or artificial), emergence and tinkering to the final outcome of any process resulting in a complex system and how dynamics and structure interact with each other. Many questions can be formulated in the previous context. Some of them are:

• Are complex systems in artificial designs the result of some class of optimization?

• Are constraints emerging from dynamical and structural limitations a main player in defining what is possible?

• Are tinkering and reuse an exclusive feature of biological entities or a much more general component of complexity?

• How are topology and function related to each other?

• Is functionality reflected in the architecture of interactions?

• What is the role of constraints in shaping network architecture?

• What are the underlying dynamical rules that shape the global patterns?

• Is computation a common feature that pervades the commonalities seen in the network structures that are found in natural and artificial computing systems?


• What is the fitness landscape of technology?

• Are historic events relevant at all in technological evolution?

• How is information processed in complex systems under noisy conditions?

• What is the outcome of information dynamics in systems with conflicting constraints between users and the global system dynamics?

These are questions that pervade our research agenda. Some of them are presented here in a well-defined way and a few are answered. Our exploration involved the analysis of essentially (but not exclusively) two well-known systems largely responsible for the maintenance of our modern society: the Internet and large-scale software systems. The reasons for choosing such systems are fairly obvious: both are complex technological systems, with a very complex structure and dynamics, and both allow their architecture to be quantified at multiple levels, often with a high degree of detail. The Internet has been an important subject of exploration within the statistical physics community over recent years (Pastor-Satorras and Vespignani [2004]). It is a major technological innovation largely responsible for the current information era, and its dynamics can be understood in terms of collective behaviour. But software systems are no less relevant. Actually, they pervade all technological innovations and are strongly tied to the stability and performance of all other technological systems. There is no way to properly control and manage a complex network of airports, energy systems or information-processing webs without the constant input provided by a complex software structure. Software graphs somehow contain the blueprints of the complexity of the systems they control. Understanding their topology and how it originates provides a very valuable picture of network complexity in its richest framework.

Exploring the origins of complexity in networks would be a rather difficult task without a view of complexity that explicitly considers evolutionary rules. In this context, our perspective draws not only on the powerful tools of statistical physics but also on the long tradition of evolutionary biology. Natural evolution proceeds in time through a combination of historical events (contingency) together with selective forces acting on top of strong dynamical and structural constraints. What is especially relevant in this context (see below) is that evolution proceeds through extensive re-use of the available components, and thus innovation strongly relies on such an apparently inefficient mechanism. As will be shown below, in spite of the apparently great differences between technological and biological design, common mechanisms are actually shared by both, perhaps at a very fundamental level.

1.2 Selection, constraints and landscapes

When dealing with evolving and adapting complex systems, there is a need to move beyond standard statistical physics towards a broader picture involving far-from-equilibrium dynamics. More importantly, most relevant complex systems (particularly biological ones) perform some sort of computation, defining a process of information dynamics at different scales. How evolution shaped these mechanisms of information processing, or how they are shaped by technological design, can be understood in terms of some type of optimality process. When searching (intentionally or not) for a given optimal solution, the paths that build a given structure need to follow some process of selection.

Darwin’s theory of evolution by natural selection provides a powerful explanation of how order appears in nature. In a population, a process of naturally occurring variation takes place under limited resources. Variation is present in the offspring of a given generation: the new individuals are similar to their parents but also exhibit some range of variability in their defining traits (caused by small genetic differences). In a limiting environment (where food and/or space are finite) not all individuals can survive, and thus those having a larger chance of reproducing or surviving will succeed, while the others will die. Those “less fit” individuals are thus removed from the population by the external environment.
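The ingredients listed above (heritable variation, limited resources, differential survival) can be sketched as a toy simulation. This is only an illustrative sketch, not a model used in the thesis: the function name `simulate_selection`, the single trait in [0, 1] used directly as a survival weight, and all parameter values are assumptions made for this example.

```python
import random


def simulate_selection(capacity=100, offspring_per_parent=2,
                       generations=30, mutation_sd=0.05, seed=0):
    """Toy selection loop; returns the mean trait value per generation.

    Assumption for illustration: each individual is reduced to one
    heritable trait in [0, 1], and that trait is used directly as its
    relative chance of surviving under limited resources.
    """
    rng = random.Random(seed)
    population = [rng.uniform(0.01, 1.0)
                  for _ in range(capacity * offspring_per_parent)]
    mean_trait = [sum(population) / len(population)]
    for _ in range(generations):
        # Limited resources: only `capacity` slots are available, filled
        # by sampling with probability proportional to the trait value.
        survivors = rng.choices(population, weights=population, k=capacity)
        # Variation: offspring resemble their parent, up to a small
        # Gaussian perturbation (clipped to keep the trait in range).
        population = [
            min(1.0, max(0.01, parent + rng.gauss(0.0, mutation_sd)))
            for parent in survivors
            for _ in range(offspring_per_parent)
        ]
        mean_trait.append(sum(population) / len(population))
    return mean_trait
```

Calling `simulate_selection()` returns one mean trait value per generation; under these assumptions the mean drifts upwards because low-trait individuals are less likely to fill the limited survival slots.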

Darwin’s view provides a picture of life as a great branching tree, with living species as the tips of the twigs, constantly pruned by natural selection. Such a “tree of life” reveals a hierarchical organization of evolutionary events. Species are linked together within a historical picture, with a common trunk indicating a common descent. Relations among species in such a tree are based on structural similarities, and the actual classification system (including species, genera, families, etc.) fully reflects this hierarchy. Two fundamental measures of similarity can be defined when comparing morphological traits. One is based on analogy, and involves structures performing the same function but having different evolutionary origins. An example would be the wings of birds and bats. The second is called homology. It refers to structures with a common origin (such as an arm and the wing of a bird) but having different functions. Under Darwin’s view, homology provided a marker of common origins. On the other hand, analogy is the result of convergent evolution: unrelated organisms can develop similar structures to be used for a common function (such as wings for flight).

The relevance of the two previous concepts becomes clear when thinking about the mechanisms driving the evolution of complex structures. It has been shown that many key structures found in nature are actually the result of independent events. Societies, flight, visual organs or multicellularity, for example, have been reinvented several times through the evolutionary history of completely unrelated groups. Often, the solutions found turn out to be very close even though the underlying building blocks are clearly different. A similar case can be made for technological graphs, particularly software evolution. The evolution of software designs, particularly languages, is of course a man-made process. But the similarities are compelling: an evolutionary tree of computer languages reveals a pattern of “speciation” (languages diversify through changes in their structure, adapted to new technological contexts), extinction or even processes of symbiosis.

Figure 1.2: Fitness landscapes. Left: three different types of simple, two-dimensional landscapes are shown. In this simplified picture, two traits define two continuous variables placed along the two axes. The height provides some (continuous) measure of fitness. Different numbers of peaks are present, associated with increasing amounts of ruggedness. When a single peak is involved (a), a unique optimal solution is eventually found by all paths. If the landscape becomes more complex, now involving a few peaks (b), a limited number of different peaks are reachable. When the landscape becomes very rugged (c), different initial conditions, even if close, typically end up in different peaks. Right: the result of an evolutionary experiment performed by Karl Niklas using a model of plant architecture with constraints (see text). The upper part of the plot shows five different patterns found when two specific constraints are used, whereas the lower part (involving seven examples) shows a sample of the large number of plant structures found when three constraints are considered.

One successful approach to understanding the large-scale evolution of complex systems is based on so-called fitness landscapes, an intuitive idea first introduced by Sewall Wright (who used the term adaptive landscape rather than fitness landscape). The basic idea is that single species can be characterized in terms of a string of genes defining the genotype. Strings have an associated (usually real) number. This number is the fitness of the string in terms of the phenotype it produces, and the distribution of fitness values over the space of genotypes defines the fitness landscape. Adaptation is then thought of as a process of “hill climbing” towards higher, nearest peaks. We can imagine a simple situation (figure 1.2a-c, left) where the fitness only requires the specification of two traits, whose characteristics are measurable (the X and Y axes of the plot). Imagine that these two characteristics describe the shape of the organism and that different combinations are allowed. The fitness landscape gives us an idea of how optimal these combinations are, and for a given fixed environment a number of peaks corresponding to best-fit combinations are expected to be present.
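The “hill climbing” picture on a two-trait landscape can be made concrete with a small sketch. The landscape below is a hypothetical one, built for illustration as the maximum of a few Gaussian bumps on a 21 x 21 grid of trait combinations; the names `fitness` and `hill_climb`, the grid size and the peak parameters are all assumptions, not taken from the thesis.

```python
import math


def fitness(x, y, peaks):
    """Landscape height at traits (x, y): the highest of several Gaussian bumps.

    Each peak is a tuple (centre_x, centre_y, height, width).
    """
    return max(
        h * math.exp(-((x - px) ** 2 + (y - py) ** 2) / (2.0 * w ** 2))
        for px, py, h, w in peaks
    )


def hill_climb(start, peaks, size=21):
    """Greedy adaptive walk: move to the fittest neighbour until none is better."""
    x, y = start
    while True:
        neighbours = [
            (x + dx, y + dy)
            for dx in (-1, 0, 1)
            for dy in (-1, 0, 1)
            if (dx, dy) != (0, 0)
            and 0 <= x + dx < size
            and 0 <= y + dy < size
        ]
        best = max(neighbours, key=lambda p: fitness(p[0], p[1], peaks))
        if fitness(best[0], best[1], peaks) <= fitness(x, y, peaks):
            return (x, y)  # local peak reached
        x, y = best


# Hypothetical landscapes: one smooth peak versus three separated peaks.
SINGLE_PEAK = [(10, 10, 1.0, 5.0)]
THREE_PEAKS = [(4, 4, 1.0, 2.5), (16, 16, 0.8, 2.5), (4, 16, 0.6, 2.5)]
STARTS = [(0, 0), (20, 20), (0, 20)]
```

On `SINGLE_PEAK` every walk started from `STARTS` ends at the same grid point, whereas on `THREE_PEAKS` each start is captured by its nearest peak, including the lower, sub-optimal ones. A greedy walk was chosen here for brevity; stochastic adaptive walks would behave similarly on such a simple landscape.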

Figure 1.3: (a) The evolution of major computer programming languages and (b) Darwin’s tree of evolution. Both technology and biology share common patterns of change through time.

As the landscape becomes more and more peaked (rugged), multiple sub-optimal solutions become available. Actually, theory predicts that the ruggedness of a fitness landscape strongly depends on the number of conflicting constraints operating on the system under consideration (Kauffman and Levin; Kauffman [1993]). Specifically, as the number of dependencies between different sub-parts increases (the so-called epistatic interactions), the number of peaks increases and their relative height decreases: optimal solutions are less optimal if multiple requirements have to be satisfied simultaneously. This property is strongly tied to frustration in spin systems. An important consequence of these properties is the presence of strong limitations associated with the optimization of given sub-systems: a given module in a technological system or a given property of a biological entity might have a theoretical optimum when considered in isolation, but it has to function inside a more complex system made up of multiple parts. A very important implication is the possible presence of global constraints on the overall architecture. Effectively, such constraints might strongly limit the possible set of available structures.
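The link between epistatic interactions and ruggedness can be illustrated with a minimal sketch in the spirit of Kauffman’s NK model: N binary genes, each contributing a random fitness value that depends on its own state and on K other genes. The code below is an illustrative assumption-laden sketch (function names, random dependency choice and the small N = 10 are choices made here), and it simply counts local optima by exhaustive enumeration.

```python
import itertools
import random


def nk_landscape(n, k, seed=0):
    """Return a fitness function over binary genotypes of length n.

    Gene i contributes a random value that depends on its own state and
    on the states of k other, randomly chosen genes (epistasis).
    """
    rng = random.Random(seed)
    deps = [
        [i] + rng.sample([j for j in range(n) if j != i], k)
        for i in range(n)
    ]
    tables = [
        {bits: rng.random() for bits in itertools.product((0, 1), repeat=k + 1)}
        for _ in range(n)
    ]

    def fitness(genotype):
        total = sum(
            tables[i][tuple(genotype[j] for j in deps[i])] for i in range(n)
        )
        return total / n

    return fitness


def count_local_optima(n, k, seed=0):
    """Count genotypes that are fitter than all of their one-bit mutants."""
    fitness = nk_landscape(n, k, seed)
    count = 0
    for genotype in itertools.product((0, 1), repeat=n):
        f = fitness(genotype)
        mutants = (
            genotype[:i] + (1 - genotype[i],) + genotype[i + 1:]
            for i in range(n)
        )
        if all(fitness(m) < f for m in mutants):
            count += 1
    return count
```

With `k=0` the genes contribute independently, so the landscape has a single peak; with `k = n - 1` every gene interacts with every other and `count_local_optima` reports many peaks. Exhaustive enumeration is only practical for small `n`, which is why `n=10` is suggested here.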

A particularly neat example of the applicability of fitness landscapes to evolution and the role of multiple constraints is the work of Karl Niklas on the patterns of morphological diversification in plants (Niklas [1994]; Niklas [1997]). This work involves a comparative analysis of fossil plant phenotypes with simulated forms resulting from adaptive walks on a fitness landscape. The available set of forms is called a morphospace, the domain of all conceivable phenotypes (McGhee [1999]). Each point in this (multidimensional) space has some assigned fitness. To define a realistic morphospace is not easy. Any choice will contain some number of arbitrary assumptions. A morphospace is a context-dependent structure: relative fitness depends on environmental constraints.

The phenotypes defined by Niklas’s models introduce a small number of highly relevant features, such as mechanical properties or the ability to capture light and nutrients. These properties lead to epistatic contributions to fitness: mechanical constraints, for example, will alter the pattern of ramification and thus the capacity to gather light. As predicted by NK models, an increasing number of functional obligations leads to an increasing number of phenotypes with similar relative fitness (and thus to a larger number of adaptive peaks/attractors). The morphospace used by Niklas is three-dimensional. It involves three tasks: light harvesting, mechanical stability and reproduction (Niklas [1997]). The scalar parameters are: (a) p, the probability of branching termination (which is a measure of bifurcation frequency); (b) γ, the rotation angle of the branch; and (c) φ, the branching angle. For each set (p, γ, φ), a fitness function F(p, γ, φ) is defined. In Niklas’s analysis, the three functional obligations (growth, survival and reproduction) can be quantified by means of closed-form equations derived from biophysics and biomechanics.

As illustrated in figure 1.2 (right), the final result of many adaptive walks on this landscape will depend on the number of tasks considered. If only mechanical resistance is chosen, then a single peak (global optimum) is available and a unique morphology is reached. If two tasks are considered simultaneously, then several peaks are accessible through the walks and an increasing diversity of plant forms is obtained. If the three constraints are introduced simultaneously, the whole spectrum of vascular land plant morphologies is obtained. In spite of the limitations implicit in this approximation, the outcome of the multi-task scenario is spectacular. On the other hand, early evolution of plants on land took place on a basically uninhabited landscape and thus it was presumably driven by physical laws. And perhaps more importantly, this work shows that “organic complexity may not impose the severe limits on evolution that are sometimes envisioned” (Niklas [1997]). Instead, it suggests that the evolution of complex organisms (and indirectly that of ecosystems) may lessen the burden of climbing the adaptive peaks of a fitness landscape, by means of the multiplicity of accessible, low-fit peaks.

As will be shown in this dissertation, the building process of a complex technological graph (such as a software design) is likely to be strongly canalized by both evolutionary rules and the presence of multiple, conflicting constraints.

1.3 Natural and artificial designs

Since artificial networks are the main subject of this thesis, it is particularly important to stop here and rethink the possible role of biological evolution in this context. Can we really compare the two types of designs? One fundamental difference between technological and biological design was pointed out early on by the French biologist François Jacob (1976) in his influential paper “Evolution and Tinkering”. As discussed by Jacob, a fundamental difference in this context is the presence of a top-down design coming from the goals pursued by the engineer, who foresees the future. There is an intentional action here that is not shared by any structure generated by biological evolution: man-made artefacts are designed with a purpose associated with their final function, whereas natural structures are not designed in any intentional way. It is reasonable to think of the engineer as tending to approach the highest level of “perfection” compatible with the intended goal and the technology available. Evolution, in this context, might be unable to reach optimal solutions. Such a claim is actually made by Darwin in the Origin of Species. We have to conclude that natural selection does not work as an engineer, but as a tinkerer, who knows what he is going to produce (since some given problem has to be solved) but is limited by the constraints present at all levels of biological organization.

The presence of tinkering has been recognized at multiple scales in biology. An example is given by the widespread re-use of genes and sets of genes at different stages of development: when building an embryo, some given (small) groups of interacting genes are used and re-used several times. Small modules of genes are thus a toolkit for development, and we can find them acting during the formation of limbs, the building of guts and in defining boundaries in neural structures. Although these are very different structures performing very different functions, the building blocks are the same: the classes of dynamical states or transients that they are able to generate are the appropriate ones required for each event.

Examples of artificial designs are electronic circuits (figure 1.4). These are especially interesting examples because they seem to illustrate the relative importance of optimization versus constraints in artificial design. Electronic designs do perform a function and their building process involves a cost: engineers (or the programs used for design) try to reduce the wiring as much as possible. However, although such a goal is easily achieved for small circuits, once we cross some complexity threshold things are much harder. A complex circuit can be built in many different ways, but the difficulties of finding an optimal design performing a given function rapidly increase as we move from small to large systems. Actually, one simple and efficient solution to escape from the constraints derived from such a search is to use pre-specified modules performing some given, but sufficiently flexible, functions. In this way chips become the new basic unit by means of which electronic design can be made easy and achieve complex functions.

The interesting point here is that, when we look at the outcome of evolution or design forces, we often end up with some common features, sometimes surprisingly similar between nature and man-made structures (Ferrer i Cancho et al. [2001]; Solé et al. [2002a]). Before the new approaches to complex networks, such similarities went unnoticed, given the difficulties of comparing systems as different and complex as, for example, the genome and the Internet. With totally new theoretical tools and concepts, we can now extract a large number of regularities at multiple scales and build reasonable conjectures about their possible origins.

Among the original research work contained in this dissertation, we present some of our work on software maps and their highly heterogeneous architecture. By carefully exploring their patterns of organization, we have found evidence for tinkering as a main player in shaping software structures at the global level. Following this view, software maps should be understood as life-like entities shaped by the distributed work of many designers that know where to go but are also constantly making re-use of previous structures, thus converging towards tinkering. The multiple conflicting constraints of the design process, together with further constraints derived from the multiplicative character of the “copy-and-paste” process, are actually the two key ingredients to explain the emergence of complex computational graphs. Understanding and measuring the patterns and processes underlying them requires an appropriate formal framework, presented in the following chapter.

Figure 1.4: Electronic circuits are one of the most obvious examples of man-made artefacts where design is (in principle) the leading force. Optimization for low cost (in terms of wiring) is present, although the complexity of large-scale systems also imposes limitations to design. In fact, once the desired design surpasses a given complexity threshold, engineers widely use (re-use) available modules to link them together in order to perform the desired functions.


Chapter 2

Complex networks: linked

2.1 Introduction

How can architectural complexity be described in quantitative terms? Beyond their differences, all complex systems are typically formed by a large number of interacting units. Out of their interactions, new properties emerge that cannot be reduced to the properties of their individual components in isolation. Complexity is thus intimately tied to networks: describing a complex system involves, as a first step, defining the map of unit interactions. In order to provide a basic framework for the methods and definitions used in the collected papers presented here, we will introduce some basic definitions involving networks and network properties.

The oldest man-made networks involved the transport of matter and energy. Passengers and goods moved using transportation networks, and electricity soon became a key element for the emergence of modern cities. Today, the power grid ensures stable energy resources to millions of users, and other visible and invisible nets cover the entire world. Information networks have been particularly relevant in communication throughout the twentieth century. These webs, once in place, rapidly increase in size and complexity. They stimulate parallel growth of new social and economic activities, which in turn impose increasing demands on the network (Inose [1972]; Standage [1998]). Research into network theory was triggered early by the rise of the telephone. But it was the emergence of the Internet, together with the development of bioinformatics and genome biology, that eventually forced the emergence of a fully developed theory.

Perhaps the most familiar network to all of us is the web of streets in a city. In figure 2.1a we show an example of it: part of the streets of the French city of Arles. This is a spatially-extended system, where the intersections among different streets (continuous lines) are indicated by means of small circles. For comparison, we also plot a graph that has no spatial structure (figure 2.1b). Here we build the web by using samples of written texts. The elements now are words, and they are considered to be “linked” if they appear next to each other within at least one sentence. The order is an important feature of sentence structure and thus it is indicated by means of an arrow (since the reverse link might be forbidden). The resulting word web is clearly an abstract object, with no obvious spatial meaning. We can perceive a big difference with respect to the previous graph: now a few elements have many links whereas most have just one. As will be shown here, such a heterogeneous pattern in link numbers is the rule, rather than the exception, in the real world of information networks.

Figure 2.1: Examples of standard graphs: In (a) the street network is undirected and unweighted. In (b) the language network is directed, but links are simply either present or absent.
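The construction of the word web described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `build_word_network` and `total_degrees` and the toy sentences are assumptions made for the example, not part of the original analysis, and tokenization is reduced to lower-casing and splitting on whitespace.

```python
def build_word_network(sentences):
    """Directed word web: an arrow a -> b means that word b immediately
    follows word a in at least one sentence (hypothetical helper)."""
    successors = {}
    for sentence in sentences:
        words = sentence.lower().split()
        for word in words:
            successors.setdefault(word, set())  # every word is a node
        for first, second in zip(words, words[1:]):
            successors[first].add(second)       # links are present or absent
    return successors


def total_degrees(successors):
    """Number of links (incoming plus outgoing) attached to each word."""
    degree = {word: len(targets) for word, targets in successors.items()}
    for targets in successors.values():
        for target in targets:
            degree[target] += 1
    return degree


if __name__ == "__main__":
    # Toy input, invented for illustration only.
    toy_sentences = ["the cat sat on the mat", "the dog sat on the cat"]
    web = build_word_network(toy_sentences)
    for word, k in sorted(total_degrees(web).items(), key=lambda item: -item[1]):
        print(word, k)
```

Even in this tiny example a frequent word collects more links than the others, which is the kind of heterogeneity the text refers to.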

2.2 Topological patterns

The structure of these webs can be described at different levels of detail. The two previous webs are clearly different in one fundamental way: the first can be placed in a geometrical context, whereas the second cannot. Actually, we can always consider a topological view. Take for example the map of streets of a city and the problem of how to move from a given corner to another. Such a problem only requires the correct indication of the relative position of the nodes. This is particularly clear when using underground maps. They are geometrically highly inaccurate, since typically the real distance among stations is misrepresented in favour of a compact view of the relative position of stations through the map. Using these maps, it is possible to determine how to get to a given station following a given path. The map is in that sense accurate in topological terms.

By topology we refer to the study of the properties of objects that remain unchanged when so-called topological transformations are performed on them. An example of such transformations would be to stretch and fold the surface on which the street map is drawn. Although geometrical features (such as relative Euclidean distances) become altered, the web itself and the sequence of nodes defining different paths remain the same (mathematically, it is said they are topologically invariant).

When other network features must be taken into account, such as dynamical patterns of flow through the links or their cost, further details enter into the web description. If we consider an electronic circuit or a neural network in the brain, where the wiring is costly, the physical distance between elements has to be taken into account, as well as their specific spatial location. If dynamics is under consideration, further layers of complexity must be introduced. In this chapter, topological patterns will be our main objective.

Figure 2.2: Deterministic homogeneous networks: six examples of such webs are shown. All of them share a common property: all nodes have exactly the same number of links (i.e., the same degree). Here we plot the following graphs: (a) square grid (z=4), (b) closed ring (z=2), (c) tree (z=3), (d) complete graph (z=5), (e) three-dimensional cube (z=3) and (f) four-dimensional hypercube (z=4).

2.3 Regular versus scale-free graphs

From the point of view of evolutionary dynamics and universal patterns, the analysis of web structure allows us to formulate new questions: Which networks can be found in nature? Are all possible types of network found in the real world? As discussed in the previous chapter, questions involving the contribution of different mechanisms shaping biological complexity need to be answered by using an appropriate framework. As will be shown in the papers accompanying this dissertation, powerful answers can be obtained by analysing the topological patterns of interactions exhibited by real webs. They actually reveal a rather well-defined number of regularities suggesting some common principles in their evolution (as well as important differences).

Here, we propose a common and unifying perspective on information networks. It is clear that the power grid, the Internet or road networks are different in many ways, e.g., in their purpose, utilization or design process. However, all these systems can be understood from the same perspective. Independently of the specific details of each system, we can recover its underlying topology (network structure). A network is a mathematical object consisting of a set of nodes and a set of links that describes structural patterns between different system parts and ignores many specific details, like physical shape. Nodes represent individual elements and links represent interconnections between these elements. For instance, the physical Internet accepts a network representation where nodes correspond to Internet routers and edges represent physical wires connecting pairs of routers (Pastor-Satorras and Vespignani [2004]).

Some classical examples of highly regular (homogeneous) networks are shown in figure 2.2, where every element has the same number of connections with others. Although these lattice-like structures received considerable attention from the mathematical community, they are far from representing most real networks. Actually, they belong to an extreme in a wide range of possible structures going from totally deterministic and regular to completely random ones. One of the key findings of recent years is the observation that some special features displayed by real graphs, such as the so-called small world effects (Watts and Strogatz [1998]), seem to emerge in some intermediate zone between the regular and the random.

Classical views of random graphs were largely dominated by the work of Paul Erdős. In a random graph, any pair of nodes is linked with a fixed probability p. Random graphs are statistically homogeneous. This homogeneous structure is characterized by measuring the degree distribution P(k), or the probability of a given node having k links to others. It was shown that the degree distribution P(k) for a random graph with N nodes and link probability p follows a binomial distribution:

$$P(k) = \binom{N-1}{k}\, p^{k} (1-p)^{N-1-k}$$

The degree distribution is strongly peaked at the average degree:

$$\langle k \rangle = \sum_{k=0}^{N-1} k\,P(k) = p(N-1) \approx pN$$

For large N, a Poisson distribution replaces the above binomial:

$$P(k) \approx e^{-pN}\,\frac{(pN)^{k}}{k!}$$
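As a quick numerical illustration of how close the binomial and Poisson forms are for a large, sparse random graph, the following minimal sketch evaluates both expressions. The function names and the values N = 1000 and p = 0.005 are arbitrary choices made for this example.

```python
import math


def binomial_degree_pmf(k: int, n_nodes: int, p: float) -> float:
    """Binomial P(k): each of the N-1 possible neighbours is linked
    independently with probability p."""
    return math.comb(n_nodes - 1, k) * p**k * (1 - p) ** (n_nodes - 1 - k)


def poisson_degree_pmf(k: int, n_nodes: int, p: float) -> float:
    """Large-N approximation of P(k) with mean pN."""
    mean = p * n_nodes
    return math.exp(-mean) * mean**k / math.factorial(k)


if __name__ == "__main__":
    n_nodes, p = 1000, 0.005  # illustrative values, average degree close to 5
    for k in range(0, 13, 2):
        print(k,
              round(binomial_degree_pmf(k, n_nodes, p), 5),
              round(poisson_degree_pmf(k, n_nodes, p), 5))
```

Both columns peak around k = 5 and decay quickly on either side, which is what is meant by a distribution strongly peaked at the average degree.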

Real graphs, by contrast, are known to be extremely heterogeneous, displaying a high variability in their degree distribution. In particular, most of them seem to include a small number of elements (the hubs) having many connections with others, whereas most elements have just one or two links. This type of network has been named scale-free (SF), and P(k) follows a scaling law:

$$P(k) \propto k^{-\gamma}$$

The scaling exponent is typically constrained to $2 < \gamma < 3$ (Dorogovtsev and Mendes [2003]). SF networks are known to display some unexpected statistical features. Looking at the moments of the degree distribution, i.e.,

$$M_\mu = \int_1^\infty k^{\mu} P(k)\,dk$$

(with $\mu = 1, 2, \ldots$) and assuming that $P(k) \sim A k^{-\gamma}$, it is easy to show that the average degree is well defined, leading to $\langle k \rangle = (\gamma-1)/(\gamma-2)$, whereas the higher moments are not, since they scale with the upper limit of integration $k$ as

$$M_\mu \sim k^{\mu-\gamma+1}$$

and thus $M_\mu \to \infty$ for $\mu \ge 2$. Fluctuations are thus extremely important and have been shown to be the key for understanding a number of features exhibited by SF architectures. This is the case, for example, of the spreading of computer viruses on the Internet (Pastor-Satorras and Vespignani [2004]).
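The behaviour of the moments can be made concrete with a small numerical sketch. It assumes the normalized continuous form $P(k) = (\gamma-1)k^{-\gamma}$ for $k \ge 1$ and truncates the integral at a cutoff `k_max`; the function name and the cutoff values are assumptions introduced only for illustration.

```python
import math


def truncated_moment(mu: float, gamma: float, k_max: float) -> float:
    """Integral of k**mu * (gamma - 1) * k**(-gamma) from k = 1 to k = k_max."""
    exponent = mu - gamma + 1.0
    prefactor = gamma - 1.0
    if abs(exponent) < 1e-12:
        # Marginal case: the integrand is proportional to 1/k.
        return prefactor * math.log(k_max)
    return prefactor * (k_max**exponent - 1.0) / exponent


if __name__ == "__main__":
    gamma = 2.5  # inside the range 2 < gamma < 3
    for k_max in (1e2, 1e4, 1e6, 1e8):
        first = truncated_moment(1, gamma, k_max)
        second = truncated_moment(2, gamma, k_max)
        print(f"k_max={k_max:.0e}  M1={first:.4f}  M2={second:.1f}")
```

With $\gamma = 2.5$ the first moment settles at $(\gamma-1)/(\gamma-2) = 3$ as the cutoff grows, while the second moment keeps growing as $k_{max}^{1/2}$, in line with the scaling quoted above.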

2.4 Evolution of software networks

How do SF nets originate? There are a number of well-identified processes leading to SF structure. Most of them rely on a growing network displaying some rule of preferential attachment of new nodes. However, it has been suggested that a sparse SF network can actually result from an underlying optimization process in which efficient communication at low cost is involved (Valverde et al. [2002]). But the most interesting implications of SF architecture are related to its high robustness against random node failure, together with a high level of fragility when hubs fail. In other words, information transfer keeps working in an efficient way when a randomly chosen node fails but typically degrades when a highly connected component fails. Such an observation has been shown to have immediate implications for reliable network architecture. Since a system's sensitivity to component failure is a fundamental problem in any area of engineering, it is important to recognize how network topology will influence the system's performance.

An example of such a distribution for real software networks is shown in figure 2.3, together with an example of a small graph, where the hubs can be clearly appreciated, together with a large number of elements having a single connection.

Such scale-free graphs are the result of multiplicative processes. An example is the so-called preferential attachment rule (Barabási and Albert [1999]). But heterogeneous architectures can also be a consequence of a pressure towards achieving good communication at a low cost. We have provided empirical and theoretical evidence of this principle in systems studied in this thesis. What is more important in our context: rules of tinkering, such as duplication of existing nodes plus rewiring, are able to generate such heterogeneous graphs. These rules are known to be operating in biological evolution, and the emergence of the protein interaction network seems to provide a clever example (Solé et al. [2002b]; Pastor-Satorras et al. [2003]).

Figure 2.3: Structural Properties of Software Networks. (Top) Cumulative degree distributions for several class graphs: eMule (N=129, triangles), Blender (N=495, squares) and CrystalSpace (N=1488, circles). Notice how different software systems have a power-law degree distribution $P(k) \propto k^{-\gamma}$ following the same exponent $\gamma = 2.5$. (Bottom) The software network of the videogame ProRally 2002 displays a characteristic asymmetry between the in-degree distribution (open circles) and the out-degree distribution (filled circles). This is a typical pattern displayed by many software systems. On the right, we can appreciate heterogeneous features in a class graph for a medium-sized software system (Aztec). Axes: cumulative frequency versus degree k (log-log); slope labels in the plots: −1.5 and −1.44.

It is less clear that such evolutionary rules apply in software development. However, a careful analysis of motifs in real software maps (Valverde and Solé [2005]) has shown that this is actually the case. Although engineers are building a system with a predefined purpose, they make extensive use of copy-and-paste. The modular architecture of many parts of a large program, and in particular the organization into classes, makes it easy to reuse pre-existing pieces that already have a desired set of properties, followed by a convenient set of modifications. It has been shown that, looking at both large-scale and small-scale features (such as network motifs), it is not difficult to explain the most relevant features of software graphs by means of an extremely simple model of duplication and rewiring. Actually, available data from software development seem to be consistent with a simple, multiplicative process of network evolution. Such a class of growth has been dubbed “growing network with copying” (GNC). In this framework (Krapivsky and Redner [2005]) the network grows through a blind process of duplication and rewiring, introducing a single node at a time. This new node links to m randomly selected target node(s) with probability p, as well as to all ancestor nodes of each target with probability q. The discrete dynamics follows a rate equation:

$$L(N+1) = L(N) + \frac{m}{N}\left\langle \sum_{\mu} \left(p + q\,j_\mu\right) \right\rangle$$

where L and N are the number of links and nodes, respectively. The second term on the right-hand side describes the copying process, where the average number of links added is given by $p + q\,j_\mu$. The index $\mu$ refers to the node $\mu$, selected uniformly from among the N elements. Assuming a continuum approximation, the number of links is driven by the following differential equation:

$$\frac{dL}{dN} = mp + mq\,\frac{L}{N}$$
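The continuum equation above can be explored numerically by iterating its discrete mean-field counterpart, $L(N+1) = L(N) + mp + mq\,L(N)/N$, obtained by replacing the sum over $j_\mu$ with the total number of links. The sketch below is only an illustration under that assumption; the function name and the parameter values are hypothetical.

```python
import math


def mean_field_links(n_final: int, m: float, p: float, q: float,
                     l_initial: float = 0.0) -> float:
    """Iterate L(N+1) = L(N) + m*p + m*q*L(N)/N from N = 1 up to n_final."""
    links = l_initial
    for n in range(1, n_final):
        links += m * p + m * q * links / n
    return links


if __name__ == "__main__":
    n_final = 100_000
    # Marginal case mq = 1: compare with the closed form m*p*N*log(N).
    m, p, q = 2.0, 0.3, 0.5
    marginal = mean_field_links(n_final, m, p, q)
    print("mq = 1  :", marginal / (m * p * n_final * math.log(n_final)))
    # Case mq < 1: links per node approach the constant m*p / (1 - m*q).
    m, p, q = 2.0, 0.3, 0.25
    linear = mean_field_links(n_final, m, p, q)
    print("mq = 0.5:", linear / n_final, "vs", m * p / (1 - m * q))
```

For mq = 1 the ratio to $mpN\log N$ approaches one only slowly, because the correction is of relative order $1/\log N$; for mq < 1 the number of links per node tends to a constant.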


The asymptotic growth of the average total number of links depends on the extent of copying defined by the product mq. In particular, logarithmic growth is recovered when mq=1, with $L(N) = mpN \log N$. This corresponds to a marginal situation separating a domain of linear growth (mq<1) from a domain of superlinear, power-law growth $L \sim N^{mq}$ (2>mq>1). Interestingly, for mq=1 the GNC model predicts a power-law in-degree distribution $P_i(k) \approx k^{-\gamma_i}$ with exponent $\gamma_i = 2$ and an exponential out-degree distribution $P_o(k)$, independently of the copying parameters.

When looking at the time dynamics of software development, we found that they exhibit the patterns predicted by GNC scenarios, close to the mq=1 regime. They include the sparseness, the asymmetries found in the in- and out-degree distributions (see figure 2.3), the small worldness and the time-dependent logarithmic growth. Indeed, the sparseness seen in software maps is likely to result from a compromise between having enough dependencies to provide diversity and complexity (which require more links) and evolvability and flexibility (requiring fewer connections). Here we have uneven, but detailed, information on the process of software building. In this context, different software project developments display specific patterns of growth. Specifically, the number of nodes N grows with time following a case-dependent functional form $N = \Phi(t)$. Using $dL/dt = (dL/dN)(d\Phi/dt)$ we have:

$$\frac{dL}{dt} = \left[\, mp + mq\,\frac{L}{\Phi(t)} \,\right] \dot{\Phi}$$

with a general solution

$$L(t) = e^{\,mq \int (\dot{\Phi}/\Phi)\,dt} \left[\, mp \int e^{-mq \int (\dot{\Phi}/\Phi)\,dt}\; \dot{\Phi}\,dt + \Gamma \,\right]$$

where $\Gamma$ is a constant. Using a linear growth law (which is not uncommon in software development), i.e. $N(t) = N_0 + at$, and assuming mq=1, we have:

$$L(t) = (N_0 + at)\left[\, mp \log\!\left(\frac{N_0 + at}{N_0}\right) + \frac{L_0}{N_0} \,\right]$$

Our analysis of available data sets of software development supports this scenario (Valverde and Solé). This agreement suggests that, beyond the specific details of the development process and (what is more unexpected) the specific function performed, topological patterns emerge from the constraints imposed by the rules of software growth. In this context, the studies presented here indicate that function is adapted to the emerging architecture through the process, instead of being responsible for the final pattern.
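As a consistency check on the closed form for linear growth, the solution can be recovered directly from the continuum equation $dL/dN = mp + mq\,L/N$ and the chain rule quoted in the text. The following short derivation assumes $N(t) = N_0 + at$, $mq = 1$ and the initial condition $L(0) = L_0$; the intermediate variable $L/N$ is introduced here only for convenience.

```latex
\begin{aligned}
\frac{dL}{dt} &= \frac{dL}{dN}\,\frac{dN}{dt}
               = a\left(mp + \frac{L}{N}\right), \\[4pt]
\frac{d}{dt}\!\left(\frac{L}{N}\right)
              &= \frac{1}{N}\frac{dL}{dt} - \frac{a\,L}{N^{2}}
               = \frac{a\,mp}{N_0 + at}, \\[4pt]
\frac{L(t)}{N(t)} &= \frac{L_0}{N_0} + mp\,\log\!\left(\frac{N_0 + at}{N_0}\right)
\quad\Longrightarrow\quad
L(t) = (N_0 + at)\left[\, mp\,\log\!\left(\frac{N_0 + at}{N_0}\right) + \frac{L_0}{N_0} \right].
\end{aligned}
```

The number of links per node, $L/N$, therefore grows only logarithmically in time, which is the sense in which the growth is called logarithmic.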

In order to fully characterize some of the graph properties relevant to our study within an evolutionary context, we have also derived some general information-theoretic characterizations of complex, static graphs. Our results within this analysis also indicate that the possible universe of complex networks is actually rather constrained (Solé and Valverde [2004]). Such a constrained set of possibilities fits very well the view of evolution as strongly dominated by intrinsic constraints (Jacob [1976]; Alberch [1989]; Kauffman [1993]; Goodwin [1994]; Gould [2003]). The outcome of evolutionary searches would be not any possible architecture but a choice from a narrow subset of attainable structures. This thesis thus provides a remarkable example of constrained structures embedded in the logical organization of object-oriented software (i.e., class graphs). Eventually, models of software structure should provide insights into how internal and external forces constrain processes of artificial design.


Chapter 3

Information dynamics

3.1 Introduction

Understanding information networks also requires going beyond topological organization. We have used the Internet as our basic framework in order to explore the importance of self-organization and topology in dynamical systems exchanging information among multiple agents. To this end we considered a progressive approach to the dynamics of distributed arrays of hosts and routers with different topologies.

Our starting point involves the realization that Internet dynamics is complex, exhibits a wide array of (multi-fractal) fluctuations and that its time-dependent behaviour is the result of a self-organization process. Additionally, recent studies have revealed that phase transition phenomena arise in Internet traffic and are amenable to quantitative analysis by means of appropriate tools from statistical physics. An interesting ingredient in this context is the presence of sudden changes in traffic flows in arrays of computers with grid-like architecture. It can be shown that in a network of routers and hosts generating packets and circulating them towards randomly chosen targets, a phase transition between fluid traffic and a congested phase with accumulating packets takes place as the density of hosts or the rate of packets sent increases beyond a certain threshold. Once the congestion phase is reached, jamming takes place. Is it possible to explain such a transition in terms of a simple, statistical physics approach not taking into account the details of the network elements or the routing algorithms used in driving the packets to their targets? Since the Internet is itself a complex network, we might start by asking ourselves how topology can influence (and eventually explain) the observed patterns of fluctuations, which also follow scaling laws.

Our analysis using lattice models of traffic dynamics in parallel arrays revealed that complex fluctuations can be easily obtained in regular systems provided that we are close to the phase transition point (Solé and Valverde [2001]). Some of the key scaling laws observed in real situations, both in terms of queue lengths and time delays, were shown to be reproducible with our simple model. An interesting aspect of the dynamical patterns exhibited was the presence of a maximum in entropy (uncertainty) and traffic flows close to the phase transition. In other words, the most efficient state with the maximum number of delivered packets was also associated with the most unpredictable regime, plagued with fluctuations of all sizes. Such a result was found to be consistent with previous work on phase transitions in models of road traffic, and indicates that self-organizing systems exhibiting complex, power-law fluctuations are likely to take advantage of the phase transition. The next step in our analysis was to consider the possibility that agents could regulate their activity rates (Valverde and Solé [2002]), thus introducing varying levels of traffic. Since congestion directly influences the behaviour of agents, there is a feedback here between agents and the system that could potentially lead to self-organized critical behaviour.

Figure 3.1: Examples of time series displayed by the local creation rate at three given (host) lattice points, indicated by arrows on the network, where the periodic boundary conditions are explicitly displayed. Lighter nodes indicate larger congestion. The average creation rate is close to criticality and thus an efficient global flow regime is achieved. However, we can appreciate a wide variety of different source behaviors. Some nodes display self-similar behavior and wild fluctuations (bottom), others display periodic patterns of activity (top) and others are silenced during relatively long periods of time (dark nodes).

3.2 Self-organized Internet traffic

Internet users have different private goals and they are not forced to cooperate. However, a social goal may be to minimize the current Internet load or the total amount of traveling packets. For instance, some user activities, like web surfing, depend on the perceived level of system responsiveness. This suggests that a feedback between congestion level and user activity is at work. We have explicitly explored the consequences of this feedback in an over-simplified lattice model of network traffic (Valverde and Solé [2002]). In this model, traffic sources deliver packets to random destinations until a signal of congestion is received from the network. Congestion tells the source to stop sending packets until enough communication resources are released. We have shown that simple delivery rules like this, inspired by real Internet congestion avoidance mechanisms such as AIMD (Additive Increase and Multiplicative Decrease), can self-organize the system. An efficient flow regime is achieved at criticality, where unpredictability is maximized and traffic sources display a wide spectrum of different activity patterns (see figure 3.1). A mean-field model for the total density of traffic packets shows that the critical point moves with the density of hosts.

Figure 3.2: Scaling dependence between the average critical λ and the host density ρ. Simulations have been performed on a L=64 lattice. The dashed line displays the predicted functional relation as derived from the mean field theory (see text). Axes: λ(ρ) versus ρ (log-log); slope label in the plot: −1.03.
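For readers unfamiliar with AIMD, the generic idea can be sketched as a single update rule for a source's sending rate. This is a textbook-style illustration, not the delivery rule of the lattice model itself; the function name, the step sizes and the congestion signal sequence are all hypothetical.

```python
def aimd_update(rate: float, congested: bool,
                increase: float = 0.01, decrease: float = 0.5,
                max_rate: float = 1.0) -> float:
    """One AIMD step: add a small constant while the network is clear,
    multiply by a factor below one when congestion is signalled."""
    if congested:
        return rate * decrease
    return min(max_rate, rate + increase)


if __name__ == "__main__":
    # Invented congestion signals: clear for a while, then two congestion events.
    signals = [False] * 5 + [True] + [False] * 3 + [True]
    rate = 0.10
    for step, congested in enumerate(signals):
        rate = aimd_update(rate, congested)
        print(step, "congested" if congested else "clear", round(rate, 4))
```

The asymmetry between slow additive growth and sharp multiplicative cuts is what couples each source's activity to the congestion level it perceives.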

We have explored the relationship between congestion propagation and the global jamming transition in the lattice model of network traffic. It seems that microscopic details have little influence on global traffic dynamics at the jamming transition. On the other hand, queues enable the interaction of travelling packets coming from different flows. A jammed router can initiate a congestion wave that propagates to empty neighbouring routers (Fukuda et al. [1999]). Despite evident differences, congestion propagation from one router to another resembles the phenomenon of congestion waves in road traffic. Congestion propagation in the Internet might be related to empirical studies demonstrating long-range correlations. Moreover, nonlinearities of queuing systems cannot explain the origin of these long-range correlations. Instead, statistical features of traffic at the jamming transition depend on some average quantities, like the amount of free space at routers or the complement (1−Γ) of the local packet density (Γ). This suggests the following mean-field equation of packet density evolution:

$$\frac{d\Gamma}{dt} = \rho\lambda - \frac{\langle k \rangle}{D}\,\Gamma\,(1-\Gamma) \tag{3.1}$$

where hosts deliver an average number of ρλ new packets to the network, and the negative term indicates the rate of packet removal, which is proportional to the average node degree $\langle k \rangle$ and inversely proportional to the average latency time (estimated by the average path length D, or the average number of intermediate routers connecting any pair of source and destination hosts). Notice how this equation brings together traffic density, external traffic sources, router connectivity, and waiting time. Global jamming transition takes place when $N = L^2$, where the network attains the efficient flow regime. Assuming that some feedback exists between the rate of packet creation and packet density (see fig. 8, bottom right), the previous mean-field model yields that the critical packet creation rate scales with the density of hosts:

$$\lambda \sim \rho^{-1} \tag{3.2}$$

As shown in (Valverde and Solé [2002]), the above scaling relation holds in simulated models (see figure 3.2), together with the presence of several scaling properties consistent with a critical state. For instance, the distribution of congestion duration lengths reflects a wide variation in queue lengths at this jamming transition (ρ < 1); at any moment, some queues are underloaded, others are jammed or saturated and others are at the queuing transition.
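One way to see where an inverse dependence on ρ can come from is to look at the stationary states of equation (3.1). The sketch below assumes, as an interpretation made only for this example, that the critical creation rate is the largest λ for which a stationary density exists, i.e. for which ρλ does not exceed the maximum of the removal term. The function names and the values of ⟨k⟩ and D are hypothetical.

```python
def removal_rate(density: float, mean_degree: float, path_length: float) -> float:
    """Removal term of equation (3.1): (<k>/D) * Gamma * (1 - Gamma)."""
    return mean_degree / path_length * density * (1.0 - density)


def critical_creation_rate(rho: float, mean_degree: float,
                           path_length: float, grid: int = 10001) -> float:
    """Largest lambda for which rho*lambda can still be balanced by removal.

    The maximum of the removal term is located by a brute-force scan of the
    packet density Gamma over [0, 1]."""
    best = max(removal_rate(i / (grid - 1), mean_degree, path_length)
               for i in range(grid))
    return best / rho


if __name__ == "__main__":
    mean_degree, path_length = 4.0, 32.0  # illustrative values only
    for rho in (0.05, 0.1, 0.2, 0.4):
        lam = critical_creation_rate(rho, mean_degree, path_length)
        print(rho, round(lam, 5), round(rho * lam, 5))
```

Under this assumption the product ρλ stays constant (equal to ⟨k⟩/(4D), reached at Γ = 1/2), so the critical rate falls as the inverse of the host density.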

Although the previous results are obtained by using regular lattice topologies or mean-field approximations, we have extended some of these ideas to a realistic model of Internet topology, since regular lattices fail to describe it. Instead, the Internet is a remarkable example of a scale-free architecture. But not every model of scale-free network fits the real Internet structure, because of spatial constraints. It has been shown that the network model described in (Yook et al. [2002]) falls in the same region of the phase space where the real Internet is located. Many studies of transport on scale-free networks deal almost exclusively with the Barabási-Albert model of scale-free network, which does not take into account the spatial restrictions. In this context, the work presented here is the first simulation study of network traffic that deals with such a realistic model of Internet topology.

3.3 Path horizon and network topologyWe have analysed the behaviour of Internet traffic with a topologically realistic spatialstructure as described in a previous study (Yook et al. [2002]). Our model involves self-regulation of packet generation, finite queue sizes and different levels of routing depth (seefig. 3.3). A mean field model of the density of packets related to the rate of packet releasewas described, shown to consistently reproduce the observed patterns. An especiallyimportant finding was to see that the fluctuations generated by the system scaled in a wayconsistent with a self-organized, internally-generated process. The exponents derivedfrom the model are in full agreement with those seen in the Internet, thus supporting ourview that complex dynamics in such systems are largely due to the feedback between theoverall system state and the responses provided by the users.

In order to explicitly consider the feedback between order and control parameters, we can consider a new mean field approximation based on the previous local rules. It can be


Figure 3.3: Network dynamics and packet routing: it involves a given depth of routing m (a path horizon). Nodes cannot store detailed information about the entire network. This limitation necessarily introduces an amount of disorder in routing decisions. A packet traveling from s_j to i within the δ_m(s_i) domain (of depth m) is deterministically routed along the shortest path (i.e., for d(j, i) ≤ m). The packet traverses hosts (squares) and routers (circles) indistinctly. Packets traveling outside the m-domain (i.e., for d(j, i) > m) have more than one path choice and perform random walks. As soon as the packet enters the m-domain, it is deterministically routed along the shortest path. Here we would have m = 2.

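shown, assuming finite H, that the new set of equations is now:

$$\frac{d\Gamma}{dt} = \rho\lambda\left(1 - \frac{\Gamma}{H}\right) - \frac{\langle k \rangle}{D}\,\Gamma$$

$$\frac{d\lambda}{dt} = \mu\,(1 - \lambda) - \frac{\Gamma}{\langle k \rangle}$$

For low density levels (i.e., consistent with fluid traffic) the single fixed point (obtained from dΓ/dt = dλ/dt = 0) is

$$(\Gamma^*, \lambda^*) = \left( \frac{1}{\dfrac{\langle k \rangle}{\rho D} + \dfrac{1}{\mu \langle k \rangle}},\; 1 - \frac{\Gamma^*}{\mu \langle k \rangle} \right)$$

The Jacobi matrix L for the previous set of equations is given by

$$L = \begin{pmatrix} -\dfrac{\langle k \rangle}{D} & \rho \\ -\dfrac{1}{\langle k \rangle} & -\mu \end{pmatrix}$$

The associated eigenvalues are

$$\Lambda_\pm = \frac{1}{2}\left[ -\left(\frac{\langle k \rangle}{D} + \mu\right) \pm \sqrt{\left(\frac{\langle k \rangle}{D} + \mu\right)^2 - 4\left(\frac{\mu \langle k \rangle}{D} + \frac{\rho}{\langle k \rangle}\right)} \right]$$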


Figure 3.4: Fraction of visited nodes F(m) (vertical axis, from 0.5 to 1.0) versus path horizon m (horizontal axis, from 0 to 10). F(m) depends on the amount of routing information available at any node, which is set by the depth of routing parameter m. This parameter defines a sharp transition separating two clearly defined routing schemes. For depth of routing values below the critical point (also known as path horizon), packets move like random walkers and load is effectively distributed among all available nodes (i.e., F(m) ≅ 1). On the other hand, determinism sharply constrains the routing paths taken by packets when m is above the path horizon (i.e., F(m) < 1). In this regime, packets move along the shortest paths, minimizing the number of hops from source to destination. However, congestion cannot be alleviated by increasing the number of physical links. At the path horizon, an optimal balance between efficiency and load balancing is achieved and the flow of packets is maximized.

both of them real and negative: the point attractor is globally stable. These theoretical predictions agree with numerical simulations. Also notice how topological features are included only as mean quantities, here the average degree 〈k〉 and the network diameter D.
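The stability analysis can be checked numerically. The following Python sketch assumes that the mean field system above reads dΓ/dt = ρλ(1 − Γ/H) − (〈k〉/D)Γ and dλ/dt = µ(1 − λ) − Γ/〈k〉, and drops the Γ/H term in the low-density limit. The function names and the parameter values are illustrative assumptions, not taken from the thesis.

```python
import cmath

def fixed_point(rho, mu, k_mean, D):
    """Low-density fixed point (Gamma*, lambda*), neglecting the Gamma/H term."""
    gamma_star = 1.0 / (k_mean / (rho * D) + 1.0 / (mu * k_mean))
    lambda_star = 1.0 - gamma_star / (mu * k_mean)
    return gamma_star, lambda_star

def eigenvalues(rho, mu, k_mean, D):
    """Eigenvalues of the Jacobian [[-<k>/D, rho], [-1/<k>, -mu]]."""
    a = k_mean / D
    trace = -(a + mu)
    det = a * mu + rho / k_mean
    # cmath.sqrt also copes with a negative discriminant
    root = cmath.sqrt(trace * trace - 4.0 * det)
    return (trace + root) / 2.0, (trace - root) / 2.0

# Illustrative (hypothetical) parameter values
rho, mu, k_mean, D = 0.01, 0.05, 3.0, 6.0
gamma_star, lambda_star = fixed_point(rho, mu, k_mean, D)
lam_plus, lam_minus = eigenvalues(rho, mu, k_mean, D)
print("fixed point:", gamma_star, lambda_star)
print("eigenvalues:", lam_plus, lam_minus)
```

For these values both eigenvalues have negative real part, so small perturbations around the fixed point decay.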

Moreover, we also reported the existence of a critical path horizon defining a transition from low-efficiency traffic to highly efficient flow (see fig. 3.4). This transition is actually a direct consequence of the Internet's small world architecture exploited by the routing algorithm. Once routing tables reach the network diameter, the traffic experiences a sudden transition from low-efficiency to highly efficient behaviour. We conjectured that routing policies might have spontaneously reached such a compromise in a distributed manner. The Internet could thus be operating close to such a critical path horizon.
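The routing rule behind the path horizon can be sketched in a few lines of Python. This is a minimal illustration of the rule stated in the caption of Figure 3.3 (shortest-path routing when the destination is within distance m, a random walk otherwise), not the simulation code used in the thesis. The graph representation, the function names and the ring topology are assumptions made for the example, and the graph is assumed to be connected.

```python
import random
from collections import deque

def bfs_distances(graph, source):
    """Hop distance from `source` to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def route(graph, src, dst, m, rng, max_hops=10_000):
    """Return the list of nodes visited by one packet travelling src -> dst."""
    dist = bfs_distances(graph, dst)
    path = [src]
    node = src
    while node != dst and len(path) <= max_hops:
        if dist[node] <= m:
            # inside the m-domain: follow a shortest path deterministically
            node = min(v for v in graph[node] if dist[v] == dist[node] - 1)
        else:
            # outside the m-domain: random walk
            node = rng.choice(graph[node])
        path.append(node)
    return path

# Hypothetical test topology: a ring of 12 nodes (diameter 6)
ring = {i: [(i - 1) % 12, (i + 1) % 12] for i in range(12)}
rng = random.Random(1)
for m in (0, 3, 6):
    path = route(ring, 0, 6, m, rng)
    print(m, len(path) - 1, len(set(path)) / len(ring))
```

Each line prints the horizon m, the number of hops taken and the fraction of distinct nodes visited by that packet. When m reaches the diameter the packet takes a shortest path.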

Finally, the model is consistent with the observation of Internet fluctuations. Many studies on Internet traffic describe the local fluctuations and behavior of a single link, a router or a small network comprising few nodes. Another approach quantifies the differences in traffic and performance over a large number of routers and physical links. To understand the competition between a network's internal collective dynamics (i.e., Internet traffic) and external environmental pressures (i.e., behavior of traffic sources or hosts, distribution of file sizes) it is useful to study the relationship between the average flow and the size of fluctuations around the average (Argollo de Menezes and Barabasi [2004]). The dispersion scales with the average flux as a power law, where the exponent σ can take the values 1/2 or 1.

Argollo de Menezes and Barabasi found that many different systems can be classified into two distinct dynamical classes depending on the value of this exponent. On the one hand, the σ=1/2 exponent captures an endogenous behavior, determined by the system's internal collective fluctuations. On the other hand, the σ=1 exponent indicates exogenous dynamics driven by the environment. Interestingly, the σ=1 exponent is universal, that is, independent of the nature of the internal dynamics or the network topology. The analysis of flow from real Internet routers has revealed a σ=1/2 exponent, thus indicating that Internet dynamics has an endogenous origin. This is consistent with measurements taken from our simulations, which predict the same σ=1/2 exponent (Valverde and Sole [2004]). Real data show that global traffic dynamics cannot simply be reduced to a superposition of many highly variable (infinite variance) traffic sources or other external features like the distribution of file sizes (Willinger et al. [1995]).
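The two dynamical classes can be illustrated with a toy calculation. The Python sketch below is an assumption-laden illustration, not the analysis of the cited works. It builds two synthetic ensembles of flux time series, one dominated by internal randomness (independent packet arrivals at each node) and one driven by a common external demand, and estimates the exponent as the least-squares slope of log-dispersion against log-mean flux. All names and parameter values are hypothetical.

```python
import math
import random
import statistics

def scaling_exponent(series):
    """Least-squares slope of log(dispersion) versus log(mean flux)."""
    xs = [math.log(statistics.fmean(s)) for s in series]
    ys = [math.log(statistics.pstdev(s)) for s in series]
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

def endogenous_series(sizes, steps, rng, p=0.1):
    """Flux at a node = how many of `size` independent packets arrive per step."""
    return [[sum(rng.random() < p for _ in range(size)) for _ in range(steps)]
            for size in sizes]

def exogenous_series(sizes, steps, rng):
    """Every node carries a fixed share of one fluctuating external demand."""
    drive = [rng.uniform(0.5, 1.5) for _ in range(steps)]
    return [[size * w for w in drive] for size in sizes]

rng = random.Random(0)
sizes = [10, 20, 40, 80, 160, 320]
print("endogenous:", round(scaling_exponent(endogenous_series(sizes, 400, rng)), 2))
print("exogenous: ", round(scaling_exponent(exogenous_series(sizes, 400, rng)), 2))
```

In this toy setting the first estimate should come out close to 1/2 and the second close to 1, mirroring the endogenous and exogenous classes described above.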


Chapter 4

Articles

4.1 Selection, Tinkering and Emergence in Complex Networks


Selection, Tinkering, and Emergence in Complex Networks

Crossing the Land of Tinkering

RICARD V. SOLE,1,2 RAMON FERRER-CANCHO,1 JOSE M. MONTOYA,1,3 AND SERGI VALVERDE1

1 ICREA-Complex Systems Lab, GRIB-UPF, Barcelona, Spain

2 Santa Fe Institute, Santa Fe, NM 87501

3 Department of Ecology, University of Alcala, Madrid, Spain

1. INTRODUCTION

The emergence of the telegraph marked the appearance of totally new social and economic exchanges. As a technological innovation, it defined a new scenario of communication and information processing within human societies that led to the creation of new, previously nonexistent structures [1]. The resulting telegraph network rapidly increased in size (after a delay in its acceptance as a real, useful innovation) and at its climax involved a whole network with millions of users, fully developed codes, encrypted messages and code crackers, chats, and congestion problems.

In many ways, the telegraph network was very similar to the Internet. At some point, the emergence of a new innovation (such as the telephone, Figure 1; based on [2]) triggered the fall of the rich telegraph network. A whole culture of human communication and the networks that covered most urban centers around the globe vanished in a few decades. The telegraph went extinct, as many species and innovations do through biological evolution, once a new, highly competitive novelty emerges. However, as happens with most extinct life forms, the underlying innovations introduced by the telegraph still persist in modern communication networks.

The previous example is interesting for two reasons. First, because there is a common pattern between different types of communication networks that suggests common principles of organization. Second, because it also illustrates the presence of underlying, subtle connections between technological and biological evolution. This observation is not new: different sources of evidence and theoretical arguments indicate that technological innovation shares some basic traits with the patterns displayed by biological novelty [3]. The rise and fall of technological creations also resembles the origination and extinction patterns observable in some groups of organisms, and Jacques Monod actually suggested that the evolution of technology is sometimes closer to Darwinian selection than biology itself [4].

Ricard V. Sole is a physicist and biologist working on complex systems theory, with particular interest in self-organized criticality, theoretical ecology, evolution of RNA viruses, macroevolution and extinction, collective intelligence, theoretical aspects of graph dynamics, and developmental biology. He is with ICREA-Complex Systems Lab, GRIB-UPF, Barcelona, Spain and Santa Fe Institute, Santa Fe, NM 87501.

Ramon Ferrer i Cancho is a computer scientist working on complex networks and the evolution and structure of language. He is with ICREA-Complex Systems Lab, GRIB-UPF, Barcelona, Spain.


20 C O M P L E X I T Y © 2003 Wiley Periodicals, Inc., Vol. 8, No. 1


But are the patterns of organization displayed by complex biosystems deeply related to those displayed by technological structures?

The previous question was raised by a number of authors within the context of evolution. As discussed by Francois Jacob in his influential 1977 paper “Evolution as Tinkering,” one important source of divergence between engineering (technology) and evolution is, first, that the engineer works according to a preconceived plan (in that he foresees the product of his efforts) and, second, that in order to build a new system a completely new design and units can be used without resorting to previous designs [5].

Jacob also mentions the point that the engineer will tend to approach the highest level of perfection (measured in some way) compatible with the technology available. Evolution, it is argued, is far from perfection, a point already made by Darwin in the Origin of Species. Jacob's conclusion is that natural selection does not work as an engineer, but as a tinkerer, who knows what he is going to produce but is limited by the constraints present at all levels of biological organization as well as by historical circumstances [5].

Although the presence of historical contingencies certainly plays a role in evolution [6-8], recent studies on fractal transport networks in biology seem to provide strong support for the presence of effective optimization processes [9]. Specifically, when looking at the general principles of biological scaling, the assumption of a minimization of the energy required to transport materials through the network (assuming that it has a hierarchical, space-filling structure) leads to a remarkable agreement with the diversity of biological structures and functions observed in nature through many orders of magnitude in size. Optimization would then be able to operate in a successful manner at least when the constraints are easily avoided due to the flexibility allowed by the underlying rules of network construction. When looking at some artificial networks, such as the local telephone network displayed in Figure 1, we can often appreciate a hierarchical organization in the tree-shaped structure. However, once some complexity threshold is reached, the final structure tends to deviate strongly from a tree structure.

A different view of evolution implies the existence of constraints derived from the fundamental limitations exhibited by dynamical systems [3,10,11]. When looking at the macroscopic level (such as the organism level), strong regularities are perceived that indicate the presence of a limited (though diverse and tunable) range of basic structural plans of organization. Under this view, in spite of the historical contingency intrinsic to the evolutionary process, life forms would nevertheless be predictable, at least to some extent. In this context, it has been suggested that emergent phenomena might play a leading role in shaping biological evolution. An example that might illustrate this idea is provided by the presence of phase transitions in random graphs [3,12,13].

Let us consider a graph Ω_{n,p} that consists of n nodes (or vertices) joined by links (or edges) with some probability p. Specifically, each possible edge between two given nodes occurs with a probability p. The average number of links (also called the average degree) of a given node will be z ≈ np, and it can be easily shown that the probability p(k) that a vertex has a degree k follows a Poisson distribution. This so-called Erdős-Rényi random graph will be fairly well characterized by an average degree z (where the distribution p(k) shows a peak).

This model displays a phase transition at a given critical average degree z_c = 1 (Figure 2). At this critical point, a giant component forms [12,13]: for z > z_c a large fraction of nodes are connected in a web, whereas for z < z_c the system is fragmented into small subwebs. This type of random model has been used in different contexts, including

Jose Maria Montoya is a theoretical ecologist concerned with ecological conservation. He is working on ecological networks and community assembly. He is with ICREA-Complex Systems Lab, GRIB-UPF, Barcelona, Spain and Department of Ecology, University of Alcala, Madrid, Spain.

Sergi Valverde is a software engineer at Ubisoft working on artificial networks and the dynamics of computer networks (Internet). He is also at the Complex Systems Lab.

FIGURE 1

Standard communication networks involve a large amount of hierarchy. In the case of the telephone network, the terminals are telephone sets and a node is a switching center for routing telephone calls. Here the graph of a local telephone net is shown. The tips of the tree represent telephones (from Inose [2]).


ecological [14,15], genetic and metabolic [3,16] and neural [17,18] networks. The importance of this phenomenon is obvious in terms of the collective properties that arise at the critical point: communication among the whole system becomes available (thus information can flow from the units to the whole system and back). Besides, the transition occurs suddenly and implies an innovation. No less important, it takes place at a low cost in terms of the number of required links. Since the only requirement in order to reach whole communication is to have a given (small) number of links per node, once the threshold is reached, order can emerge for free [3].
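The emergence of the giant component is easy to reproduce numerically. The Python sketch below is an illustration with hypothetical function names and arbitrary network size, not code from the article. It wires each pair of nodes with probability p = z/n and reports the fraction of nodes in the largest connected component for average degrees below, at and above the critical value.

```python
import random

def er_graph(n, p, rng):
    """Random graph: each of the n(n-1)/2 possible links exists with probability p."""
    adj = {i: [] for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < p:
                adj[i].append(j)
                adj[j].append(i)
    return adj

def largest_component_fraction(adj):
    """Fraction of nodes that belong to the largest connected component."""
    seen = set()
    best = 0
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        stack = [start]
        size = 0
        while stack:
            u = stack.pop()
            size += 1
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        best = max(best, size)
    return best / len(adj)

rng = random.Random(42)
n = 500
for z in (0.5, 1.0, 2.0):
    fraction = largest_component_fraction(er_graph(n, z / n, rng))
    print(f"z = {z}: largest component holds {fraction:.2f} of the nodes")
```

Below the threshold the largest component holds only a small fraction of the nodes, whereas above it most nodes belong to a single web.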

A new theoretical framework, provided by the study of complex networks, might help to answer some key questions raised by the previous views. In particular, the discovery that both natural and artificial systems display a high heterogeneity in their wiring maps challenges the early approaches based on purely random graphs [14,16] and provides a new picture of how complexity (defined in terms of the interactions among a system's parts) might emerge. In this article the different features exhibited by four types of natural and artificial networks are reviewed, after a brief account of the basic quantitative characterizations that allow network complexity to be measured. Some key questions that will be explored are:

1. What mechanisms have originated observed topological regularities in complex networks?

2. To what extent does optimization shape network topology?

3. What is the origin of homeostasis in complex networks?

4. Is homeostasis a driving force or a side effect in network topology?

5. Is tinkering an inevitable component of network evolution?

6. Are engineered systems free of tinkering?

Comparison between the mechanisms that drive the building process of different graphs reveals that optimization might be a driving force, canalized in biological systems by both tinkering and the presence of conflicting constraints common to any hard multidimensional optimization process. Conversely, the presence of global features in technology graphs that closely resemble those observed in biological webs indicates that, in spite of the engineered design that should lead to hierarchical structures (such as the one shown in Figure 1), the tinkerer seems to be at work.

2. MEASURING NETWORK COMPLEXITY

Since we are interested in comparing the global features of both biological and artificial (engineered) networks, we need to consider a number of quantitative measures in order to characterize them properly. In order to do so, the network structure is represented by a graph Ω, as before.

Some of these measures (minimal distance, clustering coefficient) are usually applied to topological (i.e., static) descriptors of the graph structure, but others (entropy, redundancy, degeneracy) also apply to states that average dynamic variables. Most of these measures are unable to explicitly capture a functional organization and thus need complementary knowledge of the underlying system. Although an engineered system might look similar to a given biological network, the latter usually exhibits a high tolerance to the failure of single units through different homeostatic mechanisms that is seldom displayed by the former.

Small World Patterns

Recent research on a number of biological, social and technological graphs revealed that they share a common feature: the so-called small world (SW) property [19,20]. Small world graphs have a number of surprising features that make them especially relevant to understanding how interactions among individuals, metabolites, or species lead to the robustness and homeostasis observed in nature. The SW pattern can be detected from the analysis of the path length d, defined as the average minimum distance between any pair of nodes. For ER graphs, we have very short distances.

FIGURE 2

Phase transition in random graphs: In (A-D) four different states of the wiring process are shown for a small graph with n = 12 nodes. Here (A) z = 0.22, (B) z = 0.6, (C) z = z_c = 1, and (D) z = 2. A phase transition occurs at z_c = 1, where the fraction of nodes included in the giant component rapidly increases. This picture corresponds to a very small system. Appropriate characterizations of the phase transition require large network sizes.


Specifically, it can be shown that d_ER ≈ log(n)/log(z). Graphs where d ≈ d_ER are said to be “small-world.” A SW can be obtained from a regular lattice (where nodes are linked to z nearest neighbors) if a small fraction of links are rewired to randomly chosen nodes. Thus a small degree of disorder generates short paths (as in the random case) while retaining the regular pattern [19]. Small world networks propagate information very efficiently.

The so-called clustering coefficient C measures the probability that two neighbors of a given node are also neighbors of one another. For an ER graph, C_ER ≈ z/n, and it is thus a very small quantity. Watts and Strogatz [19] noticed that C ≫ C_random when looking at real networks. High clustering favours small-worldness but it is not the only mechanism [21].
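Both quantities can be measured on a toy example. The Python sketch below, written in the spirit of the rewiring construction of [19], builds a ring lattice, rewires a small fraction of its links and compares the path length d and the clustering coefficient C before and after. The function names, the network size and the exact rewiring rule (one endpoint moved to a random non-neighbor) are assumptions made for the illustration.

```python
import random
from collections import deque

def ring_lattice(n, z):
    """Each node is linked to its z nearest neighbours on a ring (z even)."""
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for step in range(1, z // 2 + 1):
            j = (i + step) % n
            adj[i].add(j)
            adj[j].add(i)
    return adj

def rewire(adj, fraction, rng):
    """With probability `fraction`, move one end of a link to a random node."""
    n = len(adj)
    edges = [(i, j) for i in adj for j in adj[i] if i < j]
    for i, j in edges:
        if rng.random() < fraction:
            candidates = [v for v in range(n) if v != i and v not in adj[i]]
            if candidates:
                new = rng.choice(candidates)
                adj[i].discard(j)
                adj[j].discard(i)
                adj[i].add(new)
                adj[new].add(i)
    return adj

def average_path_length(adj):
    """Mean shortest-path distance d over all connected ordered pairs."""
    total = pairs = 0
    for s in adj:
        dist = {s: 0}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs

def clustering(adj):
    """Average over nodes of the fraction of neighbour pairs that are linked."""
    values = []
    for u, nb in adj.items():
        k = len(nb)
        if k < 2:
            continue
        links = sum(1 for a in nb for b in nb if a < b and b in adj[a])
        values.append(2.0 * links / (k * (k - 1)))
    return sum(values) / len(values)

lattice = ring_lattice(200, 6)
small_world = rewire(ring_lattice(200, 6), 0.05, random.Random(3))
for name, g in (("lattice", lattice), ("rewired", small_world)):
    print(name, round(average_path_length(g), 2), round(clustering(g), 2))
```

A few shortcuts are enough to shorten d considerably while C remains far above the random-graph value z/n.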

Degree Distributions

A different type of characterization of the statistical properties of a graph is given by the degree distribution P(k). Although the ER graph displays a Poisson distribution, most complex networks are actually characterized by highly heterogeneous distributions: they can be described by a degree distribution P(k) ∼ k^{-γ} φ(k/ξ), where φ(k/ξ) introduces a cut-off at some characteristic scale ξ. Three main classes can be defined [22]: (a) when ξ is very small, P(k) ∼ φ(k/ξ) and thus the link distribution is single-scaled (typically, this would correspond to exponential or Gaussian distributions); (b) as ξ grows, a power law with a sharp cut-off is obtained; (c) for large ξ, scale-free nets are observed. The last two cases have been shown to be widespread and their topological properties have immediate consequences for network robustness and fragility [23,24]. The three previous scenarios are observed in: (a) power grid systems and neural networks [25], (b) protein interaction maps [26], metabolic pathways [27], ecological networks [28-30] and electronic circuits [31] and (c) Internet topology [23,32], scientific collaborations [33] and lexical networks [34].

Redundancy and Degeneracy

Complex biological networks (see below) exhibit an extraordinary homeostasis against random failure of a given unit. Partially based on the experience from technological systems, it was often assumed that such robustness was essentially due to redundancy: the failure of a given unit (such as a gene in a gene network) would be compensated by a copy of it. Nonetheless, the analysis of biological networks reveals that robustness is mainly associated with features that are fairly different from redundancy [35,36]. Instead, it has been shown that, in many cases, totally different components can perform similar functions. This feature is known as degeneracy [36,37]: unlike redundancy, which involves structurally identical elements, degeneracy involves structurally different elements that yield the same or different functions depending on the context. Mounting evidence suggests that it is actually a ubiquitous property of biological nets.

Degeneracy is very common in natural systems [36] but totally unknown within the context of technological evolution. In man-made systems, redundancy is the standard solution to the problem of random failure of single components: by introducing copies of sub-parts of the system, failure of one of them can be compensated by its copy. This was assumed to be the origin of resilience against mutation in genetic networks, although later evidence indicates that the source of homeostasis in cellular nets is a very different one (see below). To a large extent, degeneracy is intimately associated with tinkering in evolution: it reflects the re-use that is made of different parts of a system in order to achieve similar functions.

Modularity

Modularity pervades biological complexity [38]. Many cell functions are carried out by subsets of units that define functionally meaningful entities. An example is provided by modules of genes involved in development [39-41].

Modularity allows the adaptation of different functions with a small amount of interference with other functions and is likely to be a prerequisite for the adaptation of complex organisms, although it arises most likely as a by-product of adaptability rather than being an adaptation itself [42]. Modularity can arise in two ways: by parcellation or by integration. Parcellation consists of the differential elimination of cross-interactions involving different parts of the system. Instead, if the network is originally formed by many independent, disconnected parts, it is conceivable that modularity arises by differential integration of those independent characters serving a common functional role.

3. PROTEOME MAPS AND GENE NETWORKS

Let us start our exploration from molecular cell biology. Complex genomes involve many genes that are associated with at least one regulatory element and each regulatory element integrates the activity of at least two other genes. The nature of such regulation started to be understood from the analysis of small prokaryotic regulation subsystems [43] and the current picture indicates that the webs that shape cellular behavior are very complex, sharing some common traits with neural networks and related computational systems [44].

Gene regulation takes place at different levels and involves the participation of proteins. The whole cellular network includes three levels of integration: (a) the genome (and the regulation pathways defined by interactions among genes), (b) the proteome, defined by the set of proteins and their interactions, and (c) the metabolic network, also under the control of proteins that operate as enzymes. Unlike the relatively unchanging genome, the dynamic proteome changes through time in response to intra- and extracellular environmental signals. The proteome is thus particularly important: proteins unify structural and functional biology. They are both the products of gene activity and regulate reactions or pathways.

A key issue concerning the evolution of cellular nets is raised by their robustness against single-unit failure. The analysis of the effects of mutations in different organisms revealed an extraordinary level of homeostasis: in many cases the total suppression of a given gene in a given organism leads to a small phenotypic effect or even no effect at all [36]. By following the analogy with engineered systems, one might suggest that such robustness comes from the presence of a high degree of redundancy. Under mutation, additional copies of a given gene might compensate for the failure of the other copy. However, the analysis of redundancy in genome data indicated that redundant genes are rapidly lost and that redundancy is not the leading mechanism responsible for mutational robustness [45].

The degree distribution displayed by the protein interaction map is given by a power law [26,35,46], i.e., P(k) ∼ k^{-γ} with γ ≈ 2.5, with a sharp cutoff for large k. The link seems obvious here: the high degree of homeostasis against random failure would come from the highly heterogeneous distribution of interactions. This conjecture is actually supported by comparing the phenotypic effects of mutated genes with their degree: there is a clear positive correlation between degree and phenotypic effects of mutations [26]. Consistent with the scale-free (SF) scenario, mutations involving some key genes can have very important consequences. This is the case, in particular, of the p53 tumor suppressor gene (Figure 3; redrawn from [47]), which is known to play a critical role in genome stability and integrates many different signals related to the cell cycle or apoptosis (programmed cell death) [48]. This and other tumor-suppressor genes prevent cell proliferation (thus keeping cell numbers under control) but can also promote apoptosis.

What can be concluded here? On the one hand, history is an obvious ingredient of genome/proteome evolution: the growth of these nets takes place by gene duplication [49] and simple models based on gene duplication plus rewiring successfully reproduce the observed properties of proteome maps [50-52]. In some sense, tinkering is widely used by starting from previous genes and interactions. The mutational robustness displayed by cellular nets might actually provide the best example of the success of degeneracy in evolution [36]. It is also clear that many genes are recruited at different stages through development and thus re-use of available building blocks is widespread.

The analysis of metabolic pathways reveals that some metabolites that are known to be older are highly connected [53], thus suggesting preferential attachment at least in the early stages of evolution (see also [54]). But models indicate that SF structure might spontaneously emerge provided that the rates of duplication and rewiring are appropriately tuned. In this sense, it might well be that, as a consequence of the duplication process, together with a sparse density of connections (associated with low cost constraints but also with dynamical constraints), the high robustness provided by the scaling structure is actually a byproduct of network growth [50,51] under optimization of communication. In that case, reaching a sparse but connected web of interactions would automatically provide an emergent source of robustness for free.
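A growth rule of the duplication-plus-rewiring kind can be written compactly. The Python sketch below is only a generic illustration of such a rule, not the specific models of [50-52]: at each step a randomly chosen node is copied, each inherited link is lost with probability `delta`, and new links to arbitrary nodes are added with a small probability `beta / n`. The function name, the seed graph and the parameter values are hypothetical.

```python
import random

def grow_by_duplication(n_final, delta, beta, rng):
    """Grow a graph by node duplication, link loss and random link addition."""
    adj = {0: {1}, 1: {0}}          # seed: two linked nodes (an assumption)
    while len(adj) < n_final:
        n = len(adj)
        parent = rng.randrange(n)   # node to be duplicated
        new = n                     # label of the copy
        # the copy inherits each link of its parent, losing it with prob. delta
        inherited = {v for v in adj[parent] if rng.random() >= delta}
        # rewiring: a few new links to arbitrary existing nodes
        extra = {v for v in range(n) if rng.random() < beta / n}
        adj[new] = inherited | extra
        for v in adj[new]:
            adj[v].add(new)
    return adj

net = grow_by_duplication(500, delta=0.5, beta=0.2, rng=random.Random(7))
degrees = sorted((len(nb) for nb in net.values()), reverse=True)
print("mean degree:", sum(degrees) / len(degrees))
print("five largest degrees:", degrees[:5])
```

Comparing the largest degrees with the mean gives a quick impression of how heterogeneous the resulting wiring is for a given choice of `delta` and `beta`.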

A final point to be mentioned is the fact that the need for an integrated cellular order in the adult organism must coexist with some degree of flexibility associated with the developmental program. The building of a multicellular organism requires a control program that is exemplified by tumor-suppressor genes (such as p53). But this program has to integrate some weaknesses in order to allow for embryogenesis, growth and wound healing [55]. Perhaps not surprisingly, the homologue of p53 in invertebrates (such as Drosophila) is expressed throughout development, particularly in early stages [56]. It might be the case that one of the side effects of reaching a complex network with a high degree of homeostasis from a sparse, well-communicated graph of interactions is that, inevitably, sooner or later the node that fails is a key one (and cancer develops). Under these conditions, cellular functions are irreversibly altered and the cellular context (which imposes some control on cell states) is no longer a constraint.

4. ECOLOGICAL NETWORKS

The effects of biodiversity loss are highly mediated by the architecture of ecological interactions. Ecosystems (defined broadly as the set of interacting species within a delimited habitat) are likely to be rivaled in complexity only by neural systems or the global economy [57]. Despite such complexity, early studies have found some interesting topological patterns that seem to be related to their stability and functionality (see [58] for a review). In this sense, recent characterization of complex topologies in several ecosystems may provide an amazing framework for predicting


FIGURE 3

The p53 network (redrawn from Kohn [47]). Here a subgraph embedded in a much larger network (the interaction map of the mammalian cell cycle control and DNA repair systems) is shown. The activation of this network (due to different types of stresses) leads to the stimulation of enzymatic activities that stimulate p53 and its negative regulator, MDM2.


how perturbations might propagate throughout ecosystems.

It is clear that tinkering plays a leading role in shaping local biotas. Local assemblages comprise only some of the variety of species present on the Earth (between 10 and 100 million species), forming a myriad of ecosystems where species are connected via different interaction types (e.g., trophic, competitive, or mutualistic). Thus, the ecosystems' tinkerer works only with available species from the regional species pool, and the sequence of species invasions partially determines the composition of the resulting communities. For instance, some field experiments have shown that priority effects take precedence over competition, that is, if one species arrives before another that is a better competitor, the former will persist in the community, and the latter will not become established.

The food web is the basic description of ecosystems, where nodes are species (or sets of species with similar diets) and links are feeding relationships. Trophic structure influences the performance of several ecosystem functions (e.g., productivity, nutrient capture, and cycling) and determines the effects of the propagation of disturbances through the entire system. In particular, the effect of one species on the density of another tends to diminish with their separation in the food web, measured by the shortest path connecting them. Typically, if species A and species B are more than 3 links apart, a disturbance in the density of A does not influence B [59]. But most species within communities are separated by only 2 or 3 links from each other [60,61], so perturbations might propagate through the entire network. In most of these food webs the clustering is clearly higher than expected for a random graph [61], providing evidence of SW patterns despite their relatively small size (fewer than 200 nodes).

Food webs exhibit a hierarchy of discrete trophic levels in such a way that a species within a trophic level only feeds on species belonging to the immediately lower level (e.g., carnivores feed on herbivores that eat plants). If this hierarchy were strict, no clustering would be observed. Thus, the origin of clustering in this type of system is the presence of omnivorous species, that is, species that feed on more than one trophic level, including intra-trophic level predation. Omnivory also contributes to shortening paths between species, and it has been shown both empirically and theoretically that omnivory tends to increase the stability of ecosystems by reducing population fluctuations [62].
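The two structural measures used here, shortest-path separation and clustering, can be sketched on a toy example. The web below is hypothetical: the species names and links are illustrative assumptions, not data from the cited studies. It only shows that a strictly layered web has zero clustering, and that one omnivorous link can both create a triangle and shorten a path.

```python
from collections import deque
from itertools import combinations

# Hypothetical strictly layered web: plants -> herbivores -> carnivores.
LAYERED_LINKS = [
    ("grass", "rabbit"), ("grass", "mouse"), ("shrub", "rabbit"),
    ("rabbit", "fox"), ("mouse", "fox"), ("mouse", "owl"),
]
# Hypothetical omnivorous link: the fox also feeds on the shrub.
OMNIVORE_LINKS = [("shrub", "fox")]


def build_graph(links):
    """Undirected adjacency sets built from a list of feeding links."""
    graph = {}
    for a, b in links:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def distances_from(graph, source):
    """Breadth-first search: number of links from source to each reachable species."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in graph[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist


def local_clustering(graph, node):
    """Fraction of pairs of neighbours of node that are themselves linked."""
    neighbours = sorted(graph[node])
    k = len(neighbours)
    if k < 2:
        return 0.0
    linked = sum(1 for u, v in combinations(neighbours, 2) if v in graph[u])
    return 2.0 * linked / (k * (k - 1))


def mean_clustering(graph):
    """Average of the local clustering coefficient over all species."""
    return sum(local_clustering(graph, n) for n in graph) / len(graph)


layered = build_graph(LAYERED_LINKS)
with_omnivory = build_graph(LAYERED_LINKS + OMNIVORE_LINKS)

print("shrub-owl distance:",
      distances_from(layered, "shrub")["owl"],
      "->", distances_from(with_omnivory, "shrub")["owl"])
print("mean clustering:",
      mean_clustering(layered), "->", round(mean_clustering(with_omnivory), 3))
```

In this sketch the layered web contains no triangles, so its clustering is exactly zero. The single omnivorous link closes the shrub-rabbit-fox triangle and gives the shrub a shorter route to the owl.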

Ecological networks also display heterogeneous degree distributions (Figure 4), fitting in most cases a truncated power law $P(k) \sim k^{-\gamma} \phi(k/\xi)$, where $\phi(k/\xi)$ introduces a sharp cutoff. In particular, species-rich food webs [28,29,61,63,64] and most plant-animal mutualistic networks [30] display this type of distribution, with $\gamma \approx 1.0$. This exponent is different from those observed in cellular, social and technological networks ($\gamma \approx 2.1$–$3.0$). Both features (sharp cutoffs and low $\gamma$) suggest the existence of constraints in the assembly of ecological networks, such as limits to the addition of new links due to different phenological attributes of species [30], or dynamical constraints related to the persistence of prey (a prey with many connections is more affected by changes in the food web) or to the efficiency of predators (the more connected a predator is, the less efficient it will be) [65]. These constraints add to cost-related constraints [22] in limiting the prevalence of preferential attachment as the mechanism underlying ecological organization. In contrast, these constraints do not seem to operate in small food webs, where degree distributions are closer to those of random wiring, with typical Poisson distributions [29,64].
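To make the effect of a sharp cutoff concrete, the sketch below compares a pure power law with a truncated one. The text only states that the cutoff is sharp, so the exponential form exp(-k/xi) and the values gamma = 1, xi = 20 and k_max = 200 are assumptions chosen for illustration.

```python
import math


def pure_power_law(gamma, k_max):
    """Normalized P(k) proportional to k**(-gamma) for k = 1..k_max."""
    weights = [k ** (-gamma) for k in range(1, k_max + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def truncated_power_law(gamma, xi, k_max):
    """Normalized P(k) proportional to k**(-gamma) * exp(-k / xi).

    The exponential cutoff is an assumed form for the cutoff function.
    """
    weights = [k ** (-gamma) * math.exp(-k / xi) for k in range(1, k_max + 1)]
    total = sum(weights)
    return [w / total for w in weights]


GAMMA, XI, K_MAX = 1.0, 20.0, 200  # illustrative values only

pure = pure_power_law(GAMMA, K_MAX)
truncated = truncated_power_law(GAMMA, XI, K_MAX)

# Index k - 1 holds P(k). Compare the weight of highly connected nodes.
for k in (1, 10, 100):
    print(k, f"{pure[k - 1]:.2e}", f"{truncated[k - 1]:.2e}")
```

Under these assumptions the two distributions are similar for small k, while the truncated one assigns far less probability to very large degrees. This is the sense in which a cutoff limits the prevalence of hubs.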

The effects of biodiversity loss are highly mediated by the architecture of ecological interactions.

FIGURE 4

The graph of species interactions of the Silwood Park web. Here each node represents one species. The central node in this representation is the Scotch broom Cytisus scoparius.

A deep understanding of the architecture of ecological networks allows us to identify keystone species within ecosystems, that is, species whose removal triggers many secondary extinctions and the fragmentation of the network into disconnected subwebs in which species are more prone to extinction. Keystone species are typically the most-connected ones, and food webs with skewed degree distributions are predicted to exhibit an extreme fragility under their successive loss, in contrast to the high robustness when species are removed randomly [28,29]. Also, functions performed by the ecosystems are more likely to be lost after the removal of such keystones, although in some cases some less-connected species perform key specific functions, so their removal may also have large effects on ecosystem functionality (e.g., the loss of plants that fix atmospheric nitrogen). A key question not yet resolved is how this fragility-robustness duality is affected when dynamics is introduced.
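The contrast between removing the most-connected species and removing species at random can be sketched on a deliberately extreme toy web. The web is hypothetical: two hub species, each linked to ten otherwise unconnected species, with names invented for illustration. The size of the largest connected subweb is used here as a simple stand-in for fragmentation; it is not the measure used in [28,29].

```python
import random


def make_toy_web(n_hubs=2, leaves_per_hub=10):
    """Hypothetical web: each hub is linked to its own leaf species and to the next hub."""
    graph = {}

    def link(a, b):
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)

    hubs = [f"hub{i}" for i in range(n_hubs)]
    for i, hub in enumerate(hubs):
        for j in range(leaves_per_hub):
            link(hub, f"sp{i}_{j}")
    for a, b in zip(hubs, hubs[1:]):
        link(a, b)
    return graph


def largest_component(graph, removed):
    """Size of the largest connected subweb once the removed species are deleted."""
    removed = set(removed)
    seen = set()
    best = 0
    for start in graph:
        if start in removed or start in seen:
            continue
        seen.add(start)
        stack = [start]
        size = 0
        while stack:
            node = stack.pop()
            size += 1
            for neighbour in graph[node]:
                if neighbour not in removed and neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        best = max(best, size)
    return best


def most_connected(graph, n):
    """The n species with the highest degree (targeted removal)."""
    return sorted(graph, key=lambda v: len(graph[v]), reverse=True)[:n]


def random_species(graph, n, seed=0):
    """n species drawn uniformly at random (random removal)."""
    return random.Random(seed).sample(sorted(graph), n)


web = make_toy_web()
after_attack = largest_component(web, most_connected(web, 2))
after_random = largest_component(web, random_species(web, 2))
print("intact:", largest_component(web, []))
print("after targeted removal:", after_attack)
print("after random removal:", after_random)
```

In this toy web, deleting the two hubs leaves only isolated species, whereas deleting two low-degree species barely changes the largest subweb. Real food webs are less extreme, and, as noted above, the effect of dynamics is not captured by a static count like this one.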

Species play specific roles in ecosystems, both affecting population dynamics (e.g., a carnivore controlling the density of several herbivores, thus guaranteeing diverse plant assemblages), and enhancing several ecosystem functions (e.g., biomass, energy use, or nutrient retention). Can the roles of eliminated species be replaced by other species, and if so, by what kind of mechanisms? Removal experiments and simulations have shown the presence of some degree of adaptability, depending on the degree of redundancy of the manipulated ecosystem. In particular, more redundancy implies the maintenance of some ecosystem functions [66–69] and reduces the risk of cascading extinctions after random extirpations [70].

What is called redundancy in the ecological literature is what we have previously defined as degeneracy, because “replacement” species and removed species always have different traits. In fact, evolution tends to reduce redundancy levels (i.e., species with identical traits) while increasing degeneracy by promoting species that are complementary and overlapping in their resource requirements but different in their environmental tolerances [71,72]. But often this mechanism of compensation is observed with a large delay or simply does not happen, particularly when keystone species are lost (e.g., [73]).

Thus, it is clear that a topological (static) approximation is a feasible first step in the understanding of community homeostasis. However, is such homeostasis a side effect of an optimization of any function of the ecosystem? Ecosystems perform several functions that evolve along their spatio-temporal organization. Nevertheless, what functions, if any, are optimized is far from clear. Some authors have suggested that ecosystems are likely to maximize their efficiency in transferring energy and materials. But these approximations are highly speculative and have several caveats. For instance, some mechanisms promoting community homeostasis imply a decrease in efficiency, such as the presence of omnivorous species, which are typically inefficient in consuming their prey. However, simple optimization mechanisms observed in other networks are an interesting area for exploring the relationships between maximization of some ecosystem functions and homeostasis.

5. LEXICAL NETWORKS

The emergence of human language is one of the major transitions in evolution [74]. We humans possess a unique symbolic mind capable of language which is not shared by any other species [75]. Human language allows the construction of a virtually infinite range of combinations (i.e., sentences) from a limited set of basic units. The process of sentence generation is astonishingly rapid and robust and indicates that we are able to rapidly gather words to form sentences in a highly reliable fashion.

The study of lexical networks (networks in which nodes are words and links are formed between strongly correlated words; see Figure 5) [34] and other linguistic networks [76,77] has shown that scaling is a strong regularity in human language. Links in lexical networks capture syntactic relationships between pairs of words. The degree distribution follows $P(k) \sim k^{-\gamma}$ with $\gamma \approx 3$. Hubs in lexical networks are function words (e.g., prepositions, articles, and determiners). Hubs, and thus function words, are crucial for the small-worldness of the lexical network. Function words constitute the most stable set of words of a language over time. Their high connectivity explains why they are less extinction prone, in the same way omnivorous species are in ecological networks.

New function words are neither created from scratch nor borrowed from other languages. Function words result from grammaticalization processes [78] in which nonfunction words become function words and function words become more grammatical [79]. Languages largely result from tinkering: prepositions typically derive from terms for body parts or verbs of motion, while modals typically derive from terms

FIGURE 5

The topology of Moby Dick. Two words appearing in Melville’s book are linked if their mutual information is greater than a given threshold.


of possession or desire [78]. Therefore, hub formation in lexical network evolution operates on a restricted set of candidate words. Every language creates new functions from such a restricted set, so convergence is likely not only regarding the result of the grammaticalization process but also regarding its starting point.

The fact that the degree (i.e., number of links) of a word is positively correlated with its frequency [34] and that the etymologically oldest words are among the most frequent ones [80,81] may suggest that preferential attachment is at play. Nonetheless, there are many reasons for thinking that minimization of word-word distance is involved. On the one hand, speaking is a complex task. The speaker must combine a large number of words (on the order of many thousands [82]) to form sentences. Following links in a lexical network leads to syntactically well-formed sentences. If two words are to be linked during speech production, the smaller the distance between them, the smaller the number of intermediate words required for performing the linkage. Average word-word distance in lexical networks is about $d \approx 2.6$, indicating that most words are reachable through fewer than two intermediate words [34].

On the other hand, it has been proposed that optimizing the small-worldness of a network under linking cost constraints may be the origin of the $\gamma \approx 3$ scaling exponent [83]. Small-world behavior is a desirable property of a linguistic network, and linking restrictions are likely to be at play. If linking cost is not taken into account, the optimal configuration would be a clique (every word connected to every word; Figure 6A), which would imply that all words are function words but none of them is a lexical word (which can be seen as having linkers but no words to link). Linking cost may not be the only restriction at play. A star network (every vertex connected to the same vertex; Figure 6B) provides the minimum distance possible at the minimum linking expense. In this case, there are words to link, but only one linker. Nonetheless, real languages have a rich repertoire of linkers in order to account for different types of relations (e.g., part-whole, action-receiver).
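The trade-off between linking cost and distance in the two extreme configurations can be checked directly. The sketch below builds a clique and a star on the same number of vertices and computes the number of links and the mean vertex-vertex distance by breadth-first search; the size n = 10 is an arbitrary illustrative choice.

```python
from collections import deque


def clique(n):
    """Every vertex connected to every other vertex."""
    return {v: {u for u in range(n) if u != v} for v in range(n)}


def star(n):
    """Vertex 0 is the single hub; all other vertices connect only to it."""
    graph = {0: set(range(1, n))}
    for leaf in range(1, n):
        graph[leaf] = {0}
    return graph


def link_count(graph):
    """Number of undirected links (each link is stored twice in the adjacency sets)."""
    return sum(len(neighbours) for neighbours in graph.values()) // 2


def mean_distance(graph):
    """Mean shortest-path length over all vertex pairs of a connected graph."""
    total = 0
    pairs = 0
    for source in graph:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for neighbour in graph[node]:
                if neighbour not in dist:
                    dist[neighbour] = dist[node] + 1
                    queue.append(neighbour)
        for target, d in dist.items():
            if target > source:  # count each unordered pair once
                total += d
                pairs += 1
    return total / pairs


N = 10  # illustrative size
for name, graph in (("clique", clique(N)), ("star", star(N))):
    print(name, "links:", link_count(graph), "mean distance:", mean_distance(graph))
```

For n vertices the clique needs n(n-1)/2 links to reach a mean distance of 1, while the star needs only n-1 links for a mean distance of 2 - 2/n. The star is therefore almost as short as the clique at a small fraction of the linking cost, but with a single linker.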

The finding of a $\gamma \approx 3$ exponent helps us to understand why syntax is a robust trait of human communication. The degree of expression of such a feature is not correlated with intelligence and therefore, not surprisingly, it is present even when intellectual skills are extremely poor, which is the case of the idiots savants [75]. According to previous work on error tolerance of scale-free networks [23,24], a lexical network will be very robust against random removal of words but fragile against removal of the most connected vertices. Agrammatism, a kind of aphasia in which many of the most connected words seem to have disappeared, is characterized by a decrease in the ability to build syntactically complex sentences [84]. Unless the words that glue words together for building complex sentences are removed (the “short-pathers”), a complex phrase (e.g., a circumlocution) can replace a missing word and the expressive power will be maintained.

6. TECHNOLOGY GRAPHS

Social and economic complexity are organized around three major networks: the transportation network, the power network, and the communication network [2]. These are also three components of biological complexity, involving the processing of information, energy and matter at very different scales. Artificial networks offer an invaluable reference when dealing with the rules that underlie the building process in complex systems. The Internet, in particular, powerfully exemplifies the importance of topology and homeostasis in SF nets [23,24] and how it relates to biological complexity. The spread of computer viruses, for example, is closely related to the epidemic patterns displayed by natural viruses [85]. Not surprisingly, man-made artifacts have been used as a metaphor of biological complexity at different levels. The brain, for example, has been compared to mechanical engines, the telegraph and telephone webs, or to different computer architectures, as technological changes followed each other through time. These metaphors are often flawed by the lack of a real mapping between both systems (there is little overlap between the telegraph network and cortical maps beyond the presence of a wiring diagram), but sometimes the relationships are surprisingly strong [86].

FIGURE 6

Opposite ways of achieving small-worldness. (A) The maximum linking expense (a clique). (B) The minimum linking expense (a star graph).

Technological graphs result from predefined planning. As noted by Jacques Monod: “the difference between artificial and natural objects appears immediate and unambiguous to all of us. A rock, a mountain, a river, or a cloud: these are natural objects; a knife, a handkerchief, a car are artificial objects, artifacts” [4]. But technological design is not completely free of the constraints imposed by complexity [87]. On the one hand, conflicting constraints effectively limit the reaching of optimal solutions: the process of design is itself a search process on a rugged landscape with many implicit variables.

The study of electronic circuits reveals that the small-world pattern is present in technological artifacts [31]. Modularity in these systems is not only a direct consequence of the need for different sub-parts performing different (but complementary) functions but also the result of needing to reuse existing circuits. The engineer avoids designing a new large circuit from scratch and is encouraged to work as a tinkerer when the size of the circuit crosses a certain threshold. Optimization is present at different levels of the circuit design [31]. For instance, minimization of both average path length and physical distance are present, and this can easily lead to SW structure [88]. Interestingly, the larger circuits clearly exhibit power laws in their degree distributions [31].

A very important class of networks, derived from software architecture maps [89], has recently been shown to display both SW and SF patterns as the nonplanned result of a design optimization process [90]. Software architecture is reflected by a diagram of the software components and their interrelations, which is produced during the design process [91]. Sometimes the diagram is not explicitly provided, but it is possible to reconstruct it from source code (reverse engineering). This map can be interpreted as a graph where nodes are software components and links are relationships between software components.

A large effort has been dedicated to understanding the nature of software and why efficiently building and maintaining large software systems is so difficult (and costly). For years, the software community has been promoting the need for software measurement tools that help to quantify whether a software project is being developed “well” and to control deviations from the established engineering plan [92]. Early measures of software were centered on intra-module aspects like program length or number of lines of code (LOC). Recently, there has been a growing interest in analyzing software structure, or software architecture measurement (inter-module). The modern software engineer conceives software systems at a higher conceptual level and is more concerned about how the different components are assembled and interrelated [93].

A closer look at different software architecture maps indicates that there is a noticeable difference between the architecture of small-scale software (e.g., a Gauss-Seidel linear system solver, or sorting an arbitrary sequence of numbers) and the architecture of large-scale software (e.g., an operating system or a modern videogame). In both cases, different conflicting constraints (economization of memory storage, efficiency of processes, ease of integration of changes and new features) must be satisfied simultaneously while developing software. For small-scale software it is possible to produce very optimized structures because the constraints are not too hard to satisfy. The architecture of small-scale software abounds in hierarchies and simple connectivity patterns (Figure 7). The degree distribution is Poissonian.

Large-scale software shows clustering, low mean distance between software components (about six) and a scale-free degree distribution, with the exponent ranging from 2 to 3 [90]. For large-scale software the constraints are much harder and history plays an important role. In order to reduce the complexity of software development, the system is partitioned into modules that group similar or related software components. Usually, the modules are developed separately. One such module roughly corresponds to a single connected component in the software architecture map, but other modules could have more than one component. Even a single connected component can span several modules. Briefly, the modules are a logical partition of the software architecture that may or may not coincide with the set of connected components.

There are a few components with a huge number of links (hubs) in most software architectures. Surprisingly, these hubs are considered bad practice by software engineering principles [94]. The existence of hubs is an indication that only a sub-optimal solution can be reached. Sometimes the cost of introducing new components is higher than

FIGURE 7

An example of a medium-sized component of a software graph (extracted from the Java Development Kit 1.2 library). Here each node is a Java class and edges indicate relationships among classes.


adapting an existing one to provide the required functionality. As the number of dependencies of the software component grows, it progressively loses its intended original meaning, assuming more and more responsibilities from other system components. This class of components limits reuse and is expensive in terms of memory storage. Also, they tend to be strongly affected by changes made to other components of the system.

There is a certain degree of redundancy in large-scale software. Because different parts of the system are developed in parallel by a team of software engineers, there is a chance that the same subproblem will be solved twice. This is also known as “reinventing the wheel” and it is considered one of the main causes of productivity loss. The duplication of software functions is not recommended by software engineering because of the increased effort required to fix errors and to extend the system functionality. In fact, engineers are encouraged to seek and locate duplicated portions of code and substitute them with a single software component. Moreover, good software engineering practice promotes the reuse of large portions of code (or better, entire software components) not only within a software system but from project to project [91]. In any case, it seems that such traditional calls for reuse are very difficult to realize effectively, and the way to achieve it is still being pursued by modern software engineering.

7. DISCUSSION

The heterogeneous character of most complex biological networks reveals a surprising example of convergence. In evolutionary theory, convergence refers (within the ecological context) to the observation that organisms living in similar habitats resemble each other in outward appearance. These similar-looking organisms may, however, have quite different evolutionary origins. Convergent evolution takes place at very different levels, from organisms to molecules, and here we propose the idea that a new type of convergent evolutionary dynamics might be at work underlying a very wide class of both natural and artificial systems.

Since very different systems seem to choose the same basic formula for their interaction maps, we can easily recognize a general trend that can be identified as a process of convergent evolution (see Table 1). One particularly important point is the fact that similar network topologies (particularly the scale-free ones) emerge in biological and human-made systems. Although the former take advantage of the high homeostasis provided by scaling, the latter are completely dependent on the correct functioning of all units. Failure of a single diode in a circuit or of a single component in a software system leads to system collapse. Thus, homeostasis cannot be a general explanation for scaling.

The apparently universal character of these scaling laws in such disparate complex networks goes beyond homeostasis. We conjecture that the leading force here is an optimization process where reliable communication at low cost shapes network architecture in the first place. This seems to be the case in all the systems analyzed above. Once a small, critical average connectivity is reached, the graph percolates, providing a spontaneous order linked to global communication. This occurs at a low cost, since the transition is sudden and effectively connects all parts of the system with a small number of links per unit. There are two possible strategies for decreasing vertex-vertex distance at the percolation point: (a) increasing the average connectivity and (b) hub formation (Figure 6). Strategy (a) is a trivial one whose outcome under ideal circumstances is a clique (Figure 6A). Strategy (b) has the advantage of not requiring the addition of new connections; link rearrangements suffice. However, (b) is a more complex task than (a). The outcome of (b) is a star network (Figure 6B). Scale-free distributions suggest the use of strategy (b) in real networks. If reaching a low density of links is more important than having a small average path length, skewed distributions, including scale-free nets, are easily obtained [83] through random search. Nonetheless, other mechanisms, such as preferential attachment [25,95], have been proposed for scaling. Unfortunately, preferential attachment cannot straightforwardly explain the high clustering coefficient of real networks. The clustering coefficient C is a measure of cliquishness (more precisely, the abundance of cliques of 3 vertices). Therefore, a high clustering coefficient can be seen as one side of the optimization process. Natural and artificial networks show that the larger the network, the higher the significance of the clustering coefficient and the more power-law-like the degree distribution [31,64]. The coexistence of strategies (a) and (b) can be understood as the explosion of conflicting constraints once the network size exceeds a certain threshold value. The fact that even engineers become tinkerers in large systems illustrates how complicated the achievement of optimal structures is once they reach some complexity level. Clustering is likely to be unavoidable for small-worldness in networks in which hub formation becomes a dramatically complicated task. In this context modularity is an obvious source of clustering. We have suggested that homeostasis is a consequence of a more general principle. Actually, it can be a side effect of optimization and not a direct consequence of functional parcellation in large networks.

TABLE 1

Summary of the Basic Features that Relate and Distinguish Different Types of Complex Networks, Both Natural and Artificial

Property | Proteomics | Ecology | Language | Technology
Tinkering | Gene duplication and recruitment | Local assemblages from regional species pools and priority effects | Creation of words from already established ones | Reutilization of modules and components
Hubs | Cellular signaling genes (e.g., p53) | Omnivorous and most abundant species | Function words | Most used components
What can be optimized? | Communication speed and linking cost | Unclear | Communication speed with restrictions | Minimize development effort within constraints
Failures | Small phenotypic effect of random mutations | Loss of only a few species-specific functions | Maintenance of expression and communication | Loss of functionality
Attacks | Large alterations of cell-cycle and apoptosis (e.g., cancer) | Many coextinctions and loss of several ecosystem functions | Agrammatism (i.e., great difficulties for building complex sentences) | Avalanches of changes and large development costs
Redundancy (R) and degeneracy (D) | Redundant genes rapidly lost | R minimized and D restricted to non-keystone species | Great D | Certain degree of R but no D

Here different characteristic features of complex nets, as well as their behavior under different sources of perturbation, are considered.

The scale-free distributions observed in both natural and artificial graphs suggest that the homeostasis gained by the former might well be a result of exploiting the SF topology resulting from optimization [90]. This is something not (yet) exploited in current engineered systems, probably due to lack of degeneracy. Since degeneracy is a common feature of biological nets, it might have been exploited (or have co-evolved) within heterogeneous architectures. We conjecture that there is a largely universal principle that pervades the evolution of scale-free nets (optimal communication) and that the observed topological features of bionets reflect this feature together with constraints arising from other causes, such as the need for modular organization. It is interesting to see that man-made designs also evolve toward webs that strongly resemble their biological counterparts: as shown by the previous examples, often the paths toward optimization seem to cross the land of tinkering.

Acknowledgments

The authors thank William Parcher and the members of the Complex Systems Lab for useful comments. This work has been supported by a grant PB97-0693, the Santa Fe Institute (R.V.S.) and grants from the Generalitat de Catalunya (FI/2000-0393, R.F.C.), the Comunidad de Madrid (FPI-4506/2000, J.M.M.).

REFERENCES

1. Standage, T. The Victorian Internet; Walker and Co.: New York, 1998.
2. Inose, H. Communication networks. Sci Am 1972, 3, 117–128.
3. Kauffman, S.A. Origins of Order; Oxford: New York, 1993.
4. Monod, J. Le hasard et la necessite, Editions du Seuil: Paris, 1970.
5. Jacob, F. Evolution as tinkering. Science 1977, 196, 1161–1166.
6. Gould, S.J. Wonderful Life; Penguin: London, 1989.
7. Conway, M.S. The Crucible of Creation. Oxford University Press: New York, 1998.
8. Gould, S.J. The Structure of Evolutionary Theory; Belknap Press: Cambridge, MA, 2002.
9. Brown, J.H.; West, G.B. Scaling in Biology. Oxford University Press: New York, 2000.
10. Alberch, P. Developmental constraints in evolutionary processes. In: Evolution and Development, J.T. Bonner, ed. (Berlin, Springer-Verlag) 1982, p 313–332.
11. Goodwin, B.C. How the Leopard Changed Its Spots: the Evolution of Complexity. Charles Scribner’s Sons: New York, 1994.
12. Bollobas, B. Random Graphs. Academic Press: London, 1985.
13. Newman, M.E.J. Random Graphs as Models of Networks. Santa Fe Institute Working Paper 02-02-005, 2002.
14. May, R.M. Stability and Complexity in Model Ecosystems; Princeton University Press: Princeton, 1976.
15. Cohen, J.E. Food webs and niche space. Monographs in Population Biology 11, Princeton University Press, 1978.
16. Kauffman, S.A. Metabolic stability and epigenesis in randomly connected nets. J Theor Biol 1962, 22, 437–467.
17. Hertz, J.; Krogh, A.; Palmer, R.G. Introduction to the Theory of Neural Computation. Addison-Wesley: Reading, 1991.
18. Sompolinsky, H.; Crisanti, A.; Sommers, H.J. Chaos in random neural networks. Phys Rev Lett 1988, 61, 259–262.
19. Watts, D.J.; Strogatz, S.H. Collective Dynamics in ‘small-world’ networks. Nature (Lond.) 1998, 393, 440–442.
20. Newman, M.E.J. Models of the small world. J Stat Phys 2000, 101, 819–841.
21. Dorogovtsev, S.N.; Mendes, J.F.F. Evolution of Random Networks. Adv Phys 2002, 51, 1079–1187.
22. Amaral, L.A.N.; Scala, A.; Barthelemy, M.; Stanley, H.E. Classes of behavior of small-world networks. Proc Nat Acad Sc USA 2000, 97, 11149–11152.
23. Albert, R.; Jeong, H.; Barabasi, A.-L. Error and attack tolerance of complex networks. Nature 2000, 406, 542.
24. Albert, R.; Jeong, H.; Barabasi, A.-L. Correction: Error and attack tolerance of complex networks. Nature 2000, 409, 378–382.
25. Albert, R.; Barabasi, A.-L. Statistical mechanics of complex networks. Rev Mod Phys 2001, 74, 47–97.
26. Jeong, H.; Mason, S.; Barabasi, A.L.; Oltvai, Z.N. Lethality and centrality in protein networks. Nature 2001, 411, 41.
27. Jeong, H.; Tombor, B.; Albert, R.; Oltvai, Z.N.; Barabasi, A.-L. The large-scale organization of metabolic networks. Nature 2000, 407, 651–654.




4.2 Information Theory of Complex Networks: On Evolution and Architectural Constraints

Information Theory of Complex Networks: On Evolution and Architectural Constraints

Ricard V. Sole and Sergi Valverde

1 Complex Systems Lab-ICREA, Universitat Pompeu Fabra (GRIB), Dr Aiguader 80, 08003 Barcelona, Spain

2 Santa Fe Institute, 1399 Hyde Park Road, Santa Fe NM 87501, USA

Abstract. Complex networks are characterized by highly heterogeneous distributions of links, which often underlie key properties such as robustness under node removal. Several correlation measures have been defined in order to characterize the structure of these nets. Here we show that mutual information, noise and joint entropies can be properly defined on a static graph. These measures are computed for a number of real networks and analytically estimated for some simple standard models. It is shown that real networks are clustered in a well-defined domain of the entropy-noise space. By using simulated annealing optimization, it is shown that optimally heterogeneous nets actually cluster around the same narrow domain, suggesting that strong constraints actually operate on the possible universe of complex networks. The evolutionary implications are discussed.

1 Introduction

Many complex systems are to some extent describable by the network of interactions among their components. Beyond the specific features displayed by each net, it has been shown that a number of widespread properties are common to most of them. One is the presence of the small-world phenomenon, and the second is the observation that in many cases they are highly heterogeneous in their connectivity patterns [1-4].

Heterogeneity can be easily identified by looking at the so-called degree distribution $P_k$, which gives the probability of having a node with $k$ links. Most complex networks (both natural and artificial) can be described by a degree distribution $P_k \sim k^{-\gamma}\phi(k/\xi)$, where $\phi(k/\xi)$ introduces a cut-off at some characteristic scale $\xi$ [5]. An example of such scale-free networks is provided by the architecture of digital electronic circuits (Fig. 1). It has been shown [6] that these systems exhibit long-tailed distributions of links, where the nodes are electronic components and the links are physical wires between units. Most elements are connected to a few others (for circuits this usually means nearest neighbors) but some are connected to many others. In Fig. 2 several examples of the observed distributions for both analog (a-b) and digital (c-d) systems are shown. Although analog systems are closer to an exponential distribution (i.e. $\xi$ small), digital, large-scale systems$^3$ exhibit scaling behavior, with $\gamma \sim 3$.

$^3$ Similar results have been obtained by looking at large-scale systems, such as VLSI networks. See for example: http://citeseer.nj.nec.com/450707.html

R.V. Sole and S. Valverde, Information Theory of Complex Networks: On Evolution and Architectural Constraints, Lect. Notes Phys. 650, 189–207 (2004). http://www.springerlink.com/ © Springer-Verlag Berlin Heidelberg 2004

Fig. 1. Heterogeneity is a widespread feature of most (but not all) complex networks. One example from technology graphs is provided by electronic circuits (upper plot), which have been shown to display scale-free distributions of links.

Fig. 2. Cumulative degree distributions for several examples of analog (a-b) and digital (c-d) circuits. Although the analog systems are less heterogeneous, digital circuits (particularly large systems) display scaling in their degree distributions. Left plots are in linear-log scale and right plots are in log-log scale; the vertical axis is the cumulative distribution and the horizontal axis is the degree $k$. The values −2.05 and −2.1 are annotated on the plots.


Scale-free nets have been shown to be obtainable through a number of mechanisms, including preferential attachment [2,3,7,8], optimization [9,10], duplication and divergence [11,12], fitness-dependent, rich-gets-richer mechanisms [13] and the "copying" model [14]. Beyond the common qualitative architecture shared by these systems, the dynamical patterns that take place on top of these webs, and their time scales, differ from system to system, although in one way or another they all deal with information propagation and/or processing. Moreover, the response to node removal differs from system to system. Although genetic and metabolic networks seem to be fairly robust against perturbations of different types, a totally different situation arises in electronic circuits. In biological nets, failure of highly connected components will typically end in system failure (for example, at the cellular level), but failure (by mutation or transient change) of a gene is often buffered by the rest of the system. This is not the case for electronic circuits and, to a similar extent, for software networks: failure of any component typically leads to system failure, no matter how highly connected the given unit is.

Several quantitative measures can be used in order to characterize a given network. The first step is to define an appropriate representation in terms of a graph $\Omega$, defined by a pair $\Omega = (W, E)$, where $W = \{s_i\}$ $(i = 1, ..., N)$ is the set of $N$ nodes (species, proteins, neurons, etc.) and $E = \{\{s_i, s_j\}\}$ is the set of edges/connections between nodes. The adjacency matrix $\xi_{ij}$ indicates that an interaction exists between two nodes $s_i, s_j \in \Omega$ ($\xi_{ij} = 1$) or that the interaction is absent ($\xi_{ij} = 0$). Several statistical properties, such as average degree, clustering or diameter, can be defined from the adjacency matrix.
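As an illustration of this representation, the following minimal sketch builds the adjacency matrix of a small undirected graph and derives the degree distribution $P_k$ and the average degree from it. The five-node edge list and the function names are hypothetical, chosen only for this example.

```python
from collections import Counter

def adjacency_matrix(n_nodes, edges):
    """Symmetric 0/1 adjacency matrix of an undirected simple graph."""
    xi = [[0] * n_nodes for _ in range(n_nodes)]
    for i, j in edges:
        xi[i][j] = 1
        xi[j][i] = 1
    return xi

def degree_distribution(xi):
    """Return {k: P_k}, the fraction of nodes having exactly k links."""
    degrees = [sum(row) for row in xi]
    counts = Counter(degrees)
    return {k: c / len(degrees) for k, c in sorted(counts.items())}

def average_degree(xi):
    """Mean number of links per node, <k>."""
    return sum(sum(row) for row in xi) / len(xi)

# Hypothetical example: node 0 is a small hub, node 3 bridges to node 4.
edges = [(0, 1), (0, 2), (0, 3), (3, 4)]
xi = adjacency_matrix(5, edges)
P = degree_distribution(xi)
```

Other averages mentioned above, such as clustering or diameter, can be derived from the same matrix but are left out of this sketch.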

But the universe of possible networks (Fig. 3), although not arbitrarily diverse, displays a number of structural variations that cannot be compressed by the previous average quantities. Real networks are not only typically heterogeneous, but they also involve other types of features, such as hierarchical organization [15].

In Fig. 3 we qualitatively summarize the basic types of network organization by using a generic, qualitative parameter space. Here heterogeneity, modularity and randomness define three axes. Assuming that these three parameters can be properly defined, different real and model graphs can be placed at different locations. The current knowledge of network architecture in many different systems strongly indicates that the domain of random networks with long-tailed degree distributions and some amount of modular structure is rather densely occupied. Although the evolutionary processes leading to these different systems are rather diverse, it is interesting to see that there is a strong convergence towards this type of architecture. Here we will explore this problem by using information-based statistical measures.

2 Measuring Correlations

Fig. 3. A zoo of complex networks. In this qualitative space, three relevant characteristics are included: randomness, heterogeneity and modularity. The first introduces the amount of randomness involved in the process of building the network. The second measures how diverse the link distribution is, and the third would measure how modular the architecture is. The positions of the different examples are only a visual guide. The domain of highly heterogeneous, random hierarchical networks appears much more occupied than others. Scale-free like networks belong to this domain. Labels in the plot: mesh, regular trees, ER graph, modular ER graph, hierarchical modular, SF-like networks, electronic circuits, software graphs, proteome, metabolic maps, Internet, semantic nets, cortical maps, food webs and mutualistic webs; sub-panels are labeled a, b and c.

Beyond the degree distribution and average statistical measures, correlation measures offer considerable insight into the structural properties displayed by complex networks. One particularly interesting measure is network assortativeness [16]. Some networks show assortative mixing (AM): high-degree vertices tend to attach to other high-degree vertices. At the other extreme there are graphs displaying disassortative mixing (DM), thus involving anticorrelation. The latter are common in most biological nets, whereas the former are common in social and collaboration networks. It has been suggested that the presence and sign of assortativeness in these nets can have deep implications for their resilience under node removal or disease propagation.

Fig. 4. Computing correlations in a network. Here two given, connected nodes $s_i, s_j$ are shown, displaying different degrees $k_i, k_j$ (in the example, $k_i = 4$ and $k_j = 3$). Since we are interested in the remaining degrees, a different value needs to be considered (here indicated as $q_i = 3$, $q_j = 2$).

Following a previous analysis [16], we will be interested here not in the degree distribution $P_k$ but instead in the remaining degree: the number of edges leaving the vertex other than the one we arrived along (Fig. 4). This new distribution $q(k)$ is obtained from:

$$q(k) = \frac{(k+1)P_{k+1}}{\langle k \rangle} \tag{1}$$

where $\langle k \rangle = \sum_k k P_k$. Let $q_c(j, k)$ be the joint probability that the two vertices at the ends of a randomly chosen edge have remaining degrees $j$ and $k$. In a network with no assortative (or disassortative) mixing, $q_c(j, k)$ takes the value $q(j)q(k)$. If there is assortative mixing, $q_c(j, k)$ will differ from this value and the amount of assortative mixing can be quantified by the connected degree-degree correlation function

$$\langle jk \rangle - \langle j \rangle \langle k \rangle = \sum_{jk} jk\, q_c(j,k) - \Big[\sum_{j} j\, q(j)\Big]^2 \tag{2}$$

where $\langle \ldots \rangle$ indicates an average over edges.

The correlation function is zero for no assortative mixing and positive or negative for assortative or disassortative mixing, respectively. In order to compare different networks, it is normalized by dividing by its maximal value, which it achieves on a perfectly assortative network, i.e., one with $q_c(j, k) = q(k)\delta_{jk}$. This value is equal to the variance $\sigma_q^2 = \sum_k k^2 q(k) - \big[\sum_k k\, q(k)\big]^2$ of the distribution $q(k)$, and hence the normalized correlation function is

$$r = \frac{1}{\sigma_q^2}\Big[\sum_{jk} jk\, q_c(j,k) - \Big(\sum_{j} j\, q(j)\Big)^2\Big] \tag{3}$$

As defined from the previous equation, we have $-1 \le r < 0$ for DM and $0 < r \le 1$ for AM. Both biological and technological nets tend to display DM, whereas social webs are clearly assortative.
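The quantities of Eqs. (1)-(3) can be estimated directly from an edge list. The sketch below is one possible way of doing so, under the assumption that the graph is undirected and simple and that each edge is counted in both directions, so that $q_c$ is symmetric; the function name and the example graph are hypothetical.

```python
from collections import defaultdict

def remaining_degree_stats(edges):
    """Return (q, qc, r) for an undirected simple graph given as an edge list.

    q[k]       : remaining-degree distribution, Eq. (1)
    qc[(j, k)] : symmetric joint distribution of remaining degrees over edges
    r          : normalized correlation, Eq. (3)
    """
    degree = defaultdict(int)
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1

    qc = defaultdict(float)
    weight = 1.0 / (2 * len(edges))  # each edge is counted in both directions
    for a, b in edges:
        j, k = degree[a] - 1, degree[b] - 1
        qc[(j, k)] += weight
        qc[(k, j)] += weight

    # The marginal of qc equals (k + 1) P_{k+1} / <k>, i.e. Eq. (1).
    q = defaultdict(float)
    for (j, _k), p in qc.items():
        q[j] += p

    mean = sum(k * p for k, p in q.items())
    variance = sum(k * k * p for k, p in q.items()) - mean ** 2
    covariance = sum(j * k * p for (j, k), p in qc.items()) - mean ** 2
    r = covariance / variance if variance > 0 else float("nan")
    return dict(q), dict(qc), r

# Hypothetical example: a path of four nodes, 0-1-2-3.
q, qc, r = remaining_degree_stats([(0, 1), (1, 2), (2, 3)])
```

For this small path the end nodes (remaining degree 0) attach only to inner nodes (remaining degree 1), so the sketch returns a negative $r$ (approximately −0.5), i.e. disassortative mixing. For regular graphs the variance vanishes and $r$ is undefined; the sketch returns NaN in that case.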

Correlation functions have been widely used both in statistical physics [17] and nonlinear dynamics [18]. A closely related, and more general, approach involves the use of information-based measures [19-21]. One especially important quantity is the so-called mutual information, which is a general measure of dependence between two variables [19,22]. Correlation functions measure linear relations, whereas mutual information measures general dependence and is thus a less biased statistic. The relevance of this difference is illustrated by the analysis of chaotic dynamical systems: mutual information makes it possible to determine the independent variables for (re-)constructing phase trajectories [23]. This cannot be done from linear correlation functions. Additionally, the definition of mutual information within the context of communication channels involves additional statistical quantities (such as channel entropy and noise) that provide a detailed characterization of a system's complexity. Here we show how these quantities can be properly defined for complex networks, how they correlate with other statistical measures, and what their meaning and implications are.

3 Entropy and Information

By using the previous distribution $q = (q(1), ..., q(i), ..., q(N))$, an entropy measure $H(q)$ can be defined:

$$H(q) = -\sum_{k=1}^{N} q(k) \log q(k) \tag{4}$$

The entropy of a network will be a measure of uncertainty [19]. Within the context of complex nets, it provides an average measure of the network's heterogeneity, since it measures the diversity of the link distribution. The maximum, $H_{max}(q) = \log N$, is obtained for $q(i) = 1/N$ $(\forall i = 1, ..., N)$, and the minimum, $H_{min}(q) = 0$, occurs when $q = (1, 0, ..., 0)$. In an information channel, there is a distinction between source and destination. Given the symmetric character of our system, no such distinction is made here. In Fig. 5 we can see the impact of heterogeneity on entropy. Specifically, we computed the entropy $H(q; \gamma, \xi)$ for a distribution $P_k \sim k^{-\gamma}\phi(k/\xi)$, using different scaling exponents $\gamma \in (2, 3)$ and cut-offs $\xi \in (0, 50)$. The impact of diversity (long tails) is obvious: it increases the uncertainty. As the scaling exponent increases or the cut-off decreases, the network becomes less heterogeneous and, as a result, a lower entropy is observed.
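The dependence of the entropy on $\gamma$ and $\xi$ can be explored numerically. The following is a minimal sketch, not the procedure used for Fig. 5: it assumes an exponential cut-off $\phi(k/\xi) = \exp(-k/\xi)$, a discrete distribution truncated at a hypothetical `k_max`, and natural logarithms without any further normalization.

```python
import math

def remaining_degree_distribution(gamma, xi, k_max=1000):
    """q(k) derived from P_k ~ k^(-gamma) exp(-k/xi), k = 1..k_max (truncation assumed)."""
    weights = {k: k ** (-gamma) * math.exp(-k / xi) for k in range(1, k_max + 1)}
    norm = sum(weights.values())
    P = {k: w / norm for k, w in weights.items()}
    mean_k = sum(k * p for k, p in P.items())
    # Eq. (1): q(k) = (k + 1) P_{k+1} / <k>, for k = 0..k_max - 1
    return {k - 1: k * p / mean_k for k, p in P.items()}

def entropy(q):
    """Shannon entropy of Eq. (4), natural logarithm; 0 log 0 is taken as 0."""
    return -sum(p * math.log(p) for p in q.values() if p > 0)

# Entropy for a few (gamma, xi) pairs; the chosen values are illustrative only.
H_broad = entropy(remaining_degree_distribution(2.0, 50.0))
H_steep = entropy(remaining_degree_distribution(3.0, 50.0))
H_short_cutoff = entropy(remaining_degree_distribution(2.0, 5.0))
```

Under these assumptions `H_broad` exceeds both `H_steep` and `H_short_cutoff`, in line with the qualitative trend described above.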

Fig. 5. Entropy of the remaining degree distribution obtained from a network with degree distribution $P_k \sim k^{-\gamma}\phi(k/\xi)$. Here $H(q) = -\int P_k \log P_k\, dk$ is shown against the scaling exponent $\gamma$ (from 2 to 3) and the cut-off $\xi$ (from 0 to 50). Here we have used an exponential cut-off, i.e. $\phi(k/\xi) = \exp(-k/\xi)$. As expected, the entropy becomes larger for smaller $\gamma$ and decreases as $\xi$ is reduced.

Similarly, the joint entropy can be computed by using the previous joint probabilities:

$$H(q, q') = -\sum_{k=1}^{N}\sum_{k'=1}^{N} q_c(k,k') \log q_c(k,k') \tag{5}$$

Here $q_c(k, k')$ is the joint probability, and it is normalized, i.e.:

$$\sum_{k=1}^{N}\sum_{k'=1}^{N} q_c(k,k') = 1 \tag{6}$$

Since it considers all possible pairs of edges, this entropy provides a measure of the average uncertainty of the network. As before, it can be understood in terms of a measure of the diversity of linked pairs with given remaining degrees.

The mutual information $I(q)$ of a given system is defined by means of the difference:

$$I(q) = H(q) - H_c(q|q') \tag{7}$$

where the last term $H_c(q|q')$ is the conditional entropy, which involves a different set of conditional probabilities $\pi(k|k')$ [19]. They give the probability of observing a vertex with $k$ edges leaving it provided that the vertex at the other end of the chosen edge has $k'$ leaving edges. This entropy (the "noise" in our graph) is defined as:

$$H_c(q|q') = -\sum_{k=1}^{N}\sum_{k'=1}^{N} q(k')\,\pi(k|k') \log \pi(k|k') \tag{8}$$

Since the conditional and joint probabilities are related through:

$$\pi(k|k') = \frac{q_c(k,k')}{q(k')} \tag{9}$$

the conditional entropy can actually be computed in terms of the two previous distributions:

$$H_c(q|q') = -\sum_{k=1}^{N}\sum_{k'=1}^{N} q_c(k,k') \log \frac{q_c(k,k')}{q(k')} \tag{10}$$

We thus have, from the previous expressions,

$$\begin{aligned} I(q) &= H(q) - H_c(q|q') \\ &= -\sum_{k=1}^{N}\sum_{k'=1}^{N} q_c(k,k') \log q(k) + \sum_{k=1}^{N}\sum_{k'=1}^{N} q_c(k,k') \log \pi(k|k') \\ &= -\sum_{k=1}^{N}\sum_{k'=1}^{N} q_c(k,k') \log \frac{q(k)}{\pi(k|k')} \end{aligned} \tag{11}$$

which gives a final form for the information transfer function$^4$:

$$I(q) = \sum_{k=1}^{N}\sum_{k'=1}^{N} q_c(k,k') \log \frac{q_c(k,k')}{q(k)q(k')} \tag{12}$$

$^4$ The previous measures can be extended (with some care) to continuous distributions. In this case, we must assume that the continuous counterparts of the previous degree distributions can be defined. The new distributions are such that the normalization conditions $\int q(k)\,dk = 1$ and $\int\int q_c(k,k')\,dk\,dk' = 1$ hold. Provided that the distributions are well behaved, the information transfer is now given by $I(q) = \int\int q_c(k,k') \log\left(\frac{q_c(k,k')}{q(k)q(k')}\right) dk\,dk'$. Accordingly, entropy and noise would be obtained from $H(q) = -\int q(k) \log q(k)\,dk$ and $H_c(q|q') = -\int\int q_c(k,k') \log \pi(k|k')\,dk\,dk'$.

Some limit cases are of interest here. The first corresponds to the maximum information transfer, which is obtained, for a given $q(k)$, when $H_c(q|q') = 0$, i.e. when the conditional probabilities are such that $\pi(k|k') = 1$ or $0$ for all $k, k' = 1, ..., N$. Another is given by $\pi(k|k') = \delta_{k,k'}$. This case corresponds to a deterministic channel in standard information theory [19]. It implies that $q_c(k,k') = q(k')\delta_{k,k'}$, which is precisely the case of a perfectly assortative network [16].

In analogy with information channels, we can find a maximum value of the information, which we call the network's capacity $C = \max_{q(k)} I(q)$. There is no general method to compute $C$ for an arbitrary channel. It can only be computed in some specific cases.

By using the previous functions, we will measure three key quantities: (a) the amount of correlation between nodes in the graph, as measured by the information; (b) the noise level, as defined by the conditional entropy, which will provide a measure of assortativeness; and (c) the entropy of the $q(k)$ distribution. Since the total information involves the last two terms in a linear fashion, a noise-entropy space will be constructed and the distribution of real nets in this space will be analysed.

4 Model Networks

In the following sub-sections some simple, limit cases will be considered. Different types of architectures are represented by some standard networks exhibiting different degrees of heterogeneity and randomness. The list is far from exhaustive but gives an idea of the effects of each ingredient on information transfer and entropies.

4.1 Lattices and Trees

Lattice-like networks are common in some man-made architectures, particularly parallel computers [24-26]. These nets represent the highest degree of homogeneity and have no randomness. For a lattice, we have $P_k = \delta_{k,z}$, where $z$ is a fixed number of links per node and $\delta_{ij}$ is the Kronecker delta. For this ordered graph $\Omega_L$, we have

$$q(k) = \delta_{k,z-1} \tag{13}$$

$$q_c(k,k') = \delta_{k,z-1}\,\delta_{k',z-1} \tag{14}$$

and thus

$$I(q) = H(q) = H_c(q|q') = 0 \tag{15}$$

This is a trivial case, since the homogeneous character of the degree distribution implies zero uncertainty. The same situation arises for a Cayley tree (Bethe lattice), where each node has exactly the same degree. Tree-like architectures are also common in designed systems, such as small-sized software graphs [27] and communication networks.

Fig. 6. Homogeneous networks: here a lattice (a) and a regular tree (b) are shown as examples of deterministic nets. In both cases each node has the same degree and thus both the entropy and the noise are zero. In (c) a random, Erdos-Renyi graph is shown. Here some amount of heterogeneity is at work, but the variance equals the mean and both noise and entropy are very close, giving as a result a small information, i.e. no correlations (in the $N \to \infty$ limit).
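The entropy, noise and information of Sect. 3 can be evaluated for any joint distribution given as a table. The sketch below is an illustrative implementation of Eqs. (4), (10) and (12), assuming $q_c$ is stored as a dictionary keyed by pairs of remaining degrees (a representation chosen for this example only). It checks the lattice case above and the perfectly assortative limit.

```python
import math

def marginals(qc):
    """Marginals of q_c(k, k'): summed over k' (first) and over k (second)."""
    q, q_prime = {}, {}
    for (k, kp), p in qc.items():
        q[k] = q.get(k, 0.0) + p
        q_prime[kp] = q_prime.get(kp, 0.0) + p
    return q, q_prime

def entropy(qc):
    """H(q) of Eq. (4), computed on the marginal of q_c."""
    q, _ = marginals(qc)
    return -sum(p * math.log(p) for p in q.values() if p > 0)

def noise(qc):
    """Conditional entropy H_c(q|q') of Eq. (10)."""
    _, q_prime = marginals(qc)
    return -sum(p * math.log(p / q_prime[kp])
                for (_, kp), p in qc.items() if p > 0)

def information(qc):
    """Information transfer I(q) of Eq. (12)."""
    q, q_prime = marginals(qc)
    return sum(p * math.log(p / (q[k] * q_prime[kp]))
               for (k, kp), p in qc.items() if p > 0)

# Lattice with z = 4 links per node: all remaining degrees equal z - 1 = 3.
lattice = {(3, 3): 1.0}

# Perfectly assortative toy case: q_c(k, k') = q(k') delta_{k,k'}.
assortative = {(0, 0): 0.5, (1, 1): 0.5}
```

With these inputs all three measures vanish for `lattice`, whereas `assortative` gives zero noise and an information equal to its entropy, $\log 2$.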

4.2 Erdos-Renyi Graphs

Erdos-Renyi graphs $\Omega_{N,p}$ are random graphs such that two nodes are joined with some probability $p$. These types of graphs have been widely used as the backbone of null models of genetic [28], ecological [29] and neural [30] networks. They also seem appropriate for describing the topology of species-poor ecosystems [31]. The distributions are single-scaled and thus low uncertainty and high randomness are at work. The average degree will be $\langle k \rangle \approx pN$, and it can be easily shown that the probability $P_k$ that a vertex has degree $k$ follows a Poisson distribution $P_k = e^{-\langle k \rangle}\langle k \rangle^{k}/k!$, and thus

$$q(k) = \frac{(k+1)\, e^{-\langle k \rangle} \langle k \rangle^{k+1}}{\langle k \rangle\,(k+1)!} = P_k \tag{16}$$

For this random graph, the independence associated with the link assignment implies (for $N$ large) $q_c(k,k') = q(k)q(k')$ and thus information transfer is zero.
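The identity $q(k) = P_k$ for a Poisson degree distribution is easy to confirm numerically. The following sketch applies Eq. (1) to a Poisson distribution; the mean degree used is a hypothetical value chosen for illustration.

```python
import math

def poisson(k, mean):
    """Poisson probability P_k with the given mean degree <k>."""
    return math.exp(-mean) * mean ** k / math.factorial(k)

def remaining_degree(k, mean):
    """Eq. (1) applied to a Poisson degree distribution."""
    return (k + 1) * poisson(k + 1, mean) / mean

mean_degree = 4.0  # hypothetical value of <k> = pN

# Pairs (q(k), P_k) for the first few degrees; the two entries coincide.
pairs = [(remaining_degree(k, mean_degree), poisson(k, mean_degree))
         for k in range(10)]
```

Each pair in `pairs` agrees up to floating-point rounding, which is the content of Eq. (16).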

An interesting extension of the standard ER graph makes it possible to introduce modularity into the graph structure [15]. In general, the graph $\Omega$ is partitioned into $m$ subgraphs $\Omega_i$ $(i = 1, ..., m)$ of relative size $\eta_i = |\Omega_i|/N$ such that

$$W = \bigcup_{i=1}^{m} W_i \quad (W_j \cap W_k = \emptyset,\ j \neq k) \tag{17}$$

and such that

$$\sum_{i} \eta_i = 1 \tag{18}$$

All nodes $s_j \in \Omega_i$ are connected with probability $q$ and additionally we have a probability $p$ of connecting two nodes belonging to different modules. For equal-sized modules, the average degree of this system is $\langle k \rangle = qN/m + (m-1)Np/m$. Given the random wiring, it is not difficult to show that for large $N$ information will typically be very small.

4.3 Star Graph

Star graphs define another extreme within the universe of complex nets. Although no real network is likely to be described in terms of a pure star graph, it is certainly a common motif in many graphs. Star-like motifs are largely responsible for the short distances achieved in SF networks. Besides, a star graph can be shown to be optimal for low-cost communication [32].

This graph $\Omega_*$ is characterized by a degree distribution:

$$p(k) = \frac{N-1}{N}\,\delta_{k,1} + \frac{1}{N}\,\delta_{k,N-1} \tag{19}$$

The corresponding distribution $q(k)$ is:

$$q(k) = \frac{1}{2}\left[\delta_{k,0} + \delta_{k,N-2}\right] \tag{20}$$

and the joint probabilities are reduced to:

$$q_c(k,k') = \delta_{k,N-2}\,\delta_{k',0} \tag{21}$$

The entropy is maximal, given by:

$$H(q) = -q(0) \log q(0) - q(N-2) \log q(N-2) \tag{22}$$

which gives $H(q) = \log 2$. The noise term is $H_c(q|q') = 0$, since the conditional probabilities $\pi(k|k')$ take only the values 0 or 1. The information is thus maximal, with $I(q) = H(q) = \log 2$. The star graph displays maximum information, as expected given the deterministic character of the conditional probabilities.
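The star-graph values can be checked from an explicit edge list. The sketch below is illustrative only: it assumes an undirected graph and, unlike Eq. (21), counts every edge in both directions so that the joint distribution is symmetric. The graph size and the function name are hypothetical.

```python
import math
from collections import defaultdict

def information_measures(edges):
    """Return (H, Hc, I) for the remaining degrees of an undirected graph."""
    degree = defaultdict(int)
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1

    # Symmetrized joint distribution of remaining degrees over edge ends.
    qc = defaultdict(float)
    weight = 1.0 / (2 * len(edges))
    for a, b in edges:
        j, k = degree[a] - 1, degree[b] - 1
        qc[(j, k)] += weight
        qc[(k, j)] += weight

    q = defaultdict(float)
    for (j, _k), p in qc.items():
        q[j] += p

    H = -sum(p * math.log(p) for p in q.values())                      # Eq. (4)
    Hc = -sum(p * math.log(p / q[kp]) for (_k, kp), p in qc.items())   # Eq. (10)
    return H, Hc, H - Hc                                               # Eq. (7)

N = 10  # hypothetical size: one hub (node 0) and N - 1 leaves
star = [(0, leaf) for leaf in range(1, N)]
H, Hc, I = information_measures(star)

# For comparison, a ring in which every node has degree 2.
ring = [(i, (i + 1) % N) for i in range(N)]
H_ring, Hc_ring, I_ring = information_measures(ring)
```

Under these assumptions the star yields entropy and information close to $\log 2$ with vanishing noise, while the homogeneous ring gives values close to zero for all three measures.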

5 Real Networks

In this section we present some analysis of the information measures as applied to real networks. A large set of both technological and biological graphs has been studied. Specifically, three groups of data sets were used in our analysis, all of them known to be highly heterogeneous and to display scale-free architecture:

1. Metabolic networks: a graph-theoretic representation of the biochemical reactions taking place in a given metabolic network can be easily constructed. In this representation, a metabolic network is built up of nodes, the substrates, that are connected to one another through links, which are the actual metabolic reactions [33].

2. Software class diagrams: nodes are software components and links are relationships between software components. Class diagrams constitute a well-known example of such graphs [34,35].

3. Electronic circuits: they can be viewed as networks in which vertices (or nodes) are electronic components (e.g. logic gates in digital circuits and resistors, capacitors, diodes and so on in analog circuits) and connections (or edges) are wires in a broad sense [6].

Fig. 7. Information transfer and assortativeness appear to be roughly correlated in a negative way. Here several systems have been used (all those analysed in this paper) and a linear interpolation has been displayed. Although the trend is clear, considerable variance can be appreciated, probably due to the underlying nonlinear mapping between both measures. Axes: information $I(q)$ (horizontal) and assortativeness $r$ (vertical).

In Table 1 we also show a list of selected networks obtained from very different systems and ordered from higher to lower information. The system's size $N$, average connectivity $\langle k \rangle$, information measures and the assortative mixing coefficient $r$ are provided. We can see that most nets are disassortative, as predicted in [16]. Actually, information and $r$ appear to be negatively correlated. This is shown in Fig. 7, where $r$ is shown against $I(q)$ for different systems.

It is important to see that, in spite of the roughly negative correlation (a linear interpolation has been used), a large variance is observable, and a range of r values is associated to each information transfer. Such a variable plot is likely


[Figure 8: noise Hc(q|q′) (vertical axis, 0 to 5) plotted against entropy H(q) (horizontal axis, 0 to 6). Point labels: Software, Metabolism, Electronic circuits, Homogeneous graphs, Star graph. Panels (a) and (b).]

Fig. 8. (a) Noise-entropy plot for different real networks, both natural and artificial. Here electronic circuits (open circles), metabolic maps (triangles) and software maps (squares) are shown to be close to the zero-information line, i.e. where entropy equals noise. (b) An example of a software graph that deviates significantly from the line H = Hc. The network is small and has a rather particular shape, involving a large hub plus another cluster of connected classes.

to be the result of the nonlinear character of the information transfer, not shared by the (linear) correlation defined by assortative mixing measures.

By displaying noise against entropy, the general picture that emerges is that the set of complex networks analysed here typically displays uncorrelated structure. This is clear from the strongly linear dependence shown between noise and entropy (Fig. 8). If two given, randomly chosen nodes with remaining degrees k, k′ are typically connected with some probability, roughly irrespective of their mutual degree (i.e. low assortativeness is present), we should expect:

$$q_c(k, k') \approx q(k)\, q(k') \tag{23}$$

and thus we would have

$$\pi(k|k') \approx q(k) \tag{24}$$

In this case, the noise will be given by:

$$H_c(q|q') = -\sum_{k=1}^{N} \sum_{k'=1}^{N} q(k, k') \log \pi(k|k') \tag{25}$$

$$= -\sum_{k=1}^{N} \sum_{k'=1}^{N} q(k, k') \log q(k) = H(q) \tag{26}$$
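The step from a factorized joint distribution to equal noise and entropy can be checked numerically. The following is a minimal sketch, not the authors' code. It assumes that the conditional probability is π(k|k′) = q_c(k, k′)/q(k′) and that logarithms are taken in base 2 (which is consistent with the star graph row of Table 1); the function names and the example distribution are hypothetical.

```python
import math

def entropy(q):
    """Shannon entropy H(q) in bits (base-2 logarithm is an assumption)."""
    return -sum(p * math.log2(p) for p in q if p > 0)

def noise(joint):
    """Hc(q|q') = -sum_{k,k'} q_c(k,k') log2 pi(k|k'),
    assuming pi(k|k') = q_c(k,k') / q(k')."""
    size = len(joint)
    # Marginal q(k') obtained by summing the joint distribution over k.
    q_prime = [sum(joint[k][kp] for k in range(size)) for kp in range(size)]
    return -sum(
        joint[k][kp] * math.log2(joint[k][kp] / q_prime[kp])
        for k in range(size)
        for kp in range(size)
        if joint[k][kp] > 0
    )

# Hypothetical remaining-degree distribution and its uncorrelated joint q(k)q(k').
q = [0.5, 0.3, 0.2]
uncorrelated = [[a * b for b in q] for a in q]
h = entropy(q)
hc = noise(uncorrelated)

# Star graph: every edge joins a leaf (class 0) to the hub (class 1),
# so the joint distribution is purely off-diagonal.
star = [[0.0, 0.5], [0.5, 0.0]]
h_star = entropy([0.5, 0.5])
hc_star = noise(star)
```

For the uncorrelated case `hc` coincides with `h` up to rounding, which is the straight line Hc(q|q′) = H(q). The star graph sits at the opposite extreme: one bit of entropy and zero noise, since the remaining degree at one end of an edge fully determines the other.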

Statistical independence among node degrees thus gives a predicted straight line Hc(q|q′) = H(q), which seems to be suggested by our data. A first conclusion from this analysis is that network correlations in real graphs are small, the diversity of pairs of linked nodes being a direct consequence of the heterogeneous character of the degree distribution and nothing else. Although the cloud of points deviates from the straight line, these deviations might result from finite-size effects. Actually, if we plot the information measure I(q;N) against system size N, it can be shown to follow the scaling I(q;N) ∼ N^{-1}.

Table 1. Information-based measures computed for different real and theoretical systems. For each subset the list is ordered from higher to lower information transfer.

Network type              N      ⟨k⟩     I(q)   H(q)   Hc(q|q′)   r

Technological networks
Software 1                168    2.81    1.19   3.04   1.85       -0.39
Software 2                159    4.19    1.03   3.99   2.97       -0.41
Internet AS               3200   3.56    0.50   4.77   4.27       -0.22
Software 3                1993   5.00    0.30   4.82   4.51       -0.08
Circuit TV                320    3.17    0.23   1.37   1.14       0.010
Circuit EC05              899    4.14    0.15   2.98   2.82       -0.15
Software linux            5285   4.29    0.12   4.47   4.35       -0.06
Powergrid                 4941   2.67    0.06   3.01   2.95       0.003

Biological networks
Silwood park              154    4.75    0.94   4.09   3.14       -0.31
Ythan estuary             134    8.67    0.53   4.74   4.21       -0.24
p53 subnetwork            139    5.09    0.46   4.00   3.54       -0.24
Metabolic map             1173   4.84    0.39   3.58   3.19       -0.17
Neural net (C. elegans)   297    14.5    0.37   5.12   4.74       -0.16
Metabolic map             821    4.76    0.37   3.46   3.09       -0.18
Romanian syntax           5916   5.65    0.31   5.45   5.14       -0.18
Proteome map              1458   2.67    0.24   3.85   3.61       -0.21

Theoretical systems
Star graph                17     1.88    1.00   1.00   0.00       -1.00
Barabasi-Albert           3000   3.98    0.25   4.12   3.85       -0.078
Erdos-Renyi               300    6.82    0.06   3.31   3.25       -0.005
Modular E-R               500    10.3    0.04   3.67   3.62       -0.001

Two points clearly deviate from the general pattern displayed by the majority of networks analysed here. Both are small systems and correspond to software graphs, and one of them is shown in Fig. 8b. As we can see, this is a rather peculiar system, involving a large hub connected to a small module. It is thus a small structure dominated by the star graph component together with a homogeneous component. Such a nonuniform structure is likely to result from a process dealing with engineered, small-sized systems but unlikely to result from


a natural process or from artificial evolution when some complexity thresholds are reached.

One possible explanation for the previous result is that correlations simply do not play any particular role in shaping network architecture (footnote 5). However, it could also be argued that such a lack of correlation has been either chosen or selected for some underlying reason. But there is also another (more likely) scenario: that the observed structures are actually the only possible choices, at least when some complexity threshold is reached.

6 Simulated Annealing Search

The spread of real networks close to the zero-information boundary suggests that the set of possible structures allowed to occur (with a given heterogeneity and a given correlation) is rather constrained. This might be a consequence of the irrelevance of correlations for these systems, but it could also be the case that some selective pressure acts towards heterogeneous networks with small correlations (i.e. no assortativeness).

In order to test the previous idea we can perform a Monte Carlo search in network space. Specifically, we explore the space of possible entropy-noise pairs available to candidate graphs Ω, i.e. Γ = {H(q), Hc(q|q′)}, which is constrained by two well-defined boundaries (footnote 6):

$$\partial_1 \Gamma = \{ (H(\Omega), H_c(\Omega)) \mid H_c(\Omega) = 0 \} \tag{27}$$

$$\partial_2 \Gamma = \{ (H(\Omega), H_c(\Omega)) \mid H(\Omega) = H_c(\Omega) \} \tag{28}$$

where H(Ω) and Hc(Ω) indicate the entropy and noise associated to a given graph Ω. It is not difficult to show that only two points occupy the lower boundary, i.e. ∂1Γ = {(0, 0), (log 2, 0)}. These correspond to purely homogeneous graphs and the star graph, respectively. The second boundary has already been studied.

For every random sample point (H, Hc), an optimizing search process looks for candidate networks that minimize the error term or potential function U(Ω):

$$U(\Omega) = \sqrt{ (H - H(\Omega))^2 + (H_c - H_c(\Omega))^2 } \tag{29}$$

Here, we use the Boltzmann strategy presented in [36,37]. The algorithm explores the search space defined by all possible networks of N nodes. We assume that every possible state visited by the search process can be properly characterized by the scalar U_i. In the stationary limit (for a large number of searchers) we define the occupation probability p_i(t) of a certain state i at time t. We require the optimization process to increase the occupation probability for the state of minimal potential. In general, many local minima exist and the search could be trapped in one of these states, which is undesirable.

Footnote 5: This conclusion is reached under our specific, quasi-local definition of remaining degree. Other approaches, considering instead shortest paths among nodes, might reveal important differences.

Footnote 6: Strictly speaking, we are considering the entropies associated to the remaining degree distribution of a graph sampled from some graph ensemble by a stochastic process.

A dynamics that finds the minimum is given by:

$$\frac{dp_i(t)}{dt} = \sum_{i \neq j} A_{ij}\, p_i(t) - A_{ji}\, p_j(t) \tag{30}$$

where

$$A_{ij} = A^0_{ij} \times \begin{cases} 1 & U_i < U_j \\ \exp\left( -(U_i - U_j)/T(t) \right) & U_i \geq U_j \end{cases} \tag{31}$$

is the transition probability for the searcher to move from state i to state j. The term A^0_ij is 1 if and only if state j can be reached by a small change or mutation, and 0 otherwise. Here, the valid changes are edge addition, edge removal and edge rewiring, each selected with the same probability. The number of nodes of the network is always fixed. Transitions to lower-energy states are always accepted, but trapping in local minima is avoided thanks to thermal fluctuations, as in simulated annealing. As the search progresses, the temperature T(t) is decreased following a power-law rule:

$$T(t) = \frac{T_0}{1 + a t} \tag{32}$$

where T0 is the initial temperature (or starting degree of disorder) and a is the cooling rate. This allows the optimization process to perform a smooth transition from coarse to detailed search. The process starts from a random graph of N nodes with a given connectivity ⟨k⟩ and lasts a given number of simulation steps.
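The search described above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the implementation used for the results. It assumes that the remaining degree of a node is its degree minus one, that entropies are measured in bits, and that a candidate with lower potential is always accepted while a worse one is accepted with probability exp(−ΔU/T(t)). The function names, the small default sizes and the best-so-far bookkeeping are all hypothetical additions.

```python
import math
import random
from collections import Counter

def entropy_and_noise(edges):
    """Entropy H(q) and noise Hc(q|q') in bits for an undirected edge set."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    joint = Counter()  # counts of (remaining degree, remaining degree) pairs
    for u, v in edges:
        a, b = degree[u] - 1, degree[v] - 1
        joint[(a, b)] += 1  # each edge is seen from both of its ends
        joint[(b, a)] += 1
    total = sum(joint.values())
    if total == 0:
        return 0.0, 0.0
    marginal = Counter()
    for (a, _), count in joint.items():
        marginal[a] += count
    h = -sum(c / total * math.log2(c / total) for c in marginal.values())
    hc = -sum(c / total * math.log2(c / marginal[b]) for (_, b), c in joint.items())
    return h, hc

def potential(edges, target):
    """Distance between the graph's (H, Hc) pair and the target pair."""
    h, hc = entropy_and_noise(edges)
    return math.hypot(target[0] - h, target[1] - hc)

def mutate(edges, n, rng):
    """Edge addition, removal or rewiring, chosen with equal probability."""
    new = set(edges)
    move = rng.choice(("add", "remove", "rewire"))
    if move != "add" and new:
        u, v = rng.choice(sorted(new))
        new.discard((u, v))
        if move == "rewire":  # keep endpoint u, pick a new partner w
            w = rng.choice([x for x in range(n) if x not in (u, v)])
            new.add((min(u, w), max(u, w)))
    elif move == "add":
        u, v = rng.sample(range(n), 2)
        new.add((min(u, v), max(u, v)))
    return new

def anneal(target, n=40, mean_k=3.0, t0=0.01, a=0.002, steps=3000, seed=0):
    rng = random.Random(seed)
    edges = set()
    while len(edges) < int(mean_k * n / 2):  # random initial graph
        u, v = rng.sample(range(n), 2)
        edges.add((min(u, v), max(u, v)))
    u_cur = u_start = potential(edges, target)
    best, u_best = set(edges), u_cur
    for t in range(steps):
        temp = t0 / (1 + a * t)  # cooling schedule T(t) = T0 / (1 + a t)
        cand = mutate(edges, n, rng)
        u_new = potential(cand, target)
        if u_new < u_cur or rng.random() < math.exp(-(u_new - u_cur) / temp):
            edges, u_cur = cand, u_new
            if u_cur < u_best:
                best, u_best = set(edges), u_cur
    return best, u_start, u_best

best_graph, u_initial, u_final = anneal(target=(3.0, 2.8))
star_h, star_hc = entropy_and_noise({(0, i) for i in range(1, 17)})
```

The defaults are far smaller than the N = 500 and 350000 steps quoted in the text, so the sketch runs quickly but will not reproduce the reported densities. Repeating `anneal` over many random targets and recording which ones reach a small final potential would give a rough analogue of the sampling procedure.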

By measuring the final error ε(Ω) = U(Ω) for a large number of Monte Carlo samples it is possible to approximate the likelihood of a particular candidate network. Here we have used ε = 0.003 and the optimization parameters are: N = 500, ⟨k⟩ = 3, T0 = 0.01, a = 0.002 and 350000 steps. Our results indicate that the potential is effectively minimized only for a domain of pairs (H, Hc) along the second boundary ∂2Γ. In Fig. 9a the (smoothed) probability density P(H, Hc) of optimized networks is shown (for the upper part of the parameter space, P(H, Hc) = 0). The distribution is peaked around a domain of Γ that fits very well the range of values satisfied by most real networks (compare with Fig. 8). There is also a clearly empty zone outside this domain, indicating that networks there are difficult or simply impossible to find. An example of the optimized graphs is shown in Fig. 9b. This particular graph is scale-free, with an exponent γ ≈ 2.26 and a cut-off at ξ ∼ 50. The other networks in this domain are also SF, with an average scaling exponent ⟨γ⟩ ∼ 2.5. By searching candidate networks that simultaneously fit the two requirements of given entropy and noise, the only possible solutions to be found are scale-free graphs with small levels of


[Figure 9a: probability density P(H, Hc) (scale 0 to 0.2) over entropy H(q) (horizontal axis) and noise H(q|q′) (vertical axis), both ranging up to 6.]

Fig. 9. (a) Noise-entropy probability plot obtained by exploring the Γ space using Monte Carlo sampling. Different pairs of noise and entropy are generated and a simulated annealing search is performed looking for candidate networks. The smoothed probability distribution obtained from this algorithm is shown. The highest density of observed networks appears to be close to the same domain observed for real networks. (b) An example of a small-sized network (N = 142, ⟨k⟩ = 2.64), together with its degree distribution (c). The cumulative degree distribution follows a power law with exponent −1.26 (i.e. γ = 2.26). The graph has been obtained close to the boundary H = Hc (with H(q) = 3.69, H(q|q′) = 3.09).

correlations. Interestingly, software networks deviate from this rule and are to be found along the upper region of the boundary (H > 4), where the potential is not minimum. This might be a signature of frustrated optimization in software design processes [10].

7 Discussion

Complex networks display heterogeneous structures that result from different mechanisms of evolution [38]. Some are created through multiplicative processes (such as preferential attachment) while others seem to be well described in terms of optimization mechanisms [9]. Our study indicates that the possible universe of complex networks is actually rather constrained. Networks display scale-free architecture but also small assortativeness. The search algorithm, instead of assuming the presence of a given predefined mechanism of network growth, simply searches for candidate solutions to an optimization problem that tries to approach simultaneously a given amount of network heterogeneity and of correlations. The result is that networks are indeed scale-free and involve a low degree of correlations, but such a situation is constrained to a well-defined domain. This domain is remarkably similar to the one inhabited by real graphs. Outside this domain, it is not feasible to find graphs simultaneously satisfying the two requirements.


The impact of SF architecture on biological and artificial networks is clearly different. Although the first can take advantage of the high homeostasis provided by scaling laws, the second are completely dependent on the correct functioning of all units. Failure of a single diode in a circuit or of a single component in a software system leads to system collapse. Thus, homeostasis cannot be a general explanation for scaling. We have conjectured that the leading force here is an optimization process where reliable communication at low cost shapes network architecture in the first place [38]. The need for a sparse graph can be a consequence of different requirements. In an electronic circuit, saving wire is a strong constraint. In metabolic or genetic networks, sparseness might be important in order to reduce the impact of unstable positive feedbacks. This can be satisfied by means of sparse graphs displaying scale-free architecture. What is the role of correlations? For the systems analysed here, correlations do not seem to be of relevance to network performance. But what is more important: the lack of networks outside the densely populated domain is not due to some relevant, perhaps adaptive trait. It is actually a consequence of higher-level limitations imposed on network architecture.

Such a constrained set of possibilities fits very well the view of evolution as strongly dominated by intrinsic constraints [39-41] (see also [42] for a critical discussion). Under this view, the outcome of evolutionary searches would be not any possible architecture from the set of possible patterns but a choice from a narrow subset of attainable structures.

Acknowledgments

The authors thank the members of the Complex Systems Lab for useful discussions and an anonymous referee for valuable comments. This work was supported by grant BFM2001-2154, the FET Open Project IST DELIS and the Santa Fe Institute.

References

1. R. Albert and A.-L. Barabasi. Statistical Mechanics of Complex Networks. Rev. Mod. Phys. 74, 47-97 (2002).

2. S. N. Dorogovtsev and J. F. F. Mendes. Evolution of networks. Adv. Phys. 51, 1079-1187 (2002).

3. S. N. Dorogovtsev and J. F. F. Mendes. Evolution of Networks: from biological nets to the Internet and WWW. Oxford U. Press, Oxford (2003).

4. S. Bornholdt and H. G. Schuster, eds. Handbook of Graphs and Networks: From the Genome to the Internet. Springer, Berlin (2002).

5. L. A. N. Amaral, A. Scala, M. Barthelemy and H. E. Stanley. Classes of behavior of small-world networks. Proc. Nat. Acad. Sci. USA 97, 11149-11152 (2000).

6. R. Ferrer, C. Janssen and R. V. Sole. Topology of Technology Graphs: Small World Patterns in Electronic Circuits. Phys. Rev. E 64, 32767 (2001).

7. A.-L. Barabasi and R. Albert. Emergence of scaling in random networks. Science 286, 509-512 (1999).


8. S. N. Dorogovtsev and J. F. F. Mendes. Accelerated growth of networks, in: Handbook of Graphs and Networks: From the Genome to the Internet, eds. S. Bornholdt and H. G. Schuster, pp. 320-343. Wiley-VCH, Berlin (2002).

9. R. Ferrer and R. V. Sole. Optimization in Complex Networks. Lect. Notes Phys. 625, 114-125 (2003).

10. S. Valverde, R. Ferrer and R. V. Sole. Scale free networks from optimal design. Europhys. Lett. 60, 512-517 (2002).

11. R. V. Sole, R. Pastor-Satorras, E. D. Smith and T. Kepler. A model of large-scale proteome evolution. Adv. Complex Systems 5, 43-54 (2002).

12. A. Vazquez, A. Flammini, A. Maritan and A. Vespignani. Modeling of protein interaction networks. Complexus 1, 38-44 (2002).

13. G. Caldarelli, A. Capocci, P. De Los Rios and M. A. Munoz. Scale-free networks from varying vertex intrinsic fitness. Phys. Rev. Lett. 89, 258702 (2002).

14. F. Menczer. Growing and navigating the small world web by local content. Proc. Nat. Acad. Sci. USA 99, 14014-14019 (2002).

15. E. Ravasz, A. L. Somera, D. A. Mongru, Z. N. Oltvai and A.-L. Barabasi. Hierarchical Organization of Modularity in Metabolic Networks. Science 297, 1551-1555 (2002).

16. M. E. J. Newman. Assortative mixing in networks. Phys. Rev. Lett. 89, 208701 (2002).

17. E. H. Stanley, S. V. Buldyrev, A. L. Goldberger, Z. D. Goldberger, S. Havlin, R. N. Mantegna, S. M. Ossadnik, C. K. Peng and M. Simon. Statistical mechanics in biology: how ubiquitous are long-range correlations? Physica A 205, 214-253 (1996).

18. H. D. Abarbanel, R. Brown, J. L. Sidorowich and L. S. Tsimring. The analysis of observed chaotic data in physical systems. Rev. Mod. Phys. 65, 1331-1392 (1993).

19. R. B. Ash. Information Theory. Dover, London (1965).

20. C. Adami. Introduction to Artificial Life. Springer, New York (1998).

21. W. Li. Mutual information versus correlation functions. J. Stat. Phys. 60, 823-837 (1990).

22. W. Li. On the relationship between complexity and entropy for Markov chains and regular languages. Complex Syst. 5, 381-399 (1991).

23. A. Fraser and H. Swinney. Independent coordinates for strange attractors from mutual information. Phys. Rev. A 33, 1134-1140 (1986).

24. C. Germain-Renaud and J. P. Sansonnet. Ordinateurs massivement paralleles. Armand Colin, Paris (1991).

25. V. M. Milutinovic. Computer Architecture. North Holland, Elsevier (1988).

26. W. D. Hillis. The Connection Machine. MIT Press, Cambridge, MA (1985).

27. S. Valverde, R. Ferrer and R. V. Sole. Scale-free networks from optimal design. Europhys. Lett. 60, 512-517 (2002).

28. S. A. Kauffman. Origins of Order. Oxford U. Press, New York (1993).

29. R. M. May. Stability and complexity in model ecosystems. Princeton U. Press, New York (1973).

30. S. Amari. Characteristics of random nets of analog neuron-like elements. IEEE Trans. Man and Cybernetics 2, 643-657 (1972).

31. J. M. Montoya and R. V. Sole. Topological properties of food webs: from real data to community assembly models. Oikos 102, 614-622 (2003).

32. R. Ferrer and R. V. Sole. Optimization in Complex Networks. Lect. Notes Phys. 625, 114-125 (2003).

33. H. Jeong, S. Mason, A.-L. Barabasi and Z. N. Oltvai. Lethality and centrality in protein networks. Nature 411, 41 (2001).


34. S. Valverde and R. V. Sole. Hierarchical small worlds in software architecture. Santa Fe Institute Working Paper 03-07-044.

35. C. R. Myers. Software systems as complex networks: structure, function, and evolvability of software collaboration graphs. Phys. Rev. E 68, 046116 (2003).

36. F. Schweitzer, W. Ebeling, H. Rose and O. Weiss. Network Optimization Using Evolutionary Strategies, in: Parallel Problem Solving from Nature - PPSN IV, eds. H.-M. Voigt, W. Ebeling, I. Rechenberg and H.-P. Schwefel. Lecture Notes in Computer Science, vol. 1141, pp. 940-949. Springer, Berlin (1996).

37. F. Schweitzer. Brownian Agents and Active Particles. Springer, Berlin (2002).

38. R. V. Sole, R. Ferrer-Cancho, J. M. Montoya and S. Valverde. Selection, tinkering and emergence in complex networks. Complexity 8(1), 20-33 (2002).

39. F. Jacob. Evolution as tinkering. Science 196, 1161-1166 (1976).

40. P. Alberch. The logic of monsters: evidence for internal constraint in development and evolution. Geobios 19, 21-57 (1989).

41. B. C. Goodwin. How the Leopard Changed Its Spots: the Evolution of Complexity. Charles Scribner's Sons, New York (1994).

42. S. J. Gould. The structure of evolutionary theory. Belknap, Harvard (2003).


4.3 Information Transfer and Phase Transitions in a Model of Internet Traffic


Physica A 289 (2001) 595–605
www.elsevier.com/locate/physa

Information transfer and phase transitions in a model of internet traffic

Ricard V. Sole (a,b,*), Sergi Valverde (a)

(a) Complex Systems Research Group, Department of Physics, FEN, Universitat Politecnica de Catalunya, Campus Nord B4, 08034 Barcelona, Spain

(b) Santa Fe Institute, 1399 Hyde Park Road, New Mexico 87501, USA

Received 4 October 2000

Abstract

In a recent study, Ohira and Sawatari presented a simple model of computer network traffic dynamics. These authors showed that a phase transition point is present separating the low-traffic phase with no congestion from the congestion phase as the packet creation rate increases. We further investigated this model by relaxing the network topology using a random location of routers. It is shown that the model exhibits nontrivial scaling properties close to the critical point, which reproduce some of the observed real Internet features. At criticality, the net shows maximum information transfer and efficiency. It is shown that some of the key properties of this model are shared by highway traffic models, as previously conjectured by some authors. The relevance to Internet dynamics and to the performance of parallel arrays of processors is discussed. © 2001 Published by Elsevier Science B.V. All rights reserved.

1. Introduction

The exchange of information in complex networks and how these networks evolve in time has been receiving increasing attention by physicists over the last few years [1,2]. Two main lines of research have been developed: (a) the analysis of the structural properties displayed by traffic networks [3] and (b) the analysis of dynamical patterns of information exchange. Recent studies have revealed that phase transition phenomena arise in Internet traffic and are amenable to quantitative analysis by means of appropriate tools from statistical physics [4,5].

The WWW is a virtual graph connecting nodes containing different amounts of information. This information flows through a physical support which also displays

* Corresponding author. Fax: +34-93-4017100.
E-mail address: [email protected] (R.V. Sole).

0378-4371/01/$ - see front matter © 2001 Published by Elsevier Science B.V. All rights reserved.
PII: S0378-4371(00)00536-7


scale-free behavior [3]. The network of computers is itself a complex system, and complex dynamics has been detected suggesting that self-similar patterns are also at work [6].

Some previous studies have shown evidence for critical-like dynamics in computer networks [7] in terms of fractal, 1/f noise spectra as well as long-tail distributions of some characteristic quantities. Some authors have even speculated about the possibility that the traffic of information through computer networks (such as the Internet) can display the critical features already reported in cellular automata models of traffic flow, such as the Nagel–Schreckenberg (NS) model [8–10]. The NS model shows that as one increases the density of cars ρ, a well-defined transition occurs at a critical density ρc. This transition separates a fluid phase showing no jams from the jammed phase where traffic jams emerge. At the critical boundary, the first jams are observed as back-propagating waves with fractal properties.

A number of both quantitative and qualitative observations of real computer network dynamics reveal some features of interest:

1. Extensive data mining from Internet/Ethernet traffic shows that it displays long-range correlations [6] with well-defined persistence, as measured by means of the Hurst exponent. This analysis totally rejects the previous theoretical approach based on Poisson (Markovian) models, which assume statistical independence of the arrival process of information.

2. Fluctuations in the density of packets show well-defined self-similar behavior over long time scales. This has been measured by several authors [7,11]. The power spectrum is typically a power law, although local (spatial) differences have been shown to be involved.

3. The statistical properties of Internet congestion reveal long-tailed (lognormal) distributions of latencies. Here latency times T_L are thus given by

$$P(T_L) = \frac{1}{T_L\, \sigma \sqrt{2\pi}} \exp\left( -\frac{(\ln T_L)^2}{2\sigma^2} \right)$$

Latencies are measured by performing series of experiments in which the round-trip times of ping packets are averaged over many messages sent between two given nodes.

4. There is a clear feedback between the bottom level, where users send their messages through the net and increase network activity (and congestion), and the top level described by the overall network activity. Users are responsible for the global behavior (since packets are generated by users) and the latter modifies the individual decisions (users will tend to leave the net if it becomes too congested).
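The long tail of the lognormal latency distribution mentioned in item 3 can be made concrete with a short numerical check. This is a hedged sketch: it assumes the standard lognormal density with width σ, and it adds an optional location parameter `mu` that does not appear in the text (`mu = 0` gives the plain form). The function names and the value σ = 1 are hypothetical.

```python
import math

def lognormal_pdf(t, sigma, mu=0.0):
    """Lognormal density of a latency t > 0 (mu is an assumed extra parameter)."""
    z = (math.log(t) - mu) / sigma
    return math.exp(-0.5 * z * z) / (t * sigma * math.sqrt(2.0 * math.pi))

def integrate_log_grid(f, lo, hi, n=20000):
    """Trapezoidal rule in x = ln t, so that the long tail is sampled evenly."""
    a, b = math.log(lo), math.log(hi)
    h = (b - a) / n
    total = 0.0
    for i in range(n + 1):
        t = math.exp(a + i * h)
        weight = 0.5 if i in (0, n) else 1.0
        total += weight * f(t) * t * h  # dt = t dx
    return total

sigma = 1.0
mass = integrate_log_grid(lambda t: lognormal_pdf(t, sigma), 1e-6, 1e6)
mean = integrate_log_grid(lambda t: t * lognormal_pdf(t, sigma), 1e-6, 1e6)
median = 1.0  # exp(mu) with mu = 0
```

The density integrates to one, while the mean, exp(σ²/2), lies well above the median: a few very slow round trips dominate the average, which is what a long-tailed latency distribution means in practice.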

On the other hand, previous studies on highway traffic dynamics revealed that the phase transition point presented by the models as the density of cars increased was linked with a high degree of unpredictability [12]. Interestingly, this unpredictability is maximal at criticality [13], as is the flow rate. In other words, efficiency and unpredictability are connected by the phase transition. In this paper, the previous conjecture linking


Internet dynamics with critical points in highway traffic is further explored. By considering a generalization of the Ohira–Sawatari (OS) model, we show that all the previously reported features of real traffic dynamics are recovered by the model. The paper is organized as follows. In Section 2, the basic model and its phase transition are presented. In Section 3 the self-similar character of the time dynamics is shown by means of the calculation of the latency times and queue distributions as well as by means of spectral and Hurst analysis. In Section 4, the efficiency and information transfer are calculated for different network sizes. In Section 5 our main conclusions and a discussion of their implications are presented.

2. Model of computer network traffic

Following the work by Ohira and Sawatari, let us consider a two-dimensional network with a square lattice topology with four nearest neighbors [14]. The network involves two types of nodes: hosts and routers. The first are nodes that can generate and receive messages and the second can only store and forward messages. Our square, L × L lattice will be indicated as ℒ(L), following previous notation [15]. All our simulations are performed using periodic boundary conditions. In previous papers, either the hosts were distributed through the boundary [14] (and thus the inner nodes were routers) or all nodes were both hosts and routers [15]. Here we consider a more realistic situation, where only a fraction ρ of the nodes are hosts and the rest are routers [14] (Fig. 1).

The location of each object, r ∈ ℒ(L), will be given by r = i c_x + j c_y, where c_x, c_y are Cartesian unit vectors. So the set of nearest neighbors C(r) is given by

$$C(\mathbf{r}) = \{ \mathbf{r} - \mathbf{c}_x,\ \mathbf{r} + \mathbf{c}_x,\ \mathbf{r} - \mathbf{c}_y,\ \mathbf{r} + \mathbf{c}_y \} \tag{1}$$

Each node maintains a queue of unlimited length where the arriving packets are stored. The local number of packets will be indicated as n(r, t) and thus the total number of packets in the system will be

$$N(t) = \sum_{\mathbf{r} \in \mathcal{L}(L)} n(\mathbf{r}, t) \tag{2}$$

The rules are defined as in the OS deterministic model (the stochastic version only shows the differences already reported by these authors [14]). The rules are as follows:

• Creation: The hosts create packets following a random uniform distribution with probability λ. Only another host can be the destination of a packet, which is also selected randomly. Finally, this new packet is appended at the end of the host's queue.

• Routing: Each node picks up the packet at the head of its queue and decides which outgoing link is better suited to the packet destination. Here, the objective is to minimize the communication time for any single message, taking into account only shortest paths and also avoiding congested links. First, the selected link is the one that points to a neighbour node that is nearer to the packet destination. Second, when two choices are possible, the less congested link is selected. The measure of congestion of a link is simply defined as the number of packets forwarded through that link. Once the node has made the routing decision, the packet is inserted at the end of the queue of the selected node and the counter of the outgoing link is incremented by one.

These rules are applied to each site, and L × L updates define our time step.

Fig. 1. Model network architecture (two-dimensional lattice, periodic boundary conditions). Two types of nodes are considered: hosts (gray squares), which can generate and receive messages, and routers (open circles), which can store and forward messages.

This model exhibits a phase transition similar to the one reported in previous studies [14,15]. It is shown in Fig. 2 for an L = 32 system with ρ = 0.08 (the same density is used in all our simulations). We can see that the transition occurs at a given λc ≈ 0.2. As occurs with models of highway traffic, the flow of packets is maximized at criticality, as shown in Fig. 2B, where the number of delivered packets (indicated as N_DP) is plotted.
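The creation and routing rules can be sketched as a small simulation. This is a simplified illustration under stated assumptions rather than the model used for the figures. It updates all nodes synchronously once per time step (the text describes L × L site updates per step), removes a packet as soon as it reaches its destination host, and breaks remaining ties by neighbor order. The names `simulate`, `rho` and `lam` are hypothetical labels for the host density and the packet creation probability.

```python
import random
from collections import deque

def torus_delta(a, b, size):
    """Signed shortest displacement from a to b on a ring of the given size."""
    d = (b - a) % size
    return d - size if d > size // 2 else d

def distance(p, q, size):
    """Shortest-path length between lattice sites with periodic boundaries."""
    return abs(torus_delta(p[0], q[0], size)) + abs(torus_delta(p[1], q[1], size))

def simulate(size=16, rho=0.08, lam=0.05, steps=200, seed=1):
    rng = random.Random(seed)
    nodes = [(i, j) for i in range(size) for j in range(size)]
    hosts = rng.sample(nodes, max(2, int(rho * size * size)))
    queues = {r: deque() for r in nodes}
    link_load = {}  # (from, to) -> number of packets forwarded so far
    created, delivered, latencies = 0, 0, []
    for t in range(steps):
        # Creation: each host emits a packet with probability lam.
        for h in hosts:
            if rng.random() < lam:
                dest = rng.choice([x for x in hosts if x != h])
                queues[h].append((dest, t))
                created += 1
        # Routing: every node forwards the packet at the head of its queue.
        moves = []
        for r in nodes:
            if not queues[r]:
                continue
            dest, born = queues[r].popleft()
            i, j = r
            nbrs = [((i - 1) % size, j), ((i + 1) % size, j),
                    (i, (j - 1) % size), (i, (j + 1) % size)]
            d0 = distance(r, dest, size)
            closer = [n for n in nbrs if distance(n, dest, size) < d0]
            # Among the neighbors nearer to the destination, take the least used link.
            nxt = min(closer, key=lambda n: link_load.get((r, n), 0))
            link_load[(r, nxt)] = link_load.get((r, nxt), 0) + 1
            moves.append((nxt, dest, born))
        for nxt, dest, born in moves:
            if nxt == dest:
                delivered += 1
                latencies.append(t + 1 - born)
            else:
                queues[nxt].append((dest, born))
    in_flight = sum(len(q) for q in queues.values())
    return created, delivered, in_flight, latencies

created, delivered, in_flight, latencies = simulate()
```

Running `simulate` for increasing `lam` and plotting the mean latency or the number of delivered packets is one way to look for the transition described above, although the synchronous update means the critical value need not match the one reported.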

3. Scaling and self-similarity

An example of the time series at criticality for the previous system is shown in Figs. 3A and B. It confirms our expectations and previous observations from real computer traffic: the local fluctuations in the number of packets n(r, t) are self-affine, as we can appreciate from an enlargement of the first plot. This is confirmed by the


Fig. 2. (A) Phase transition in network traffic. Here an L = 32 lattice has been used and the average latency has been computed over different, increasing intervals of T time steps, as indicated. The density of hosts is ρ = 0.08. (B) As a measure of efficiency, the number of delivered packets N_dp has been measured under the same conditions. We can see the optimum at the critical point λc ≈ 0.2. For λ < λc, N_dp increases linearly with λ, with slope ρL²T, corresponding to the number of released packets.

calculation of the power spectrum P(f). It is shown in Fig. 4 and scales as

$$P(f) \approx f^{-\beta} \qquad (\beta = 0.97 \pm 0.06) \tag{3}$$

There is some local variability in the value of the scaling exponent through space, but it is typically inside the interval −0.75 < β < 1.0, in agreement with data analysis [7,11].

The statistics of latencies and queue lengths lead to long tails close to criticality. Some examples of the results obtained are shown in Fig. 5. Here an L = 256 lattice has been used. Latencies are measured as the number of steps needed to travel from emitting hosts to their destinations. The distribution of latencies close to λc is lognormal, in agreement with the study of Huberman and Adamic for round-trip times of ping packets [12]. This means that there is a characteristic latency time but also very


Fig. 3. An example of the time-series dynamics of the number of packets n(r, t) at a given arbitrary node. Here L = 256, ρ = 0.08 and λ = λ_c = 0.055. We can see in (A) fluctuations of many sizes, which display self-affinity, as we can see from (B), where the fraction of the previous time series indicated by means of a window has been enlarged.

Fig. 4. Power spectrum P(f) computed from the time series shown in Fig. 3A. A well-defined scaling is at work over four decades, with β ≈ 1.


Fig. 5. (A) Log-normal distribution of latency times at criticality for an L = 256 system. Here λ_c ≈ 0.055. Inset: three examples of these distributions in log–log scale for three different λ values (as indicated); (B) Distributions of queue lengths for the same system at different rates λ. Scaling is observable at intermediate values close to criticality.

long tails: a high fluctuation regime is present. As λ goes into the congestion phase, longer times are present but also long tails. This is due to the fact that in this phase the number of packets is always increasing with time.

The distribution of queue lengths is equivalent to the distribution of jam sizes in the highway traffic model. As with the Nagel–Schreckenberg model, the distribution approaches a power law for λ ≈ λ_c but it also displays some bending at small values (and a characteristic cutoff at large values). This is probably the result of the presence of spatial structures, which propagate as waves of congestion and will be analyzed elsewhere (Valverde and Solé, in preparation).

4. Efficiency, uncertainty and information transfer

As mentioned in the introduction, models of highway traffic flow revealed that the flow of cars (and thus the system's efficiency) is maximal at the critical point, but that the unpredictability is also maximal.


Fig. 6. Information transfer for three different lattice sizes (as indicated). Information transfer grows rapidly close to criticality but reaches a maximum at some point λ* close to λ_c.

Efficiency can be measured in several ways. One is close to our model properties: efficiency is directly linked to information transfer and thus information-based measures can be used. Here we consider an information-based characterization of the different phases by means of the Markov partition Π. Specifically, the following binary choice is performed:

\[ \Pi = \{\, n(r) = 0 \Rightarrow S(r) = 0;\; n(r) > 0 \Rightarrow S(r) = 1 \,\}, \]

which essentially separates non-jammed from jammed nodes.

Information transfer is maximized close to second-order phase transitions [16,17] and should be maximum at λ_c. In order to compute this quantity we will make use of the previous partition Π. Let S(r) and S(k) be the binary states associated with two given hosts in the lattice ℒ. The Π-entropy for each host is given by

\[ H(r) = -\sum_{S(r)=0,1} P(S(r)) \log P(S(r)) \qquad (4) \]

and the joint entropy for each pair of hosts,

\[ H(r,k) = -\sum_{S(r),S(k)=0,1} P(r,k) \log P(r,k), \qquad (5) \]

where for simplicity we use P(r, k) ≡ P(S(r), S(k)) to indicate the joint probability.

From the previous quantities, we can compute the information transfer between two given hosts (Fig. 6). It will be given by

\[ M(r,k) = H(r) + H(k) - H(r,k). \qquad (6) \]

The average information transfer will be computed from M_q = ⟨M(r, k)⟩, where the brackets indicate an average over a sample of q hosts randomly chosen from the whole set (here q = 100).
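The entropies and the information transfer defined above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, natural logarithms are used, and the probabilities are assumed to be estimated as relative frequencies over two queue-length time series of equal length.

```python
import math
from collections import Counter


def entropy(probabilities):
    """Shannon entropy (natural log) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probabilities if p > 0)


def binarize(queue_lengths):
    """Binary partition: S = 0 for an empty queue, S = 1 otherwise."""
    return [1 if n > 0 else 0 for n in queue_lengths]


def information_transfer(n_r, n_k):
    """M(r, k) = H(r) + H(k) - H(r, k) from two queue-length series.

    Assumption: probabilities are relative frequencies over the series.
    """
    s_r, s_k = binarize(n_r), binarize(n_k)
    total = len(s_r)
    p_r = [c / total for c in Counter(s_r).values()]
    p_k = [c / total for c in Counter(s_k).values()]
    p_rk = [c / total for c in Counter(zip(s_r, s_k)).values()]
    return entropy(p_r) + entropy(p_k) - entropy(p_rk)
```

Two hosts that are always free (or always jammed) give M = 0, while two perfectly correlated hosts that are jammed half of the time give M = log 2, the largest value this binary partition allows.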


Fig. 7. The variance plot for the L = 32 system. The critical point λ_c is perfectly indicated by this measure with a sharp maximum. Three different times have been used in the averages.

In the sub-critical domain, in terms of information transfer under the Markov partition, all pairs of nodes will typically be in the non-congested (free) state, so the joint probability P(i, j) is concentrated on the state (0, 0). It is then easy to see that in this phase we have vanishing entropies and the mutual information is small. The information is totally defined by the entropy of the single nodes, as far as the correlations are trivial. A similar situation holds in the congestion phase, where nodes are typically congested. At intermediate λ values, the fluctuations inherent to the system lead to a diversity of states that gives a maximum information transfer at some λ*_c. It should be noted, however, that this measure is not very good for small systems, where λ*_c > λ_c, but we can see that λ*_c → λ_c as L increases.

Unpredictability will be measured, following Nagel and Rasmussen [13], by means of the normalized variance of latencies:

\[ \sigma(T_L) = \frac{\left[ \langle (T_L - \langle T_L \rangle)^2 \rangle \right]^{1/2}}{\langle T_L \rangle}, \qquad (7) \]

where ⟨T_L⟩ is the average over a given number of steps.

The unpredictable nature of the critical point is sharply revealed by the plot of the variance σ(T_L) (Fig. 7). We can see that, as was shown by Nagel and co-workers for highway traffic, the system shows the highest unpredictability close to the critical point. In the subcritical regime λ < λ_c, the packets reach their destinations in a time close to the characteristic, average travel time. This situation changes sharply in the neighborhood of λ_c, where the fluctuations (experienced as local congestion) lead to a rapid increase in the variance. As λ grows beyond the transition, these fluctuations are damped and σ(T_L) decays slowly. This result also confirms the study by Nagel and co-workers, who analyzed the behavior of the variance of travel times [13] for a closed-loop system. They found a nontrivial implication of this result: increasing efficiency (i.e., traffic flow) tunes the system to criticality and, as a consequence, to unpredictable behavior.
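As a hedged illustration of the normalized variance of latencies, the sketch below computes the standard deviation of a list of measured latencies divided by their mean. The function name is hypothetical, and using the population (rather than sample) variance is an assumption.

```python
import math


def normalized_variance(latencies):
    """Standard deviation of the latencies divided by their mean."""
    mean = sum(latencies) / len(latencies)
    variance = sum((t - mean) ** 2 for t in latencies) / len(latencies)
    return math.sqrt(variance) / mean
```

Identical latencies give a value of zero, which corresponds to the free regime where every packet takes roughly the characteristic travel time; broadly spread latencies give large values.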


5. Discussion

In this paper we have analyzed the statistical properties of a computer network traffic model. This model is a simple extension of the OS system, but with a random distribution of hosts scattered through the lattice ℒ with a density ρ. One of the goals of our study was to see if the reported regularities from real networks of computers (such as the Internet) were similar to those observed in highway traffic and reproducible by our model. The second was to explore the possibility that the observed features correspond to those expected from a near-to-critical system. We have presented evidence that real Internet traffic takes place close to a phase transition point, although further work is needed involving the scale-free network topology of the real web. However, our results provide preliminary support for the presence of critical phenomena in parallel arrays of computers.

The model has been shown to match some basic properties of Internet dynamics: (i) it shows self-affine patterns of activity close to criticality, consistent with the fractal nature of computer traffic; (ii) the observed time series display 1/f behavior and the corresponding Hurst exponents reveal the presence of persistence and long-range correlations in congestion dynamics, as reported from real data; (iii) the distribution of latency times close to the transition point is a lognormal, and the distribution of queue lengths approaches power laws with some bending for small lengths (as in highway traffic models).

The model confirms the previous conjecture [7] suggesting some deep links between the NS model and the dynamics displayed by computer networks close to critical points. In this sense, the previous measures and other quantitative characterizations support the idea that the two types of traffic share some generic features. The model also exhibits the same kind of variance plot shown by the NS and related models: it is almost zero in the subcritical (free) regime and it grows abruptly close to λ_c. This leads to the same conclusion pointed out by Nagel and co-workers: maximum efficiency leads to complex dynamics and unpredictable behavior.

Some authors have discussed the origins of Internet congestion in terms of the interactions among users [12]. Huberman and Lukose suggested that this is a particularly interesting illustration of a social dilemma. Our study suggests a somewhat complementary view: there is a feedback between the system's activity and the users' behavior. Users introduce new packets into the system, thus enhancing the congestion of the net. As congestion increases, users tend to leave the net, thus reducing local activity. This type of feedback is similar to the dynamics characteristic of self-organized critical systems (such as sandpiles) [18]. The main difference arises from the driving. Activity is being introduced into the system without a complete temporal separation between two scales. In this sense, this is not a self-organized critical system, but it is close enough for that to be the appropriate theoretical framework. An immediate extension of this model should contain a self-tuning of λ: users might increase their levels of activity if congestion is low and decrease it (or leave the system) in a very congested situation. In this way, the system might self-organize into the critical state.


Our previous results can also be applied to other, similar networks. This is the case of large, parallel arrays of processors. In this sense, some previous studies [19] have shown the validity of the OS model in describing the overall dynamics of small arrays of processors with simple topologies. They also found some additional phenomena such as the presence of hot spots, which we have also observed in our model. This work can be extended to high-dimensional parallel systems such as the connection machine [20] in order to test the presence of phase transitions and their dependence on dimensionality (in particular, this will help to determine the upper critical dimension for this system).

Internet dynamics and the WWW growth provide an extremely interesting, real evolution experiment of a complex adaptive system [12]. In the near future, we are likely to see new types of behavior in the web. As Daniel Hillis predicts, as the information available on the Internet becomes richer, and the types of interactions among computers become more complex, we should expect to see new emergent phenomena going beyond any that have been explicitly programmed into the system [21]. Models based on phase transitions in far-from-equilibrium systems will be of great help in providing an appropriate theoretical framework.

Acknowledgements

We thank B. Luque for many useful discussions and his earlier participation in this work. This work has been supported by a grant PB97-0693 and by the Santa Fe Institute (RVS).

References

[1] B. Huberman (Ed.), The Ecology of Computation, North-Holland, Amsterdam, 1989.
[2] J.O. Kephart, T. Hogg, B. Huberman, Phys. Rev. A 40 (1989) 404.
[3] R. Albert, H. Jeong, A.-L. Barabási, Nature 406 (2000) 378.
[4] M. Takayasu, K. Fukuda, H. Takayasu, Physica A 274 (1999) 140.
[5] M. Takayasu, H. Takayasu, K. Fukuda, Physica A 277 (2000) 248.
[6] W.E. Leland, M.S. Taqqu, W. Willinger, IEEE Trans. Networking 2 (1994) 1.
[7] I. Csabai, J. Phys. A: Math. Gen. 27 (1994) L417.
[8] K. Nagel, M. Schreckenberg, J. Phys. I France 2 (1992) 2221.
[9] K. Nagel, M. Schreckenberg, J. Phys. A 26 (1993) L679.
[10] K. Nagel, M. Paczuski, Phys. Rev. E 51 (1995) 2909.
[11] M. Takayasu, H. Takayasu, T. Sato, Physica A 233 (1996) 824.
[12] B.A. Huberman, R.M. Lukose, Science 277 (1997) 535.
[13] K. Nagel, S. Rasmussen, in: R.A. Brooks, P. Maes (Eds.), Artificial Life IV, MIT Press, Cambridge, MA, 1994, p. 222.
[14] T. Ohira, R. Sawatari, Phys. Rev. E 58 (1998) 193.
[15] H. Fukś, A.T. Lawniczak, preprint adap-org/9909006.
[16] R.V. Solé, S.C. Manrubia, B. Luque, J. Delgado, J. Bascompte, Complexity 1(4) 1996.
[17] R.V. Solé, O. Miramontes, Physica D 80 (1995) 171.
[18] P. Bak, C. Tang, K. Wiesenfeld, Phys. Rev. Lett. 59 (1987) 381.
[19] K. Bolding, M.L. Fulgham, L. Snyder, Technical Report CSE-94-02-04.
[20] W.D. Hillis, The Connection Machine, MIT Press, Cambridge, MA, 1985.
[21] W.D. Hillis, The Pattern on the Stone, Weidenfeld and Nicolson, London, 1998.


4.4 Self-organized Critical Traffic in Parallel Computer Networks


Physica A 312 (2002) 636–648
www.elsevier.com/locate/physa

Self-organized critical traffic in parallel computer networks

Sergi Valverde^a, Ricard V. Solé^{a,b,∗}

^a ICREA-Complex Systems Lab, Universitat Pompeu Fabra-IMIM, Dr. Aiguader 80, 08003 Barcelona, Spain

^b Santa Fe Institute, 1399 Hyde Park Road, NM 87501, USA

Received 12 February 2002

Abstract

In a recent paper, we analysed the dynamics of traffic flow in a simple, square lattice architecture. It was shown that a phase transition takes place between a free and a congested phase. The transition point was shown to exhibit optimal information transfer and wide fluctuations in time, with scale-free properties. In this paper, we further extend our analysis by considering a generalization of the previous model in which the rate of packet emission is regulated by the local congestion perceived by each node. As a result of the feedback between traffic congestion and packet release, the system is poised at criticality. Many well-known statistical features displayed by Internet traffic are recovered from our model in a natural way. © 2002 Published by Elsevier Science B.V.

PACS: 87.10.+e; 0.5.50.+q; 64.60.Cn

1. Introduction

Statistical physics has been shown to be a powerful approach to the analysis of network dynamics. Scaling concepts have provided the framework to understand, for example, the origin of scale-free properties of the Internet [1,2]. In this context, simple models of network growth reveal that the scale-free nature of the web is an emergent pattern resulting from the mechanisms of growth plus preferential attachment of links. As a result of this process, the topology of the web provides a source of robustness against random removal of nodes and, simultaneously, an intrinsic fragility

∗ Corresponding author. Santa Fe Institute, 1399 Hyde Park Road, NM 87501, USA. E-mail address: [email protected] (R.V. Solé).

0378-4371/02/$ - see front matter © 2002 Published by Elsevier Science B.V. PII: S0378-4371(02)00872-5


S. Valverde, R.V. Solé / Physica A 312 (2002) 636–648

against intentional attack. These properties and others (such as the spread of computer viruses [3,4]) are associated with the presence of phase transitions in network topology or dynamics.

Beyond the topological features exhibited by these networks, a different (but obviously related) problem concerns the dynamics of information flow among their units. In this context, the use of nonlinear dynamics techniques has contributed to the development of the field of computational ecologies (CE). These CE are defined in terms of distributed parallel processing in large computer networks [5,6]. As Huberman and co-workers have shown, the interactions arising in CE lead to “self-regulating computational entities very different in nature from their individual components”. Since the discovery of the self-similar patterns displayed by the time fluctuations of packets in computer networks [7–9], many further studies have supported the view that Internet traffic might be related to the presence of near-critical dynamics. In this context, dynamic phase transitions have been observed to occur in the traffic going through a link [10,11].

Heavy-tailed distributions are observable in most characteristic features of computer network dynamics, from queue lengths to latency times [10,12]. Many dedicated studies have been devoted to the analysis of (multi)fractal features of traffic [13], although most of them lack an explanatory framework for the origin of the self-similarity, since no microscopic approach to the dynamics is considered in an explicit form. In this context, complex fluctuations displaying self-similar behaviour are often (though not always) related to the presence of criticality. In order to test this scenario, appropriate models of traffic dynamics are required. In two recent papers [14,15], this possibility has been explored, inspired by previous models of vehicular traffic [16,17]. The two models used a very simple square-lattice structure.

Although not realistic as a model of Internet topology [1,2], these architectures have already been used in real parallel multiprocessor nets [18–20]. Apart from the torus topology, meshes, hypercubes [21] and hierarchic trees [22] are also common. In these multiprocessor networks, the routing algorithm addresses objectives shared with distributed networks (local and wide area networks): minimization of packet delivery time, or latency, and maximization of throughput.

Besides the similarities found in these networks, an important difference between parallel computer networks and distributed networks (such as the Internet) arises: multiprocessor networks use blocking routing algorithms in which messages that cannot be forwarded to the next router are blocked until they can be serviced. On the other hand, in distributed networks it is quite common to discard packets if there is not enough room available at the router [23]. This problem was avoided here by allowing infinite memory at every router of our model. We recognize the importance of realistic memory constraints at routers but, in the context of our present study, we suspect that those memory limitations only introduce finite size effects. Some other studies show that in some circumstances, even if a huge memory is available, congestion will continue to appear [24].

Parallel multiprocessor networks have been shown to display complex dynamics and a phase transition separating a congested from a non-congested phase, both in the real


[25] and the simulated counterparts. However, our previous model lacks an important ingredient present in the Internet: the reaction of nodes to congestion. Specifically, users will increase the number of packets in the system depending on the degree of congestion that they perceive. In a related context, Huberman and Lukose already explored this problem within the context of social dilemmas: the actions of individuals lead to a negative effect on the network performance, which feeds back into users [26].

This notion of load control in the presence of congestion is very important in computer networks, and it is commonly accepted that this mechanism ensures fair access to the shared resources the network offers to its users [23]. That is, misbehaving users should be constrained because they introduce more load and more congestion. In other words, a feedback between the system's state (number of packets) and the rate of packet release by users must be present. We can easily identify these ingredients as order and control parameters, respectively. Their interaction can lead to a system poised close to a critical state. In this paper we explicitly explore this possibility by means of a simple model of traffic flow on a square lattice with periodic boundary conditions.

2. Traffic dynamics: model

Our goal is to construct a minimal, microscopic model of traffic able to recover some of the basic features exhibited by real computer networks. Additionally, our model should be able to show whether the networks can be self-organized in such a way that local hosts self-regulate their throughput depending on local congestion.

Following the previous approaches [27,15], let us consider a network defined on a two-dimensional lattice ℒ(L) formed by L × L nodes, with four nearest neighbours per node. The model considers two types of nodes: hosts and routers.¹ The first are nodes that can generate and receive messages and the second can only store and forward messages. All our simulations are performed using periodic boundary conditions. As mentioned in the previous section, although this might seem a limited topological arrangement, it has been successfully tested on real hardware and the presence of a phase transition fully confirmed [25].

As in our previous analysis [15], only a fraction ρ of the nodes are hosts and the rest are routers. The location of each object, r ∈ ℒ(L), will be given by r = i c_x + j c_y, where c_x, c_y are Cartesian unit vectors. So the set of nearest neighbours C(r) is given by

\[ C(r) = \{\, r - c_x,\; r + c_x,\; r - c_y,\; r + c_y \,\}. \qquad (1) \]
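On a lattice with periodic boundary conditions, the neighbour set can be sketched as below. The function name and the representation of a node as an index pair (i, j) with 0 ≤ i, j < L are assumptions made only for illustration.

```python
def neighbours(i, j, L):
    """Four nearest neighbours of node (i, j) on an L x L periodic lattice.

    Order: r - c_x, r + c_x, r - c_y, r + c_y.
    """
    return [((i - 1) % L, j), ((i + 1) % L, j),
            (i, (j - 1) % L), (i, (j + 1) % L)]
```

The modulo operation wraps indices around the edges, so a corner node such as (0, 0) still has exactly four neighbours.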

Each node maintains a queue of stored packets as they arrive at it. The local number of packets (the queue length Q) will be indicated as n(r, t) and thus the total number of

¹ Under the term router we group gateways, switches and routers. The term host is used to name all end-systems.


packets in the system will be

\[ N(t) = \sum_{r \in \mathcal{L}(L)} n(r,t) \qquad (2) \]

and the metric used in our system will be given by the Manhattan metric defined for lattices with periodic boundaries:

\[ d_{pm}(r_1, r_2) = L - \left|\, |i_1 - i_2| - \frac{L}{2} \right| - \left|\, |j_1 - j_2| - \frac{L}{2} \right|, \]

where r_k = (i_k, j_k).
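The periodic Manhattan metric can be checked numerically with a short sketch. The function names are hypothetical; the second function is the more familiar "shortest way around the ring" form, included here only to illustrate that the two expressions agree.

```python
def periodic_manhattan(r1, r2, L):
    """Closed-form Manhattan distance on an L x L periodic lattice."""
    (i1, j1), (i2, j2) = r1, r2
    return L - abs(abs(i1 - i2) - L / 2) - abs(abs(j1 - j2) - L / 2)


def periodic_manhattan_direct(r1, r2, L):
    """Same distance, taking the shorter way around each axis."""
    di = abs(r1[0] - r2[0])
    dj = abs(r1[1] - r2[1])
    return min(di, L - di) + min(dj, L - dj)
```

For example, on an L = 8 lattice the nodes (0, 0) and (7, 7) are at distance 2, since each coordinate is one step away through the boundary.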

In the previous model [15] the rate of traffic inserted into the network was a fixed, external parameter λ. The model displayed a phase transition at a certain value λ_c that results in two phases: the free or non-jammed phase and the jammed phase. Since the global activity, as measured in terms of the average number of packets ⟨N(t)⟩, acts as an order parameter and the driving is introduced through the (control) parameter λ, we conjectured that an appropriate feedback between the two of them would be able to self-organize the system into a critical state [29,30].

In order to test the previous conjecture, a new model will be introduced. The main idea here is that hosts can modify their rates of packet release depending on the local rate of congestion that they detect. To properly react to congestion, sources of traffic (hosts) must be informed in some way.

Two basic rules are used (and summarized in Fig. 1). The first describes the feedback existing between the rate of emission and the local congestion experienced by the host. The second describes the routing algorithm.

Fig. 1. Network model. Two types of nodes are considered: hosts (squares) and routers (open circles). The nodes are connected through bi-directional links. From top to bottom and left to right, the figure shows a sample sequence of routing steps of a packet that travels from host S to host D. The resulting path is marked with a thick black line. The counter attached to each outgoing link is updated every time a packet travels through it. The routing algorithm only allows minimal paths (bottom sequence) and avoids link overloading when several choices are possible (top sequence).


2.1. Rate control

Let us indicate by ξ the number of local congested neighbours:

\[ \xi = \sum_{k \in C(r)} \theta[n(k,t)], \qquad (3) \]

where θ[x] = 1 for x > 0 and zero otherwise. The rate of packet release changes with time and, for a given host located at r ∈ ℒ(L), is updated following:

\[ \lambda(r, t+1) = \min\{1,\; \lambda(r,t) + \mu\} \qquad (4) \]

if ξ < 4 and goes down to zero for ξ = 4. The hosts try to maximize their use of network resources, injecting more and more traffic into the network until congestion is detected. In our setting, each time step the local creation rate goes up by a fixed amount μ (here we take μ = 0.01) and drops to zero if the neighbouring nodes are congested. This rule is inspired by the “additive increase/multiplicative decrease” scheme [31], which is widely used in distributed networks.
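The rate-control rule can be sketched as follows. This is a minimal illustration under stated assumptions: queue lengths are held in a dictionary keyed by (i, j), the function names are hypothetical, and the increment is called mu with the value 0.01 quoted in the text.

```python
def congested_neighbours(queue, r, L):
    """Number of the four periodic neighbours of r with a non-empty queue."""
    i, j = r
    nbrs = [((i - 1) % L, j), ((i + 1) % L, j),
            (i, (j - 1) % L), (i, (j + 1) % L)]
    return sum(1 for k in nbrs if queue.get(k, 0) > 0)


def update_rate(rate, n_congested, mu=0.01):
    """Additive increase capped at 1; reset to zero when all four
    neighbours are congested."""
    if n_congested == 4:
        return 0.0
    return min(1.0, rate + mu)
```

A host surrounded by fewer than four congested neighbours ramps its release rate up by mu per step until it saturates at 1, and a fully surrounded host stops emitting at once.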

2.2. Routing

Each node picks up the packet at the head of its queue and decides which outgoing link is better suited to the packet destination. Consider that the packet is at node r = (r_x, r_y) and its final destination is a host at d = (d_x, d_y). Define v = (v_x, v_y) as v = d − r (v = 0 when the packet is at its destination). Otherwise, we have only two possibilities: going out through a horizontal link (r + c_x or r − c_x) or going out through a vertical one (r + c_y or r − c_y). Which one is chosen depends on which outgoing link takes the packet closer to its destination. In case v_x ≠ 0 and v_y = 0 (Fig. 1, bottom left), the next hop m is m = r + c_x if v_x > 0 and m = r − c_x if v_x < 0. A similar rule is defined for the case v_x = 0 and v_y ≠ 0.

The general situation corresponds to v_x ≠ 0 and v_y ≠ 0, and in this case we look at the counters associated with the outgoing links to decide which one is chosen by the routing algorithm (Fig. 1, top left and top right). Each pair of neighbouring nodes r, r′ is connected through a pair of directed links ω(r, r′) and ω(r′, r). The strength of these links is increased by one unit each time a packet flows through them. If a packet flows from r → r′, then ω(r, r′) → ω(r, r′) + 1. The counter is thus used to remember how many packets have passed through that link. In order to avoid overloading a particular link, the router chooses the link with the minimum counter.
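A possible reading of the routing rule is sketched below. Several details are assumptions rather than statements from the text: the displacement is wrapped so that it is the shortest one on the periodic lattice, ties between equal counters are broken in favour of the horizontal link, link counters are stored in a dictionary keyed by (from, to), and all names are hypothetical.

```python
def wrapped(delta, L):
    """Signed displacement of smallest magnitude on a ring of length L."""
    delta %= L
    return delta - L if delta > L / 2 else delta


def next_hop(r, d, L, counters):
    """Choose the next node for a packet at r heading to d.

    Only moves that shorten the path are allowed; when both a horizontal
    and a vertical move qualify, the link with the smaller counter wins.
    The chosen link counter is incremented by one.
    """
    vx = wrapped(d[0] - r[0], L)
    vy = wrapped(d[1] - r[1], L)
    if vx == 0 and vy == 0:
        return None  # packet is already at its destination
    candidates = []
    if vx != 0:
        step = 1 if vx > 0 else -1
        candidates.append(((r[0] + step) % L, r[1]))
    if vy != 0:
        step = 1 if vy > 0 else -1
        candidates.append((r[0], (r[1] + step) % L))
    m = min(candidates, key=lambda node: counters.get((r, node), 0))
    counters[(r, m)] = counters.get((r, m), 0) + 1
    return m
```

Because the counters persist between calls, successive packets leaving the same node towards a diagonal destination tend to alternate between the horizontal and the vertical link.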

3. Mean field model

A simple mean field model can be obtained for the total density of packets Γ ≡ N(t)/L² and λ. The number of travelling packets increases as a consequence of the constant pumping from the hosts, which occurs at a rate ρλ. On the other hand, packets are removed from the system if the lattice is not too congested (i.e., if free space for movement is available) but accumulate as a consequence of already jammed nodes.


We can roughly state that once the number of packets exceeds the number of lattice sites, congestion will lead to packet accumulation.

In a previous study, Fukś and Lawniczak presented a simple argument to find the critical load rate λ_c [28]. Following Little's law from queuing theory [32], in the sub-critical (free) phase, the number of packets created per unit time (i.e., λL²) equals the number of packets delivered per unit time. If τ(L) indicates the average transit time, then

\[ \frac{N}{\tau} = \lambda L^2. \qquad (5) \]

Criticality is given by the condition N = L², and for that case λ_c = 1/τ(L). It can be shown that the average transit time in this phase is τ = L/2, and thus λ_c = 2/L, in very good agreement with numerical results [28].

The time evolution of the density of packets Γ will follow the mean field equation:

\[ \frac{d\Gamma}{dt} = \rho\lambda - \kappa q \Gamma (1 - \Gamma). \qquad (6) \]

The last term indicates the rate of removal, which is proportional to the number of input pathways (number of neighbours q) available to incoming packets. The rate κ is easily computed: it corresponds to the inverse of τ(L). For constant λ (the case already considered in our previous study) we have the fixed points Γ_± = [1 ± (1 − 4ρλ/κq)^{1/2}]/2. For λ > λ_c ≡ κq/4ρ, the fixed points vanish and no finite density exists. In this situation (which we labelled the congested phase) the density of packets grows without bounds. For λ < λ_c, a finite stable density

\[ \Gamma_- = \frac{1}{2}\left[ 1 - \left( 1 - \frac{4\rho\lambda}{\kappa q} \right)^{1/2} \right] \qquad (7) \]

is observable (the other fixed point Γ_+ is unstable). For the particular case ρ = 1 analysed by Fukś and Lawniczak [28], we recover their critical point λ_c = 2/L assuming that τ = L/2.
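The fixed points of the mean field equation can be evaluated with a short sketch. It assumes the balance between an injection term ρλ and a removal term κqΓ(1 − Γ), where Γ is the packet density, κ the inverse transit time and q the number of neighbours; these symbol names, like the function names, are choices made here for illustration.

```python
import math


def critical_rate(kappa, q, rho):
    """Largest release rate for which a finite stationary density exists."""
    return kappa * q / (4.0 * rho)


def fixed_points(lam, kappa, q, rho):
    """Stationary densities (stable, unstable), or None when congested."""
    disc = 1.0 - 4.0 * rho * lam / (kappa * q)
    if disc < 0:
        return None  # no finite density: congested phase
    root = math.sqrt(disc)
    return (1.0 - root) / 2.0, (1.0 + root) / 2.0
```

With ρ = 1, q = 4 and κ = 2/L, `critical_rate` reduces to 2/L, which is the critical point quoted from Fukś and Lawniczak.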

Now, assuming that a feedback exists between packet delivery rates and density, some finite equilibrium density Γ* will be achieved in the previous model and thus no divergence will be allowed to occur. In this case, the mean field model indicates that a scaling relation will be observed, i.e.,

\[ \lambda \sim \rho^{-1} \qquad (8) \]

between the (self-organized) packet release rate and the density of hosts (which is here the only relevant external parameter). As shown below, this scaling relation holds in the simulated model, together with the presence of several scaling properties consistent with a critical state.
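The scaling relation can be made explicit with a short worked step. The sketch below assumes the mean field balance between injection at rate ρλ and removal at rate κqΓ(1 − Γ), with Γ the packet density, κ the inverse transit time and q the number of neighbours (symbol names chosen here for illustration). Treating the self-organized density Γ* as independent of ρ is an additional simplifying assumption.

```latex
\begin{align}
0 &= \rho\lambda - \kappa q\,\Gamma^{*}\,(1-\Gamma^{*}) \\
\lambda &= \frac{\kappa q\,\Gamma^{*}\,(1-\Gamma^{*})}{\rho} \;\sim\; \rho^{-1}
\end{align}
```

Under these assumptions, doubling the density of hosts halves the average release rate that each host can sustain.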

4. Spatiotemporal dynamics

Let us first consider the spatiotemporal features exhibited by our model. In Fig. 2(A) and (B) we show examples of the time series obtained. Here the changes


Fig. 2. (A,B) Two examples of the time series displayed by the local creation rate λ at two given (host) lattice points, indicated by arrows on the network ℒ(L), where the periodic boundary conditions are explicitly displayed. Here: μ = 0.01, L = 64, and ρ = 0.0325. We can appreciate the wide fluctuations in λ, including periods of stasis at the maximum rate λ = 1. The network in (C) indicates the amount of local congestion at each node by means of a grey scale. Lighter nodes indicate larger congestion (using logarithmic scale).

in λ at two given hosts are shown. The upper plot (A) is rather characteristic: the host experiences periods of stasis with maximum throughput (λ = 1) and many spikes with a broad range of delivery rates. In (B) a node experiencing frequent congestion is shown. Other nodes in the system display very long periods with λ = 1 (not shown). In (C) the spatial pattern of activity is displayed. Here the topology of ℒ(L) is explicitly shown and each node is indicated as a square. The grey scale provides


[Fig. 3 plot: frequency (10^4 to 10^7) against local creation rate λ (10^-2 to 10^0) on log–log axes, for ρ = 0.20, ρ = 0.10 and ρ = 0.032.]

Fig. 3. Distributions of local creation rates for L = 64 at three diDerent host densities. We can see that awide range of values is observable, thus indicating that a high heterogeneity is present.

a measure of the queue length (here in log scale). Lighter squares indicate higher congestion.

The system displays a wide variability in space and time, in spite of the homogeneous nature of the rules. The quenched randomness present in the host distribution, together with the emergence and competition of different traffic pathways, creates a dynamic pattern of traffic flow and congestion. In this context, although the network self-organizes towards a given (average) λc, wide fluctuations in the loads are observable in different parts of the lattice. By analysing a given part of the net, we will observe that the load can be high or small, but scaling is always present, with different local values. This observation helps to understand the apparent disagreement [33] between the presence of a given average critical λ and the fact that scaling is observed in real networks with very different loads. Such heterogeneity is also present in our system and is well illustrated in Fig. 3, where the distribution of local creation rates N(λ) is displayed for three different host densities ρ. The peaks at λ = 1 indicate that, as a result of the feedback between congestion and delivery rate, the system is able to maintain a large fraction of hosts in an essentially non-congested state. This seems to be consistent with the high efficiency displayed by the original traffic model close to criticality [15].

5. Scaling laws

The first scaling relation to be checked is the one derived from the mean field approximation, relating λ and ρ. In Fig. 4 we show our results for an L = 64 lattice.


[Figure 4 plot: λ(ρ) versus ρ on log-log axes; fitted slope -1.03.]

Fig. 4. Scaling dependence between the average (critical) λ and host density ρ. Simulations have been performed on an L = 64 lattice. The dashed line displays the predicted functional relation as derived from the mean field theory. The average has been computed over T = 5 × 10^4 steps after 10^4 transients have been discarded.

Here 〈λ〉 was averaged (using all hosts) over T = 5 × 10^4 steps after 10^4 transients were discarded. A scaling relation is obtained

$$\langle \lambda \rangle \sim \rho^{-1.03 \pm 0.03}, \tag{9}$$

where the slope was estimated for ρ > 0.03. Below this host density, the fluid flow of packets guarantees that no congestion will typically occur, and thus the system does not reach the predicted critical load rate.
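The exponent in (9) is the slope of a straight line in log-log coordinates. The following is a minimal sketch of such an estimate, assuming an ordinary least-squares fit on log10 values; the paper does not state which fitting procedure was used, and the function name `fit_power_law` and the synthetic data are purely illustrative.

```python
import math


def fit_power_law(xs, ys):
    """Least-squares slope and intercept of log10(y) versus log10(x)."""
    lx = [math.log10(x) for x in xs]
    ly = [math.log10(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxx = sum((a - mean_x) ** 2 for a in lx)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Synthetic data obeying lambda = 0.02 / rho exactly (not simulation output).
rhos = [0.03, 0.05, 0.1, 0.2, 0.4]
lams = [0.02 / r for r in rhos]
slope, intercept = fit_power_law(rhos, lams)
print(f"estimated exponent: {slope:.3f}")
```

With real measurements the fit would be restricted to the scaling region (here ρ > 0.03), since points below it do not follow the power law.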

In order to confirm that the system is self-organized close to criticality, two relevant quantities can be analysed in relation to system size. These are the average transit time 〈τ(L)〉 and the packet density 〈N(L)〉. As discussed before, in the free phase (in the original model) the characteristic transit time will scale linearly with the system's size, while the number of packets will scale as N ∼ L^2 at criticality. Our results are shown in Fig. 5(A) and (B). The transit time scales linearly, with an exponent very close to one, and we also obtain a scaling relation: (Fig. 6)

$$\langle N(L) \rangle \sim L^{2.14 \pm 0.07} \tag{10}$$

consistent with the theoretical prediction. These results indicate that, in spite of the high heterogeneity displayed by the system, the average congestion rate is the one expected for the system at criticality.

Finally, we can also estimate the distribution N(D) of congestion duration lengths D. Here, D is defined as the time between two consecutive (non-contiguous) moments of no congestion. Here a node will be labelled as non-congested if n(r, t) ≤ 1. In a previous experimental study, Takayasu and co-workers showed that these distributions


[Figure 5 plots, log-log axes: (a) average transit time τ(L) versus L, fitted slope 1.11; (b) packet number N(L) versus L, fitted slope 2.14.]

Fig. 5. (a,b) Scaling behaviour in average transient times τ(L) and packet numbers N(L) for different lattice sizes (here L = 16, 32, 64, 126, 252, 512) with ρ = 0.10. As predicted by the critical condition, N ∼ L^2 and the transient time is linear with L.

[Figure 6 plot: frequency versus congestion duration length on log-log axes; reference slope -1.5.]

Fig. 6. Congestion duration length frequencies N(D) measured during T = 5 × 10^4 steps at hosts (L = 64) and under different host densities (ρ = 0.1 and 0.2). Both distributions follow power laws N(D) ∼ D^(−ε) with the same exponent ε ≈ 1.5, in agreement with analyses of real traffic (see text).


display a scaling N(D) ∼ D^(−ϑ) with ϑ ≈ 1.5–2.0, measured at different flow densities (close to the critical-supercritical domain). In our model, we obtain a scaling

$$N(D) \sim D^{-1.66 \pm 0.23}, \tag{11}$$

where the slope has been estimated for 10^1 < D < 10^3 using different host densities. Again, in spite of the quantitative differences arising from local traffic flow, the presence of scaling is widespread and consistent with (globally tuned) criticality.
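The congestion duration D defined above can be extracted from the queue-length time series of a single node. The sketch below is one possible reading of that definition, assuming D is the difference between two consecutive non-congested time steps that are not adjacent; the function name `congestion_durations` and the sample series are hypothetical.

```python
from collections import Counter


def congestion_durations(queue_lengths, threshold=1):
    """Return the durations D between consecutive, non-contiguous
    non-congested moments. A node counts as non-congested when its
    queue length n satisfies n <= threshold (threshold = 1 in the text)."""
    durations = []
    last_free = None  # time of the most recent non-congested moment
    for t, n in enumerate(queue_lengths):
        if n <= threshold:
            # Adjacent free moments (gap of 1) enclose no congestion.
            if last_free is not None and t - last_free > 1:
                durations.append(t - last_free)
            last_free = t
    return durations


# Hypothetical queue-length series for one node.
series = [0, 2, 3, 1, 1, 5, 0, 4, 4, 4]
durations = congestion_durations(series)
histogram = Counter(durations)  # N(D): how often each duration occurs
```

A congestion episode that is still open at the end of the series is ignored, because its duration is not yet known.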

6. Discussion

In this paper, we have extended our previous approach to the traffic dynamics on parallel networks by including a simple feedback mechanism between packet release and net congestion. The main goal of our study was to show that under this feedback control, the network self-organizes into a critical state characterized by scaling in several relevant quantities such as congestion duration lengths. Using a simple mean field model, it was shown that at the stationary state the average packet release should scale as the inverse of the host density. This scaling behaviour has been confirmed by the simulation model. Although our results are inspired by a particular spatial arrangement (characteristic of real parallel computer designs), we think that some of these ideas could easily be extended to Internet dynamics. The general validity of our approach to different types of nets has already been suggested by some authors [34].

The model shows a considerable spatial heterogeneity: different characteristic patterns of activity are identified at different locations, consistent with observed networks. In this sense, although the average packet release reaches a steady (critical) state, this does not mean that the same levels of congestion are reached everywhere. On average, the system is able to maintain the number of packets circulating close to the critical value N ≈ Nc = L^2 and the average transit time is also shown to scale as τ ∼ L, consistent with previous predictions for systems close to the phase transition point [28].

Our model is oversimplified in a number of ways, especially when considering the square topology and the simplified nature of the rules. Conventional Internet protocols (TCP/IP) interpret packet losses as an implicit signal of congestion [23]: packets are dropped due to limited storage at intermediate nodes. In our setting, there cannot be packet losses since queues are unbounded. Note that memory constraints will no longer be the main bottleneck in the future [24], as router capacity is continuously increasing. In spite of this, real protocols have already considered the possibility of sending an explicit congestion signal, motivated by different considerations [35]. Since latency is directly related not only to the number of hops between source and destination, but also to the current load (local number of packets) at traversed routers, it is reasonable to keep queue lengths below some threshold [36]. For simplicity, we consider in our model that a router is congested if its queue length is greater than one packet.

Also, there is the question of whether the congestion state information must be exchanged between distant pairs of source and destination hosts (as in classic TCP/IP) or in a more local fashion. In general, it is believed that no single scheme will be able to address all


congestion patterns observed in a real computer network. In order to avoid introducing 'artificial' long range correlations between distant hosts of the system, we considered a local exchange of congestion information. We only allow the hosts to gather and interpret the congestion state of their neighbour nodes, because only this kind of node can regulate its packet injection rate. Surprisingly, we still observe synchronization effects in the congestion state of distant nodes.

Some authors have repeatedly argued that computer networks cannot be understood in terms of this type of analysis. It is said that these nets involve a high degree of complication and fine structure that has to be taken into account [33]. This claim is true but obvious, and can be used in any other context. However, simple approaches based on statistical physics have been successful in gathering real understanding of how complex systems behave from microscopic rules. The current wave of new quantitative results on Internet topology and dynamics indicates that a new area of research has emerged at the interplay of physics, graph theory and technology.

Acknowledgements

This work has been supported by a grant PB97-0693 and by the Santa Fe Institute (RVS).

References

[1] R. Albert, H. Jeong, A. Barabasi, Nature 401 (1999) 130; R. Albert, A. Barabasi, Science 286 (1999) 510.
[2] G. Caldarelli, R. Marchetti, L. Pietronero, Europhys. Lett. 52 (2000) 386.
[3] R. Pastor-Satorras, A. Vespignani, Phys. Rev. Lett. 86 (2001) 066117.
[4] A. Lloyd, R.M. May, Science 292 (2001) 1316–1317.
[5] B. Huberman (Ed.), The Ecology of Computation, North-Holland, Amsterdam, 1989.
[6] J.O. Kephart, T. Hogg, B.A. Huberman, Phys. Rev. A 40 (1989) 404.
[7] W.E. Leland, M.S. Taqqu, W. Willinger, IEEE Trans. Networking 2 (1994) 1.
[8] I. Csabai, J. Phys. A: Math. Gen. 27 (1994) L417.
[9] M. Takayasu, H. Takayasu, T. Sato, Physica A 233 (1996) 824.
[10] M. Takayasu, K. Fukuda, H. Takayasu, Physica A 274 (1999) 248.
[11] M. Takayasu, K. Fukuda, H. Takayasu, Physica A 277 (2000) 248.
[12] A.E. Crovella, A. Bestavros, IEEE Trans. Networking 5 (1997) 835.
[13] W. Willinger, M.S. Taqqu, R. Sherman, D.V. Wilson, IEEE Trans. Networking 5 (1997) 71.
[14] T. Ohira, R. Sawatari, Phys. Rev. E 58 (1998) 193.
[15] R.V. Solé, S. Valverde, Physica A 289 (2001) 595.
[16] K. Nagel, M. Paczuski, Phys. Rev. E 51 (1995) 2909.
[17] K. Nagel, S. Rasmussen, in: R.A. Brooks, P. Maes (Eds.), Artificial Life IV, MIT Press, Cambridge, MA, 1994, p. 222.
[18] H. Li, M. Manesca, IEEE Trans. Comput. 38 (1989) 1345.
[19] C. Germain-Renaud, J.P. Sansonnet, Ordinateurs Massivement Paralleles, Armand Colin, Paris, 1991.
[20] V.M. Milutinovic, Computer Architecture, North-Holland, Elsevier, Amsterdam, 1988.
[21] W.D. Hillis, The Connection Machine, MIT Press, Cambridge, MA, 1985.
[22] A. Arenas, A. Diaz-Guilera, R. Guimerà, Phys. Rev. Lett. 86 (2001) 3196.
[23] V. Jacobson, M.J. Karels, Proceedings of the SIGCOMM, 1988, p. 314.
[24] R. Jain, IFIP TC6, Proceedings of the Fourth Conference on Information Networks and Data Communication, Finland, 1992.
[25] K. Bolding, M.L. Fulgham, L. Snyder, Technical Report CSE-94-02-04.
[26] B.A. Huberman, R.M. Luckose, Science 277 (1997) 535.
[27] T. Ohira, R. Sawatari, Phys. Rev. E 58 (1998) 193.
[28] H. Fukś, A.T. Lawniczak, preprint adap-org/9909006.
[29] P. Bak, C. Tang, K. Wiesenfeld, Phys. Rev. Lett. 59 (1987) 381.
[30] H.J. Jensen, Self-Organized Criticality, Cambridge University Press, Cambridge, 1998.
[31] D.M. Chiu, R. Jain, Comput. Networks ISDN Systems 17 (1989) 1.
[32] R. Nelson, Probability, Stochastic Processes and Queuing Theory, Springer, New York, 1995.
[33] W. Willinger, R. Govindan, S. Jamin, V. Paxson, S. Shenker, Proc. Natl. Acad. Sci. USA 99 (2002) 2573.
[34] L. Zhang, S. Shenker, D. Clark, ACM Comput. Commun. Rev. (1991).
[35] R. Jain, K.K. Ramakrishnan, Proceedings of the IEEE Computer Networking Symposium, Washington, DC, April 1988, p. 134.
[36] S. Floyd, V. Jacobson, IEEE/ACM Trans. Networking (1993).


96 CHAPTER 4. ARTICLES

4.5 Internet’s Critical Path Horizon


Eur. Phys. J. B 38, 245–252 (2004)
DOI: 10.1140/epjb/e2004-00117-x
THE EUROPEAN PHYSICAL JOURNAL B

Internet’s critical path horizon

S. Valverde¹ and R.V. Solé¹,²,ᵃ

¹ ICREA-Complex Systems Lab, Universitat Pompeu Fabra, Dr Aiguader 80, 08003 Barcelona, Spain
² Santa Fe Institute, 1399 Hyde Park Road, New Mexico 87501, USA

Received 12 November 2003 / Received in final form 22 December 2003
Published online 14 May 2004 – © EDP Sciences, Società Italiana di Fisica, Springer-Verlag 2004

Abstract. The Internet is known to display a highly heterogeneous structure and complex fluctuations in its traffic dynamics. Congestion seems to be an inevitable result of users' behavior coupled to the network dynamics, and its effects should be minimized by choosing appropriate routing strategies. But what are the requirements of routing depth in order to optimize the traffic flow? In this paper we analyse the behavior of Internet traffic with a topologically realistic spatial structure as described in a previous study [S.-H. Yook et al., Proc. Natl Acad. Sci. USA 99, 13382 (2002)]. The model involves self-regulation of packet generation and different levels of routing depth. It is shown that it reproduces the key statistical features of Internet traffic. Moreover, we also report the existence of a critical path horizon defining a transition from low-efficient traffic to highly efficient flow. This transition is actually a direct consequence of the web's small world architecture exploited by the routing algorithm. Once routing tables reach the network diameter, the traffic experiences a sudden transition from a low-efficient to a highly efficient behavior. It is conjectured that routing policies might have spontaneously reached such a compromise in a distributed manner. The Internet would thus be operating close to such a critical path horizon.

PACS. 89.75.-k Complex systems – 05.70.Ln Nonequilibrium and irreversible thermodynamics – 87.23.Ge Dynamics of social systems

1 Introduction

The efficient performance of any communication network is jeopardized by congestion problems, which often show up in unpredictable ways. This seems to be the case, for example, of so-called Internet storms [2]. Such problems were identified early in different types of engineered networks. Norbert Wiener, for example, mentions that "a switching service involving many stages and designed for a certain level of failure shows no obvious signs of failure until the traffic comes up to the edge of the critical point, when it goes completely into pieces, and we have a catastrophic traffic jam" [3]. These observations allow us to formulate a number of key questions concerning communication nets, such as: How are critical traffic levels reached? What is the nature of these thresholds? How do appropriate routing algorithms modify this behavior? Are there optimal routing strategies?

Modeling Internet dynamics has been an active area over the last decade. The approaches include detailed simulations [4], simple statistical models and verbal models [5]. In this context, in [6] we investigated the existence of a jamming phase transition between the free phase and the congested phase in a model of network traffic over regular meshes. This phase transition depends on traffic density. It was conjectured that large-scale network traffic self-organizes near the critical point of the transition, which is linked to high network efficiency and unpredictability [7]. At the critical regime, the distribution of congestion duration lengths scales as a power-law [7,9].

a e-mail: [email protected]

Such a self-organized scenario is reinforced by a recent study suggesting that Internet fluctuations are a consequence of the internal dynamics of the system [10]. Against previous claims, there is mounting empirical support that network traffic heterogeneity is a consequence of collective dynamics and not of the high variability injected by external sources.

In this paper we will show that the Internet is able to route its inner flow of packets efficiently because of a special combination of local routing rules and a particular network architecture. As we will see, Internet routing reaches an optimal, low-cost traffic flow as a result of a trade-off between random and deterministic routing schemes.

2 Modelling Internet’s traffic

Following previous approaches [6,7], let us consider a network defined on a graph Ω with M nodes (Fig. 1). The number of links connecting a given node s_i with other nodes (irrespective of their characteristics) will be indicated as k_i.


Fig. 1. Small Internet-like network (see text for generation method description). The parameters have been set to the position corresponding to the Internet in the (D_f, α, σ) phase space: M = 250, 〈k〉 = 4, D_f = 1.5, α = 1, σ = 1.

The subset of closest nodes of s_i ∈ Ω is C_i. Only a fixed fraction ρ ∈ (0, 1] of the nodes is selected as source and destination traffic endpoints (hosts). Host locations are chosen randomly. The other (1 − ρ)M nodes can only store and forward incoming messages (routers).

Both hosts and routers can route only one packet at a time (the routing policy is described below). Each node s_i is provided with a finite queue of packets waiting for communication resources (free links). Queues are not allowed to store more than H packets simultaneously. New packets arriving at an already saturated queue are simply removed from the system. This provides a very simple dissipative mechanism useful for recovering from heavy traffic congestion. If n(s_i, t) is the number of packets at s_i, the total number of packets in the system will be

$$N(t) = \sum_{s_i \in \Omega} n(s_i, t) \tag{1}$$

which is the key quantity to be analysed here.

Any microscopic host behavior compatible with the fluid scenario must ensure a controlled injection of packets in order not to flood the network (and thus enter the congested state). Note that in a fluid traffic regime it is unlikely that packets will be alive for an arbitrarily long time. A fluidity requirement will necessarily impose a hard constraint on the maximum time spent by a packet travelling across the network. The simplest way of ensuring fluid packet flow is to stop the sources emitting new packets when they detect local congestion, that is, when there is no empty space in the source neighborhood C_i to fill with new packets. If no self-regulation is present, it has been shown that there is a sudden and sharp jamming transition from the free to the congested state depending on the mean (system) packet generation rate 〈λ〉. At the critical point between the two phases the system displays optimum (global) performance. If self-regulation is allowed, it can be shown [7] that simple traffic source rules are able to self-organize the traffic network around a definite mean rate 〈λ〉, which scales as the quotient between the average system load 〈N〉 and the averaged packet latency 〈T〉:

$$\lambda_C \approx \frac{\langle N \rangle}{\langle T \rangle} \tag{2}$$

where the latency is defined as the time elapsed from the creation of a packet until its delivery at the destination host. The mean latency 〈T〉 is averaged over all successfully released packets. Following [7] we will indicate by ξ the number of local congested neighbors:

$$\xi = \sum_{k \in C_i} \theta[n(k, t)] \tag{3}$$

where θ[x] = 1 for x > 0 and zero otherwise. The packet injection rate for host i is updated as follows:

$$\lambda_i(t+1) = \begin{cases} \min\{1, \lambda_i(t) + \mu\} & \xi = 0 \\ \lambda_i(t) & 0 < \xi < k_i \\ 0 & \xi = k_i \end{cases} \tag{4}$$

where μ is the so-called driving parameter. The second rule allows a particular host to stabilize around a given rate. The traffic rate increases conservatively and drops to zero when all neighboring nodes are congested. The reader must be aware that the above rules are not intended to be a detailed model of real traffic sources. Traffic sources cannot be described with a universal probability distribution. Moreover, it seems that a rich variety of distributions (exponential, bi-modal or log-normal) apply. As we will see later in the paper, it is unlikely that the variability of traffic sources is the cause of the scaling detected in Internet traffic. Moreover, this model does not introduce any explicit correlations between sources (e.g. as in an active conversation between two hosts), so the dynamics cannot be the by-product of pre-defined source correlations. Because the system is poised at criticality (the so-called fluid packet flow regime) the global dynamics emerge from the collective behavior of its components. Ultimate source details are irrelevant in this context. Moreover, although TCP is more realistic than our simplified local rules (and also much more complex), our conjecture is that the large-scale dynamics will not be greatly changed by the local behavior of hosts within this fluid flow regime. This idea is partly supported by a recent work [8] in which it has been shown that IP level traffic (packet dynamics) does not depend to any significant extent on the TCP arrival process (host behavior), supporting our view that traffic dynamics is not a consequence of detailed source (host) behavior.
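The three-case update rule for the injection rate can be written as a short function. This is a minimal sketch, assuming queues are stored in a dictionary keyed by node label; the names `congested_neighbours` and `update_rate` are illustrative and do not come from the paper.

```python
def congested_neighbours(queue, neighbours):
    """xi: number of neighbours holding at least one packet,
    i.e. the sum of theta[n] with theta[x] = 1 for x > 0."""
    return sum(1 for k in neighbours if queue[k] > 0)


def update_rate(lam, xi, degree, mu):
    """One step of the injection-rate rule for a single host."""
    if xi == 0:
        # No congested neighbour: increase the rate, capped at 1.
        return min(1.0, lam + mu)
    if xi < degree:
        # Some, but not all, neighbours congested: keep the rate.
        return lam
    # All neighbours congested: stop injecting packets.
    return 0.0


# Hypothetical host with four neighbours, two of them holding packets.
queue = {"a": 0, "b": 3, "c": 1, "d": 0}
xi = congested_neighbours(queue, ["a", "b", "c", "d"])
new_rate = update_rate(0.5, xi, degree=4, mu=0.01)
```

Because the rate grows in steps of μ but collapses to zero in a single step, a host recovers slowly after a fully congested neighbourhood clears.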

For a homogeneous network with average degree 〈k〉 (and finite 〈k^2〉) it can be shown, following [7], that the time evolution of the packet density Γ(t) = N(t)/M in the limit H → ∞ is defined by the mean field equation:

$$\frac{d\Gamma}{dt} = \rho\lambda - \frac{\langle k \rangle}{D}\,\Gamma(1 - \Gamma) \tag{5}$$

where D indicates the network diameter (also the average transient time, assuming packets jump from node to node with the same average time scale). It is not difficult to show that the equilibrium points of the previous equation are given by Γ± = [1 ± (1 − 4ρλD/〈k〉)^(1/2)]/2. For λ > λc ≡ 〈k〉/4Dρ, the fixed points vanish and no finite density exists. In this "congested phase" the density of packets grows without bounds. For λ < λc, a finite stable density

$$\Gamma_- = \frac{1}{2}\left[1 - \left(1 - \frac{4\rho\lambda D}{\langle k \rangle}\right)^{1/2}\right] \tag{6}$$

is observable (the other fixed point Γ+ is unstable). For the particular case ρ = 1 analysed by Fukś and Lawniczak [12], we recover their critical point λc = 2/L for dynamics taking place on a square lattice, i.e. D = L/2. If a feedback exists between λ and Γ, some finite equilibrium density Γ* will be achieved in the previous model and thus no divergence will be allowed to occur. In this case, the mean field model indicates that a scaling relation will be observed, i.e.

$$\lambda \sim \rho^{-1} \tag{7}$$

between the (self-organized) packet release rate and the density of hosts (which is here the only relevant external parameter). Such a scaling relation was shown to occur in the previous model.
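The critical rate and the two equilibrium densities of the mean field equation (5) can be evaluated directly. The sketch below assumes the square-lattice values 〈k〉 = 4 and D = L/2 quoted in the text; the function names and the choice L = 64 are illustrative.

```python
import math


def critical_rate(k_mean, diameter, rho):
    """lambda_c = <k> / (4 D rho)."""
    return k_mean / (4.0 * diameter * rho)


def fixed_points(lam, k_mean, diameter, rho):
    """Equilibria (Gamma_minus, Gamma_plus) of
    dGamma/dt = rho*lam - (<k>/D) * Gamma * (1 - Gamma),
    or None when lam exceeds lambda_c (congested phase)."""
    disc = 1.0 - 4.0 * rho * lam * diameter / k_mean
    if disc < 0:
        return None
    root = math.sqrt(disc)
    return (1.0 - root) / 2.0, (1.0 + root) / 2.0


# Square lattice with L = 64 and every node acting as a host (rho = 1).
L = 64
lam_c = critical_rate(k_mean=4, diameter=L / 2, rho=1.0)  # equals 2 / L
below = fixed_points(0.02, 4, L / 2, 1.0)  # lam < lam_c: two equilibria
above = fixed_points(0.04, 4, L / 2, 1.0)  # lam > lam_c: none
```

The smaller root is the stable density because the derivative of the right-hand side of (5) with respect to Γ, −(〈k〉/D)(1 − 2Γ), is negative for Γ < 1/2.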

The previous mean field calculation was done by assuming that a fixed λ is being used. The previous rules actually introduce self-regulation of injection rates by traffic. In other words, if traffic level N defines an order parameter, it will interact with a control parameter (λ), reducing it when N is large and increasing it when low. The feedback between these two key quantities results in a state dominated by fluid, but fluctuating, traffic with many characteristics in common with the observed dynamical patterns of the Internet.

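In order to explicitly consider the feedback between order and control parameters, we can consider a new mean field approximation based on the previous local rules. It can be shown, assuming finite H, that the new set of equations is now:

$$\frac{d\Gamma}{dt} = \rho\lambda\left(1 - \frac{\Gamma}{H}\right) - \frac{\langle k \rangle}{D}\,\Gamma \tag{8}$$

$$\frac{d\lambda}{dt} = \mu(1 - \lambda) - \frac{\Gamma}{\langle k \rangle} \tag{9}$$

For low density levels (i.e. Γ ≪ H, consistent with fluid traffic) the single fixed point (obtained from dΓ/dt = dλ/dt = 0) is

$$(\Gamma^*, \lambda^*) = \left(\frac{1}{\dfrac{\langle k \rangle}{\rho D} + \dfrac{1}{\mu \langle k \rangle}},\; 1 - \frac{\Gamma^*}{\mu \langle k \rangle}\right) \tag{10}$$

The Jacobi matrix L for the previous set of equations is given by

$$L = \begin{pmatrix} -\dfrac{\langle k \rangle}{D} & \rho \\ -\dfrac{1}{\langle k \rangle} & -\mu \end{pmatrix}. \tag{11}$$

The associated eigenvalues are

$$\Lambda_\pm = \frac{1}{2}\left[-\left(\frac{\langle k \rangle}{D} + \mu\right) \pm \sqrt{\left(\frac{\langle k \rangle}{D} - \mu\right)^2 - \frac{4\rho}{\langle k \rangle}}\,\right]$$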

both of them real and negative: the point attractor is globally stable. Numerical simulations of the model on a Poissonian graph (where the previous approximation would hold) agree with these predicted values. Topological features are included only as averaged quantities, here mean degree 〈k〉 and diameter D. However, our interest is to explore the traffic dynamics on a realistic network architecture, in order to provide the closest modeling approach under the previous rules. It has been shown that the Internet does not display the homogeneous architecture assumed by the Poissonian graphs [13]. In the next section the network topology used in our analysis is presented. Together with an explicit definition of the routing algorithm, it will complete our model's description.
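The fixed point and its stability can be checked numerically. The sketch below works with the low-density form of the feedback equations, dΓ/dt = ρλ − (〈k〉/D)Γ and dλ/dt = μ(1 − λ) − Γ/〈k〉, and obtains the eigenvalues from the trace and determinant of the 2 × 2 Jacobi matrix. The diameter D = 8 is an assumed value chosen only for illustration; the other parameters follow the simulation settings quoted in the paper.

```python
import cmath


def fixed_point(k_mean, diameter, rho, mu):
    """(Gamma*, lambda*) of the low-density mean field equations."""
    gamma = 1.0 / (k_mean / (rho * diameter) + 1.0 / (mu * k_mean))
    lam = 1.0 - gamma / (mu * k_mean)
    return gamma, lam


def eigenvalues(k_mean, diameter, rho, mu):
    """Eigenvalues of [[-<k>/D, rho], [-1/<k>, -mu]] from its
    trace and determinant (complex type, so any sign of the
    discriminant is handled)."""
    a = k_mean / diameter
    trace = -(a + mu)
    det = a * mu + rho / k_mean
    root = cmath.sqrt(trace * trace - 4.0 * det)
    return (trace + root) / 2.0, (trace - root) / 2.0


# <k> = 4, rho = 0.1, mu = 0.01 as in the simulations; D = 8 is assumed.
gamma_star, lam_star = fixed_point(4, 8, 0.1, 0.01)
ev_plus, ev_minus = eigenvalues(4, 8, 0.1, 0.01)
```

Since the trace is always negative and the determinant always positive, both eigenvalues have negative real part for any positive parameter values; whether they are also real depends on the sign of the discriminant.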

3 Network topology

In previous analyses [6,7] the topology of the graph was chosen to be a lattice. Although this might seem a limited topological arrangement, it has been successfully tested on real hardware and the presence of a phase transition fully confirmed [14]. Moreover, mesh connected topologies may be the most efficient solution at the limit of very large parallel computer architectures [15,16].

Regular lattices fail to describe Internet topology. Remarkably, the Internet displays a scale-free architecture [13]. The origins and causes of this topology have been the subject of much discussion and controversy. It has been shown that most existing Internet generators fall in a very different region of the phase space from the one where the real Internet is located [1]. This phase space is defined by (D_f, σ, α), where every parameter defines a major force shaping a different aspect of the large-scale Internet topology. Note that this model does not define all detailed correlations observed in the Internet and/or the precise functional form of the Internet's path length and degree distribution (e.g. the exact exponent of the scale-free distribution). This is a minimal set of universal parameters that any realistic Internet model must satisfy and it is a very good approximation to the large-scale topology.

The spatial distribution of Internet nodes is not random. A strong correlation has been noticed between the fractal distribution of cities and that of Internet nodes. The measured fractal dimension is D_f = 1.5. When generating the Internet-like network, the position of nodes is obtained by sampling a Rayleigh-Levy dust of the same fractal dimension. The likelihood of placing a link between two nodes s_i and s_j depends both on the (euclidean) link length w_ij and linear preferential attachment:

$$\Pi(k_j, w_{ij}) \approx \frac{k_j^{\alpha}}{w_{ij}^{\sigma}}.$$


Note that longer links cost more and thus will be selected with less probability. Traditional topology generators based on the Waxman model [17] wrongly assume exponential decay instead of linear cost decay. On the other hand, there is empirical evidence that highly connected nodes will be linked with higher probability. Increasing α will favor linking to nodes with higher degree, while a higher σ will penalize longer links. Real measurements [1] place the Internet at D_f = 1.5, σ = 1 and α = 1. Here, all numerical simulations have been performed using this topological arrangement (Fig. 1).
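The attachment rule, in which the chance of linking to node s_j grows with its degree as k_j^α and decays with the euclidean link length as w_ij^(−σ), can be turned into a probability distribution over candidate nodes by normalising the weights. This is a minimal sketch of that step only; the normalisation, the function name `link_probabilities` and the sample coordinates are assumptions, and node placement on a fractal set is not reproduced here.

```python
import math


def link_probabilities(new_pos, positions, degrees, alpha=1.0, sigma=1.0):
    """Probability of linking a new node at new_pos to each existing
    node, with weight k_j**alpha / w_ij**sigma normalised to sum to 1.
    Assumes no existing node sits exactly at new_pos."""
    weights = []
    for (x, y), k in zip(positions, degrees):
        w = math.hypot(x - new_pos[0], y - new_pos[1])  # euclidean length
        weights.append(k ** alpha / w ** sigma)
    total = sum(weights)
    return [wt / total for wt in weights]


# Three hypothetical existing nodes seen from a new node at the origin.
positions = [(1.0, 0.0), (2.0, 0.0), (0.0, 4.0)]
degrees = [2, 2, 8]
probs = link_probabilities((0.0, 0.0), positions, degrees)
```

In this example the distant hub (degree 8 at distance 4) is exactly as likely a target as the nearby low-degree node (degree 2 at distance 1), which illustrates how α and σ trade degree against distance.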

4 Path horizon and routing tables

An important ingredient when modeling network dynamics is to describe the paths followed by packets towards their destinations, that is, the routing policy. By properly defining this routing algorithm, we will complete our model's definition.

Real routing protocols do not drive packets at random. Instead, they try to route packets along the most efficient routes (e.g. minimizing distance or latency). At the same time, it is unrealistic to assume that all packets follow optimal paths, because of the large amount of global information that would have to be replicated at every single node. Clearly, the paths traced by packets in real networks are properly characterized as a trade-off between random diffusion and optimal routing. This is reflected in the real Internet by its two-level routing organization. Nodes are grouped in so-called Autonomous Systems (AS) and different routing rules apply at each level. Intra-AS routing is based on shortest path routing but inter-AS routing does not clearly follow any minimization criteria.

A simple way to explore the cost/efficiency trade-off is by introducing a parameter defining the visibility scope of the node (depth of routing parameter m or node domain diameter), that is, a sphere Γ^(m)(s_i) of radius m centered at every node s_i ∈ Ω (Fig. 1). We allow every node to know every other node at a distance of m hops or less but no more. No information will be stored about nodes outside the node domain. This idea is indicated in Figure 2, where a given target node s_i is shown at the center of its m-sphere. When a foreign node s_j ∈ Ω − Γ^(m)(s_i) sends a packet towards s_i, the packet performs a random walk while it moves outside the sphere. Once the packet hits the boundary of the sphere, ∂Γ^(m)(s_i), it is routed along the shortest path [19].
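The routing policy just described, a random walk outside the m-sphere of the target and shortest-path forwarding inside it, can be sketched on an adjacency-list graph. This is an illustration under several assumptions: the graph is connected and undirected, queues and congestion are ignored, ties between equally short next hops are broken by the smallest node label, and all function names are hypothetical.

```python
import random
from collections import deque


def hop_distances(graph, target):
    """Breadth-first hop distance from every node to the target."""
    dist = {target: 0}
    queue = deque([target])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def next_hop(graph, current, m, dist, rng=random):
    """Choose the next node for a packet sitting at `current`."""
    if dist[current] <= m:
        # Inside the path horizon: move one hop closer to the target.
        return min(v for v in graph[current] if dist[v] == dist[current] - 1)
    # Outside the horizon: uniform random walk step.
    return rng.choice(graph[current])


def route(graph, source, target, m, rng=random, max_steps=10_000):
    """Sequence of nodes visited by one packet from source to target."""
    dist = hop_distances(graph, target)
    path = [source]
    while path[-1] != target and len(path) <= max_steps:
        path.append(next_hop(graph, path[-1], m, dist, rng))
    return path


# Ring of six nodes: node i is linked to i - 1 and i + 1 (mod 6).
ring = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
deterministic_path = route(ring, 3, 0, m=3)  # horizon covers the source
random_path = route(ring, 3, 0, m=0, rng=random.Random(1))  # pure random walk
```

Setting m = 0 leaves only the target itself inside the horizon, so the packet wanders until it arrives by chance, while any m at least as large as the distance between source and target gives purely deterministic routing.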

5 Efficiency and network’s exploration

The depth of routing parameter m induces a hierarchy of path subsets over the entire set of available network paths (see Fig. 3). The random (m = 0) and deterministic (m = M) routing represent the most general and restrictive subsets, respectively. Increasing m will progressively reduce the randomness of routing decisions. Here, we find

[Figure 2 diagram: nodes s_i and s_j and the domain δ_m(s_i) of depth m.]

Fig. 2. Network dynamics and routing: it involves a given depth of routing m (a path horizon). A packet traveling from s_j to s_i within the δ_m(s_i) domain (of depth m) is deterministically routed along the shortest path (d(j, i) ≤ m). The packet traverses hosts (squares) and routers (circles) indistinctly. Packets traveling outside the m-domain (i.e. for d(j, i) > m) have more than one path choice and perform random walks. As soon as the packet enters the m-domain, it is deterministically routed along the shortest path. Here we would have m = 2.

[Figure 3 diagram: nested sets of paths, labelled RW (m = 0), 0 < m < D_max(Ω), and SP (m > D_max(Ω)).]

Fig. 3. Hierarchy of path sets defined by the routing policy. Every routing scheme explores a fraction of network paths (a subset of the entire path set P(Ω)). The pure random walk strategy (m = 0) visits all available paths, even shortest paths. In fact, random walks travel more frequently along those shortest paths. The most restrictive set is associated with pure shortest path routing (m ≥ [〈d〉]), which chooses a small fraction of all possible routes on the network.

a nested collection of subsets corresponding to the intermediate situations 0 < m < M. The relative size of each subset can be approximated by measuring the fraction of nodes F visited by packets:

$$F = \frac{1}{M} \sum_{i=1}^{M} \theta[A_i] \qquad (12)$$

where $A_i$ is the probability of visiting the node $s_i$. This term depends both on the network dynamics and the


S. Valverde and R.V. Sole: Internet’s critical path horizon 249

Fig. 4. Fraction of visited nodes (vertical axis, 0.5 to 1.0) against path horizon m (horizontal axis, 0 to 10). The fraction of visited nodes depends on the amount of order defined by the depth of routing parameter. Determinism sharply constrains the routing paths. Every point was obtained by averaging over an ensemble of four different networks, and every single network was simulated in five different host configurations. Network parameters: M = 512, 〈k〉 = 4, ρ = 0.1, µ = 0.01, H = 10, Df = 1.5, σ = 1, α = 1 and T = 7 × 10^5 steps.

geometrical situation of the node within the network. Considerable effort has been devoted to characterizing the likelihood of visiting a node in a purely static fashion. In this context, two relevant centrality (or 'load' [20,21]) measures are Random Walk Centrality [22] and Betweenness Centrality [23], which represent the two extremes of our hierarchy of path subsets. However, the correlation of these measurements with the dynamic centrality $A_i$ is weak at best.

In Figure 4 the numerically measured F(m) is shown for a finite range of m values. This fraction equals 1 for random routing, 0 ≤ m < [〈d〉], where 〈d〉 is the average shortest path length and [x] denotes the integer part of x: random routing does not discard any network node. A sharp transition takes place beyond this point, for [〈d〉] < m ≤ M, where a large fraction of network nodes (about 45 percent) is never visited by deterministic routing. This severely restricts the diversity of routes and yields a load-insensitive system [24]. The system requires a certain degree of noise in order to avoid sending packets through already collapsed nodes, so that it can choose less optimal but free routes. Note that the depth of routing parameter m defines an order parameter because it quantifies the degree of order existing in the system. The existence of an order parameter confirms the critical nature of Internet traffic.
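As a rough illustration of how the fraction of visited nodes F could be measured from simulated packet trajectories, the sketch below counts the nodes that appear in at least one path. It assumes that θ is a step function (1 when a node has been visited at least once, 0 otherwise), which the text does not state explicitly; the function name and the list-of-paths input are hypothetical.

```python
def fraction_visited(paths, num_nodes):
    """Fraction of the `num_nodes` nodes that appear in at least one packet path.

    Each path is a sequence of node identifiers. A node counts once, however
    often it is visited, which mirrors a step function applied to A_i.
    """
    visited = set()
    for path in paths:
        visited.update(path)
    return len(visited) / num_nodes


# Three packets on a six-node network; nodes 4 and 5 are never used.
example_paths = [[0, 1, 2], [2, 1, 3], [0, 1, 3]]
print(fraction_visited(example_paths, 6))
```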

Numerical simulations have shown that flow is maximized at the order-disorder transition point m = [〈d〉]. This can be observed in the network throughput, which reaches a maximum at this point (see Fig. 5). Throughput is defined as the quotient of the number of successfully released packets and the total number of packets generated during the simulation. In the next section, we will show that several

Fig. 5. Exploring the dependency of network traffic on the depth of routing parameter m (horizontal axis of every panel, 0 to 15). The lines connect the points for illustrative purposes only. (A) Mean latency 〈T〉 is considerably reduced when enough system information is given to individual nodes. Most packets are forwarded along the minimum number of hops. (B) Global throughput (efficiency) is optimal at the critical point, when the path horizon is about the network diameter. (C) The mean packet rate λ also drops at intermediate values of m. (D) The mean workload (load) also experiences a sudden transition from heavy to light load. Network parameters: M = 250, 〈k〉 = 4, Df = 1.5, σ = 1, α = 1. Simulation parameters: ρ = 0.1, µ = 0.01, H = 10, T = 10^5 steps. The shape of these plots does not depend on network size, that is, the optimal point is always found at m = [〈d〉].

real statistics can be reproduced by the system dynamics poised to criticality.

6 Average stretch

The average stretch s measures the efficiency of routing by comparing the number of hops h traversed by a packet to the shortest path distance d between source and destination:

$$s = \frac{h}{d}.$$
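A minimal sketch of this measurement, assuming that each delivered packet is logged as a pair (hops travelled, shortest-path distance) with distance of at least one hop; the record format and the function name are illustrative assumptions.

```python
def average_stretch(records):
    """Mean of h/d over delivered packets.

    `records` is a list of (hops_travelled, shortest_distance) pairs, where
    shortest_distance >= 1. A value of 1.0 means every packet took a shortest path.
    """
    ratios = [h / d for h, d in records]
    return sum(ratios) / len(ratios)


# One packet on a shortest path, one twice as long, one 50% longer.
print(average_stretch([(4, 4), (6, 3), (3, 2)]))  # 1.5
```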

Compact routing schemes minimize the average stretch while keeping the routing tables small [25]. Reducing the average stretch progressively raises the memory requirements at each router. Assuming the Thorup-Zwick (TZ) compact routing scheme [26], the average stretch can be expressed as a function of the distance distribution and the graph size M only: 〈s〉 = f(〈d〉, σd) [27]. The TZ scheme ensures a nearly optimal memory upper bound for 〈s〉 = 3 in generic networks. For scale-free networks, TZ achieves lower bounds. In particular, for the Internet interdomain graph (Autonomous Systems) with degree distribution exponent γ ≈ 2.1 and M = 10^4, the TZ average stretch is 〈s〉 ≈ 1.14 [27]. Also, the average number of entries in the routing tables is approximately 52. It turns out that the 〈s〉(〈d〉, σd) surface has a unique minimum. Strikingly, the points corresponding to the Internet distance distribution are very close to it [27]. This suggests that


250 The European Physical Journal B

Fig. 6. (A) Average stretch (stretch factor against path horizon m). (B) Average fraction of nodes inside the m-domain (mean number of neighbours divided by network size, against path horizon m), which is an estimate of the amount of information required by each node. Here open circles correspond to the Internet model. For comparison, the simulation has been repeated in a Poissonian network of the same size and mean degree (filled circles). Network parameters: M = 500, 〈k〉 = 4, Df = 1.5, σ = 1, α = 1. Simulation parameters: ρ = 0.1, µ = 0.01, H = 10, T = 5 × 10^4 timesteps. Measurement window = 200. The distributions were obtained from a statistical ensemble of three networks, every network simulated three times with hosts located at different configurations.

Internet topology is shaped by some hidden optimization criteria. However, the TZ scheme is not a realistic Internet interdomain routing scheme because it assumes that a global view of the topology is available. We have numerically measured the average stretch and the average fraction of neighbours at distance m for our routing scheme (see Fig. 6). Note that the critical path horizon m = [〈d〉] is very close to the minimum average stretch.

7 Network fluctuations and performance

In order to test the goodness of the model presented here, it is interesting to compare some real Internet statistics with the corresponding model measurements. In particular, those measurements will be collected for the model at the critical path horizon, that is, when m = [〈d〉].

When studying the competition between a network's internal collective dynamics (i.e. Internet traffic) and external environmental changes (i.e. traffic sources or host behaviour), it is useful to study the relationship

Fig. 7. The relationship between fluctuations σ and the average incoming node flux (panel "flow in"; a similar distribution holds for the average outgoing router flux, panel "flow out"). Both panels are log-log plots of σ against mean flow, with a reference slope of 0.5. The plot shows that both quantities are related by a power law of exponent 1/2, which is consistent with the measurements from Internet routers. Network parameters: M = 500, 〈k〉 = 4, Df = 1.5, σ = 1, α = 1. Simulation parameters: ρ = 0.1, µ = 0.01, H = 10, T = 10000 steps. Measurement window = 200. The distributions were obtained from a statistical ensemble of 10 networks, every network simulated 5 times with hosts located at 5 different configurations.

between the mean flux and the size of fluctuations around the average [10]. Previous explanations of the occurrence of self-similarity in traffic networks are based on the superposition of many highly variable (infinite variance) sources [11]. This point of view discards the effects of the system's collective dynamics and treats the Internet as an externally driven system. Real data show that Internet dynamics cannot be simply reduced to the behaviour of traffic sources.

Let us define the incoming (outgoing) flux $f_i$ as the amount of packets received (forwarded) at router $s_i$ during a given and fixed period of time. For every router, compare the average flux $\langle f_i \rangle$ with the dispersion $\sigma_i$ around the mean. It has been noticed that for several real systems the following scaling relation holds:

$$\sigma \approx \langle f \rangle^{\alpha}$$

where α is an exponent which can take the values 1/2 and 1 [10]. This suggests that real systems can be classified into two main classes depending on the value of this exponent. The relevant exponent in this context is α = 1/2, which has been observed in daily traffic measurements at 374 geographically distinct Internet routers [10]. Systems exhibiting the 1/2 exponent are representative of endogenous dynamics, that is, dynamics determined by the system's internal collective fluctuations. Moreover, measurements on our model reproduce the same exponent (see Fig. 7).
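One simple way to estimate the exponent α from per-router data is an ordinary least-squares fit of log σ against log 〈f〉. The sketch below assumes this fitting procedure, which is not necessarily the one used for Fig. 7 or in [10]; the function name and the synthetic data are illustrative only.

```python
import math


def scaling_exponent(mean_flux, dispersion):
    """Least-squares slope of log(sigma_i) against log(<f_i>).

    Both inputs are equal-length lists of positive numbers, one entry per router.
    """
    xs = [math.log(f) for f in mean_flux]
    ys = [math.log(s) for s in dispersion]
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return sxy / sxx


# Synthetic check: sigma = sqrt(<f>) should give an exponent of 1/2.
flux = [1.0, 4.0, 16.0, 64.0, 256.0]
sigma = [math.sqrt(f) for f in flux]
print(scaling_exponent(flux, sigma))
```

A slope near 1/2 would place the system in the endogenous class described above, and a slope near 1 in the externally driven class.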

A measure of Internet end-to-end performance is the normalized latency time τ [28], defined as the quotient between latency time L (measured as Round-Trip-Time) and geographical distance w:

$$\tau = \frac{L}{w}.$$

There are several factors governing τ. First, the propagation rate is finite: there is a minimum delay because packets cannot move faster than the speed of light. Second, the number of nodes traversed departs from the minimum found along the geodesic path from source to destination. And third, packets spend some time at every intermediate node because of queueing delays. Define $\tau_{min} = L_{min}/w$ as the normalized latency without taking into account queue delays and $\tau_{av} = L_{av}/w$ as the normalized latency while considering all factors affecting packet latency.
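For a single source-destination pair, the two normalized latencies could be computed from a set of round-trip-time samples as sketched below. Taking the smallest sample as an estimate of $L_{min}$ (the latency without queueing) and the sample mean as $L_{av}$ is an assumption made here for illustration; the function name and units are hypothetical.

```python
def normalized_latencies(rtt_samples, distance):
    """Return (tau_min, tau_av) for one source-destination pair.

    `rtt_samples` are round-trip times in any consistent unit and `distance`
    is the geographical distance w between the two endpoints.
    """
    l_min = min(rtt_samples)                     # assumed proxy for no-queue latency
    l_av = sum(rtt_samples) / len(rtt_samples)   # includes queueing delays
    return l_min / distance, l_av / distance


print(normalized_latencies([20.0, 30.0, 40.0], 10.0))  # (2.0, 3.0)
```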

Their probability distributions have been measured from two years of PingER [29] data and follow power-law scaling with stable exponents of about −3.0 for $\tau_{min}$ and −2.5 for $\tau_{av}$ [28]. Numerical simulations of the model poised to criticality reproduce the power-law $P(\tau_{av})$ probability distribution. The exponent of the distribution is about −2.45, which is very close to real observations (see Fig. 8). These results seem to be quite robust and independent of changes in most model parameters. It is interesting to note that the current model cannot reproduce the correlation between $\tau_{min}$, $\tau_{av}$ and the geographical distance w reported by previous studies [28]. The reason might be that the propagation rate is (unrealistically) assumed to be infinite in the model. It might be that this factor has little influence in shaping the $\tau_{av}$ probability distribution. Moreover, it is worth noting that any deviation of the order parameter from the critical regime resulted in an exponential distribution for $\tau_{av}$, reinforcing the view of an optimally efficient and self-organized Internet.

8 Discussion

In this paper we have shown that a simple model of traffic dynamics incorporating the appropriate Internet network topology is able to recover several statistical features of real traffic. More importantly, we have seen that the routing algorithms can take advantage of the small-world structure of the web by reaching a critical path horizon close to the network's average path length. In doing so, a highly efficient system is reached at low cost: routing strategies only need to consider a small depth. Once the m parameter reaches the network's diameter, no further information is required to properly reach the target. A full-system deterministic routing strategy is actually unnecessary and would be too costly. Instead, the constraints imposed by the network's architecture make it possible to exploit the implicit information defined by the small-world architecture.

It might be that the Internet evolves in a way that maximizes throughput or global performance. This trend is constrained within the limits of available communication resources. Inside this regime, the Internet is shaped in order to provide better response. This is reflected in the optimal and lower average stretch observed in the real Internet, which is close to the global minimum [26]. This suggests

Fig. 8. Normalized latency $\tau_{av}$ distributions $P(\tau_{av})$ (log-log axes, reference slope −2.5) at the critical point m = [〈d〉] follow a power law of exponent ∼ −2.45 and fit the real measurements very well (see text). Here, we plot the distributions for H = 5 (shaded line) and H = 10 (continuous line), showing that the long tail does not depend on the maximum queue size. Any deviation from m = 〈d〉 results in an exponential distribution, deviating from real observations. Network parameters: M = 1000, 〈k〉 = 4, Df = 1.5, σ = 1, α = 1. Simulation parameters: ρ = 0.1, µ = 0.01, H = 5, 10 and T = 5 × 10^5 steps. The distributions were obtained from a statistical ensemble of 10 different networks, and every network was simulated five times with different host arrangements.

that some hidden optimization is at work. Clearly, an unresponsive system would be of no use. How a distributed collection of designers was able to define this globally efficient infrastructure is a question that deserves attention.

This paper is dedicated to the memory of Per Bak. This work was supported by grant BFM2001-2154 and by the Santa Fe Institute.

References

1. S.-H. Yook, H. Jeong, A.-L. Barabasi, Proc. Natl. Acad. Sci. USA 99, 13382 (2002)
2. B.A. Huberman, R.M. Lukose, Science 277, (1997)
3. N. Wiener, Cybernetics (John Wiley and Sons, New York, 1949)
4. J.H. Cowie, D.M. Nicol, A.T. Ogielski, Comput. Sci. Engin. 1, 42 (1999)
5. W. Willinger, R. Govindan, S. Jamin, V. Paxson, S. Shenker, PNAS 99 (Suppl. 1), 2573 (2002)
6. R.V. Sole, S. Valverde, Physica A 289, 595 (2001)
7. S. Valverde, R.V. Sole, Physica A 312, 636 (2002)
8. N. Hohn, D. Veitch, P. Abry, ACM/SIGCOMM Internet Measurement Workshop (Marseille, France, 2002) pp. 63–68
9. K. Fukuda, A Study of Phase Transition Phenomena in Internet Traffic, Ph.D. thesis, Keio Univ. (1999)
10. M. Argollo de Menezes, A.-L. Barabasi, cond-mat/0306304 (2003)
11. W. Willinger, M.S. Taqqu, R. Sherman, D.V. Wilson, IEEE/ACM Trans. on Networking 5, 71 (1997)
12. H. Fuks, A.T. Lawniczak, adap-org/9909006 (2001)
13. M. Faloutsos, P. Faloutsos, C. Faloutsos, ACM SIGCOMM 29, 251 (1999)
14. K. Bolding, M.L. Fulgham, L. Snyder, Tech. Rep. CSE-94-02-04 (1994)
15. P.M.B. Vitanyi, SIAM J. Comput. 17 4, 659 (1988)
16. G. Bilardi, F.P. Preparata, CS-93-20, Dept. Comp. Sci., Brown Univ. (1993)
17. B. Waxman, IEEE J. Selec. Areas Commun., SAC-6(9), 1617 (1988)
18. A. Vazquez, R. Pastor-Satorras, A. Vespignani, cond-mat/0206084 (2002)
19. When the link decision is ambiguous (more than one link can be selected) the less visited link until the moment is chosen (this could be implemented by maintaining a counter of the number of packets forwarded through the link)
20. J. Dong Noh, H. Rieger, cond-mat/0307719 (2003)
21. K.-I. Goh, B. Kahng, D. Kim, Traffic and Granular Flow ’01 (Springer, Berlin, 2003)
22. M.E.J. Newman, cond-mat/0309045 (2003)
23. L.C. Freeman, Sociometry 40, 35 (1979)
24. D.H. Lorenz, A. Orda, D. Raz, Y. Shavitt, TR-2001-17, DIMACS (2001)
25. L.J. Cowen, Proc. of the 10th Annual ACM-SIAM Symp. on Discrete Algorithms (1999)
26. M. Thorup, U. Zwick, Proc. 33th Annual ACM Symposium on Theory of Computing (SPAA), 1-10 (2001)
27. D. Krioukov, K. Fall, X. Yang, cond-mat/0308288 (2003)
28. R. Percacci, A. Vespignani, cond-mat/0209619 (2002)
29. Internet End-to-end Performance Monitoring, http://www-iepm.slac.stanford.edu.


4.6 Scale-Free Networks from Optimal Design


Europhysics Letters PREPRINT

Scale-free Networks from Optimal Design

S. Valverde1, R. Ferrer Cancho1 and R. V. Sole1,2

1 ICREA-Complex Systems Lab, IMIM-UPF, Dr Aiguader 80, Barcelona 08003, SPAIN
2 Santa Fe Institute, 1399 Hyde Park Road, New Mexico 87501, USA.

PACS. 05.10.-a – Computational methods in statistical physics.
PACS. 05.65.+b – Self-organizing systems.

Abstract. – A large number of complex networks, both natural and artificial, share the presence of highly heterogeneous, scale-free degree distributions. A few mechanisms for the emergence of such patterns have been suggested, optimization not being one of them. In this letter we present the first evidence for the emergence of scaling (and the presence of small-world behavior) in software architecture graphs from a well-defined local optimization process. Although the rules that define the strategies involved in software engineering should lead to a tree-like structure, the final net is scale-free, perhaps reflecting the presence of conflicting constraints unavoidable in a multidimensional optimization process. The consequences for other complex networks are outlined.

Two basic features common to many complex networks, from the Internet to metabolic nets, are their scale-free (SF) topology [1] and a small-world (SW) structure [2, 3]. The first states that the proportion of nodes P(k) having k links decays as a power law $P(k) \sim k^{-\gamma} \phi(k/\xi)$ (with γ ≈ 2−3) [1, 4, 5] (here φ(k/ξ) introduces a cut-off at some characteristic scale ξ). Examples of SF nets include Internet topology [4, 6], cellular networks [7, 8], scientific collaborations [9] and lexical networks [10]. The second refers to a web exhibiting very small average path lengths between nodes along with a large clustering [2, 3].

Although it has been suggested that these nets originate from preferential attachment [4], the success of theoretical approximations to branching nets from optimization theory [11, 12] would support optimality as an alternative scenario. In this context, it has been shown that minimization of both vertex-vertex distance and link length (i.e. Euclidean distance between vertices) [13] can lead to the SW phenomenon. In a similar context, SF networks have been shown to originate from a simultaneous minimization of link density and path distance [14]. Optimal wiring has also been proposed within the context of neural maps [15]: 'save wiring' is an organizing principle of brain structure. However, although the analysis of functional connectivity in the cerebral cortex has shown evidence for SW [16], the degree distribution is clearly non-skewed but single-scaled (i.e. ξ is very small).

The origin of highly heterogeneous nets is particularly important since it has been shown that these networks are extremely resilient under random failure: removal of randomly chosen nodes (typically displaying low degree) seldom alters the fitness of the net [17]. However, when nodes are removed by sequentially eliminating those with higher degree, the system rapidly experiences network fragmentation [17, 18].
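The contrast between random failure and degree-targeted removal can be explored with a small experiment such as the one sketched below. It is an illustrative sketch only: the size of the largest connected component is used here as a stand-in for the "fitness of the net", degrees are ranked once on the intact graph rather than recomputed after each removal, and all function names are hypothetical.

```python
import random


def largest_component_size(adj, removed):
    """Size of the largest connected component after deleting `removed` nodes."""
    seen = set(removed)
    best = 0
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        stack = [start]
        size = 0
        while stack:
            u = stack.pop()
            size += 1
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        best = max(best, size)
    return best


def attack(adj, fraction, targeted, rng):
    """Remove a fraction of nodes, by decreasing degree or uniformly at random."""
    n_remove = int(fraction * len(adj))
    if targeted:
        order = sorted(adj, key=lambda u: len(adj[u]), reverse=True)
    else:
        order = list(adj)
        rng.shuffle(order)
    return largest_component_size(adj, order[:n_remove])


# Toy hub-and-spoke graph: removing the hub shatters it, removing a leaf does not.
star = {0: list(range(1, 10))}
star.update({i: [0] for i in range(1, 10)})
print(attack(star, 0.1, True, random.Random(0)))
```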

© EDP Sciences


Fig. 1 – (a) One of the largest components of the Java net ($\Omega_2$), which displays scale-free and small-world behavior (see text). In (b) the cumulative frequencies $P_>(k)$ are shown for the two largest components on log-log axes (cumulative frequency against k; fitted slopes −1.50 and −1.65). We have $P_>(k) \sim k^{-\gamma+1}$, with fitted cumulative exponents $\gamma_1 - 1 = 1.5 \pm 0.05$ and $\gamma_2 - 1 = 1.65 \pm 0.08$.

Artificial networks offer an invaluable reference when dealing with the rules that underlie their building process [19]. Here we show that a very important class of networks, derived from software architecture maps, displays the previous patterns as a result of a design optimization process.

Understanding how to build software systems efficiently is one of our major concerns. Software is present at the core of scientific research, economic markets, military equipment and health care systems, to name a few. Large costs (thousands of millions of dollars) are associated with the software development process. In the past 30 years we have witnessed the birth and technological evolution of software engineering, whose objective is to provide methodologies and tools to control and build software efficiently. Software engineers conceive programs with graphs, as architects use plans for buildings. The software architecture is the structure of the program. The building blocks are software components and the links are relationships between software components. The interactions between all the components yield the program functionality. Class diagrams constitute a well-known example of such graphs [20]. In this case, software components are also known by the technical term class. We have analysed the class diagram of the public Java Development Framework 1.2 (JDK1.2) [21], which is a large set of software components widely used by Java applications, as well as the architecture of a large computer game [22].

These are examples of highly optimized structures, where design principles call for diagram comprehensibility, grouping components into modules, flexibility and reusability (i.e. avoiding the same task being performed by different components) [23]. Although the entire plan is controlled by engineers, no design principle explicitly introduces preferential attachment, scaling or small-worldness. The resulting graphs, however, turn out to be SW and SF nets.

The software graph is defined by a pair $\Omega_s = (W_s, E_s)$, where $W_s = \{s_i\}$, $(i = 1, ..., N)$ is the set of $N = |\Omega|$ classes and $E_s = \{\{s_i, s_j\}\}$ is the set of edges/connections between classes. The adjacency matrix $\xi_{ij}$ indicates that an interaction exists between classes $s_i, s_j \in \Omega_s$ ($\xi_{ij} = 1$) or that the interaction is absent ($\xi_{ij} = 0$). The average path length l is given by the average $l = \langle l_{min}(i, j) \rangle$ over all pairs $s_i, s_j \in \Omega_s$, where $l_{min}(i, j)$ indicates the length of the shortest path between two nodes. The clustering coefficient is defined as the probability that two classes that are neighbors of a given class are neighbors of each other. Poissonian graphs with an average degree k are such that C ≈ k/N and the path length follows [3]:

$$l \approx \frac{\log N}{\log(k)} \qquad (1)$$

C is easily defined from the adjacency matrix, and is given by:

$$C = \left\langle \frac{2}{k_i (k_i - 1)} \sum_{j=1}^{N} \xi_{ij} \left[ \sum_{k \in \Gamma_i} \xi_{jk} \right] \right\rangle_{\Omega_s} \qquad (2)$$

It provides a measure of the average fraction of pairs of neighbors of a node that are also neighbors of each other.
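The two small-world diagnostics can be computed directly from an adjacency structure, as in the following sketch. It follows the verbal definitions given above (fraction of linked neighbour pairs, mean shortest-path length) rather than transcribing equation (2) term by term. The dict-of-sets representation, the convention that nodes with fewer than two neighbours contribute zero clustering, and the function names are assumptions made for illustration.

```python
import math
from collections import deque


def clustering_coefficient(adj):
    """Average over nodes of the fraction of neighbour pairs that are linked.

    `adj` maps each node to the set of its neighbours. Nodes with fewer than
    two neighbours contribute 0 (an assumed convention).
    """
    total = 0.0
    for nbrs in adj.values():
        k = len(nbrs)
        if k < 2:
            continue
        nbr_list = list(nbrs)
        links = sum(
            1
            for a in range(k)
            for b in range(a + 1, k)
            if nbr_list[b] in adj[nbr_list[a]]
        )
        total += 2.0 * links / (k * (k - 1))
    return total / len(adj)


def average_path_length(adj):
    """Mean shortest-path length over all pairs of distinct, mutually reachable nodes."""
    total, pairs = 0, 0
    for s in adj:
        dist = {s: 0}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs


def random_graph_baseline(n, k):
    """Poissonian expectations quoted in the text: C ~ k/N and l ~ log N / log k."""
    return k / n, math.log(n) / math.log(k)
```

A graph is then a small-world candidate when its clustering is much larger than the first baseline value while its path length stays close to the second.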

The building process of a software graph is done in parallel (different parts are built and gradually get connected) and is assumed to follow some standard rules of design [20, 23]. None of these rules refers to the overall organization of the final graph. Essentially, they deal with optimal communication among modules and low cost (in terms of wiring), together with the rule of avoiding hubs (classes with a large number of dependencies, that is, large degree). The set of bad design practices, such as making use of large hubs, is known as antipatterns in the software literature (see [24]). The development time of the application should be as short as possible because of the high costs involved. It is argued in the literature [23] that there is an optimum number of components for which the cost of development is minimized, but it is not possible to make a reliable prediction about this number. Adding new software components involves more cost in terms of interconnections between them (links). Conversely, the cost per single software component decreases as the overall number of components (nodes) is increased, because the functionality is spread over the entire system. Intuitively, a trade-off between the number of nodes and the number of links must be chosen.

However, we have found that this (local) optimization process results in a net that exhibits both scaling and small-world structure. First, the JDK1.2 network that we analyzed has N = 9257 nodes and $N_c = 3115$ connected components, so that the complete graph $\Omega_s$ is actually given by $\Omega_s = \cup_i \Omega_i$, where the set is ordered from larger to smaller components ($|\Omega_1| > |\Omega_2| > ... > |\Omega_{N_c}|$). The largest connected component, $\Omega_1$, has $N_1 = 1376$ nodes, with 〈k〉 = 3.16 and γ = 2.5. Its clustering coefficient [4] is $C = 0.06 \gg C_{rand} = 0.002$ and its average distance is $l = 6.39 \approx l_{rand} = 6.28$, i.e. it is a small world. The same basic results are obtained for $\Omega_2$ (shown in fig. 1a): here we have $N_2 = 1364$, 〈k〉 = 2.83 and γ = 2.65, $C = 0.08 \gg C_{rand} = 0.002$ and $l = 6.91 \approx l_{rand} = 6.82$.
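As a quick check of where the random-graph reference values for the largest component come from, the Poissonian estimates C ≈ k/N and l ≈ log N / log k can be evaluated with the quoted size and mean degree. The arithmetic below is a worked illustration for $\Omega_1$ only, using natural logarithms (the ratio does not depend on the base):

```latex
C_{rand} \approx \frac{\langle k \rangle}{N_1} = \frac{3.16}{1376} \approx 0.002,
\qquad
l_{rand} \approx \frac{\log N_1}{\log \langle k \rangle}
         = \frac{\ln 1376}{\ln 3.16}
         \approx \frac{7.227}{1.151} \approx 6.28 .
```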

The degree distribution for the two largest components is shown in figure 1b, where we have represented the cumulative distribution

$$P_>(k; \Omega_i) = \sum_{k' \ge k}^{N(\Omega_i)} p(k', \Omega_i) \qquad (3)$$

for i = 1, 2. We can see that the largest components display scaling, with estimated exponents γ ≈ 2.5−2.65.
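The cumulative distribution is straightforward to build from a list of node degrees, as in the sketch below (the function name and the normalization by the number of nodes are illustrative choices). If $P(k) \sim k^{-\gamma}$, the cumulative form decays as $k^{-\gamma+1}$, so a slope of about −1.5 on a log-log plot corresponds to γ ≈ 2.5.

```python
from collections import Counter


def cumulative_degree_distribution(degrees):
    """Return {k: P_>(k)}, the fraction of nodes whose degree is at least k."""
    n = len(degrees)
    counts = Counter(degrees)
    result = {}
    running = 0
    # Walk from the largest degree down, accumulating the tail of the histogram.
    for k in sorted(counts, reverse=True):
        running += counts[k]
        result[k] = running / n
    return result


# Six nodes: three of degree 1, two of degree 2 and one hub of degree 4.
print(cumulative_degree_distribution([1, 1, 1, 2, 2, 4]))
```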

Similar results have been obtained from the analysis of a computer game graph [22]. This is a single, complex piece of software which consists of N = 1989 classes involving different

Fig. 2 – (a) Using the 32 connected components with more than 10 classes (nodes), the l log(k) against N plot is shown (N on a logarithmic axis). As predicted from a SW structure, the components follow a straight line in this linear-log diagram. Three subwebs are shown (b-d), displaying hubs but no clustering (their location is indicated in (a)). The black square corresponds to the computer game graph.

aspects such as real-time computer graphics, rigid body simulation, sound and music playing, graphical user interface and memory management. The software architecture graph for the game has a large connected component that relates all subsystems. The cumulative degree frequency for the entire system is scale-free, with γ = 2.85 ± 0.11. The network also displays SW behaviour: the clustering coefficient is $C = 0.08 \gg C_{rand} = 0.002$ and the average distance is l = 6.2, close to $l_{rand} = 4.84$.

These results reveal a previously unreported global feature of software architecture which can have important consequences in both technology and biology. This is, as far as we know, the first example of a scale-free graph resulting from a local optimization process instead of preferential attachment [4] or duplication-rewiring [25, 26] rules. Since the failure of a single module leads to system breakdown, no global homeostasis has been at work as an evolutionary principle, as might have occurred in cellular nets. In spite of this, the final structure is very similar to those reported from the analysis of cellular networks. Second, our results suggest that optimization processes might also be at work in the latter, as has been shown to occur in transport nets [11].

Complex biosystems are often assumed to result from selection processes together with a large amount of tinkering [27]. By contrast, it is often assumed that engineered, artificial systems are highly optimized entities, although selection would also be at work [28]. Such differences should be observable when comparing both types, but the analysis of both natural and artificial nets indicates that they are often remarkably similar, perhaps suggesting general organization principles. Our results support an alternative scenario to preferential attachment, based on a process of cost minimization together with optimal communication among units [14]. The fact that small-sized software graphs are trees (as one would expect from optimization leading to hierarchical structures, leading to stochastic Cayley trees [6]) but that clustering emerges at larger sizes might be the outcome of a combinatorial optimization process: as the number of modules increases, the conflicting constraints that arise among different parts of


the system would prevent reaching an optimal structure [29]. Concerning cellular networks, although preferential linking might have been at work [30], optimization has probably played a key role in shaping metabolic pathways [31–33]. We conjecture that the common origin of SF nets in both cellular and artificial systems such as software might stem from a process of optimization involving low cost (sparse graph) and short paths. For cellular nets (but not in their artificial counterparts) the resulting graph includes, for free, an enormous homeostasis against random failure.

∗ ∗ ∗

The authors thank Javier Gamarra, Jose Montoya, William Parcher, Charles Herman and Marcee Herman for useful comments. This work was supported by the Santa Fe Institute (RFC and RVS) and by grants of the Generalitat de Catalunya (FI/2000-00393, RFC) and the CICYT (PB97-0693, RVS).

1. Albert, R. and Barabasi, A.-L., cond-mat/0106096.

2. Watts, D. J. & Strogatz, S. H. Nature 393 (1998) 440.

3. Newman, M. E. J. J. Stat. Phys. 101 (2000) 819.

4. Barabasi, A.-L. & Albert, R. Science 286 (1999) 509.

5. Amaral, L. A. N., Scala, A., Barthelemy, M. & Stanley, H. E. Proc. Natl. Acad. Sci. USA 97 (2000) 11149.

6. Caldarelli, G., Marchetti, R. and Pietronero, L. Europhys. Lett. 52 (2000) 304.

7. Jeong, H., Mason, S., Barabasi, A. L. and Oltvai, Z. N. Nature 411 (2001) 41.

8. Jeong, H., Tombor, B., Albert, R., Oltvai, Z. N. and Barabasi, A.-L. Nature 407 (2000) 651.

9. Newman, M. E. J. Proc. Natl. Acad. Sci. USA 84 (2001) 404.

10. Ferrer i Cancho, R. and Sole, R. V. Procs. Roy. Soc. London B, 268 (2001) 2261.

11. West, B. and Brown, J. Scaling in Biology, Oxford, New York (2000).

12. Rodriguez-Iturbe, I. and Rinaldo, A. Fractal River Basins, Cambridge U. Press, Cambridge (1997).

13. Mathias, N. and Gopal, V. Phys. Rev. E 63 (2001) 1.

14. Ferrer Cancho, R. and Sole, R. V., SFI Working paper 01-11-068.

15. Cherniak, C. Trends Neurosci. 18, 522-527 (1995).

16. Stephan, K. A., Hilgetag, C.-C., Burns, G. A. P. C., O’Neill, M. A., Young, M. P. and Kotter, R. Phil. Trans. Roy. Soc. B 355 (2000) 111.

17. Albert, R. A., Jeong, H. and Barabasi A.-L. Nature, 406 (2000) 378.

18. Sole, R. V. and Montoya, J. M. (2001) Procs. Royal Soc. London B 268, 2039.

6 EUROPHYSICS LETTERS

19. Ferrer Cancho, R. Janssen, C. and Sole, R. V. (2001) Phys. Rev. E, 63 32767.

20. Gamma, E., Helm, R., Johnson, R. and Vlissides, J. (1994) Design Patterns: Elements of Reusable Object-Oriented Software (Addison-Wesley, New York)

21. Sun, Java Development Kit 1.2. Web site: http://java.sun.com/products/java/1.2/

22. UbiSoft ProRally 2002: http://ubisoft.infiniteplayers.com/especiales/prorally/

23. Pressman, R. S. (1992) Software Engineering: A Practitioner’s Approach, (McGraw-Hill)

24. Brown, W. H., Malveau, R., McCormick, H., Mowbray, T., and Thomas, S. W. (1998) Antipatterns: Refactoring Software, Architectures, and Projects in Crisis, (John Wiley & Sons, New York)

25. Sole, R. V., Pastor-Satorras, R., Smith, E. D. and Kepler, T. (2002). Adv. Complex Syst. (in press)

26. Vazquez, A., Flammini, A., Maritan, A. and Vespignani, A. (2001) cond-mat/0108043.

27. Jacob, F. (1976) Science 196, 1161-1166.

28. Monod, J. (1970) Le hasard et la necessite, Editions du Seuil, Paris.

29. Kauffman, S. A. (1993) Origins of Order, Oxford, New York.

30. Wagner, A. and Fell, D. A. Proc. Roy. Soc. London B 268 (2001) 1803.

31. Mittenthal, J. E., Yuan, A., Clarke, B. and Scheeline, A. (1998) Bull. Math. Biol. 60, 815-856.

32. Melendez-Hevia, E. Waddell, T. G. and Shelton, E. D. Biochem. J. 295, 477.

33. Melendez-Hevia, E. Waddell, T. G. and Montero, F. J. Theor. Biol. 166 (1994) 201.

112 CHAPTER 4. ARTICLES

4.7 Network Motifs in Computational Graphs: A Case Study in Software Architecture

Network motifs in computational graphs: A case study in software architecture

Sergi Valverde1 and Ricard V. Solé1,2

1 ICREA-Complex Systems Lab, Universitat Pompeu Fabra, Dr. Aiguader 80, 08003 Barcelona, Spain
2 Santa Fe Institute, 1399 Hyde Park Road, New Mexico 87501, USA

Received 1 July 2004; revised manuscript received 3 June 2005; published 8 August 2005

Complex networks in both nature and technology have been shown to display characteristic, small subgraphs (so-called motifs) which appear to be related to their underlying functionality. All these networks share a common trait: they manipulate information at different scales in order to perform some kind of computation. Here we analyze a large set of software class diagrams and show that several highly frequent network motifs appear to be a consequence of network heterogeneity and size, thus suggesting a somewhat less relevant role of functionality. However, by using a simple model of network growth by duplication and rewiring, it is shown that the rules of graph evolution seem to be largely responsible for the observed motif distribution.

DOI: 10.1103/PhysRevE.72.026107 PACS numbers: 89.75.Fb, 89.20.Ff, 87.80.Vt

I. INTRODUCTION

Many natural and artificial systems are describable as networks of interacting components [1–4]. The network is a medium that allows resource sharing, often involving an efficient transport of energy (metabolism, power grid), matter (highways, airport webs), or information (cellular communication, Internet). The architecture of complex networks can be explored at different scales, from the overall properties defined by average measures (such as path length or clustering), correlations, or degree distributions to the more fundamental features displayed by small subsystems. In this context, it has been shown that some special, small subgraphs (so-called motifs) seem to be particularly relevant in describing the architecture of complex networks [5]. Motifs have been suggested to be the functional building blocks of network complexity. Are some subgraphs more common than others because of their functional relevance?

An alternative view is that the rules of network growth can by themselves favor some subgraphs with no special relation to the underlying functionality. Actually, this seems to be the case for the structure of the protein-protein interaction map. In spite of the fact that proteins perform functions, the overall architecture of the protein network is easily reproduced by means of a simple model of node duplication plus rewiring [6]. Such properties include scale freeness, small-world features, and even hierarchical organization and protomodularity [7]. Mounting evidence suggests that many key features of complex networks (including motifs) might be strongly tied to the global network structure [8].

If functional constraints on network architecture have to be considered, one particularly relevant aspect of network complexity is associated with the presence of some underlying computational process. Computation is a key ingredient of any complex adaptive system (CAS). By storing and processing information, CAS’s are able to predict and adapt to external fluctuations. Computation occurs in both natural and artificial systems [9], although the building process that creates the computational structure is different. This is actually one of the most important points here: are the rules of designed and evolved systems completely different? Biological networks largely originate through tinkering [10–12]: new components are obtained by re-using old ones, mainly by duplicating them. In spite of the apparent limitations of such mechanisms, they allow one to discover good designs [13]. More complex computations can be developed as the network size is increased and new functions can emerge.

How is computation linked to network structure? Tentative answers, to be developed here, can be obtained by looking at a very important class of computationally driven networks: software systems. They offer a unique opportunity of exploring different levels of complexity with well-defined functional traits. As opposed to most examples of evolving computational networks, extensive databases storing software evolution registers exist and involve a high degree of detail. Here we analyze the largest data set of software maps explored to date (83 different systems). The main goal of our study is to see if functionality, as opposed to network evolution, is a main constraint on the distribution of network motifs in real graphs. The paper is organized as follows. In Sec. II an overview of the software systems analyzed here is given. In Sec. III, the statistical patterns of network motifs in a large set of software systems are presented and the presence of scaling relations and the size-dependent frequency of motifs analyzed. In Sec. IV a model of duplication and rewiring is used in order to reproduce the structure of motifs of a large software map. In Sec. V a general discussion is presented.

II. SOFTWARE NETWORKS

Programming languages describe software systems [14]. Every computer program has a textual representation following syntactic rules dictated by a programming language [15]. The program is decomposed into a number of simpler software entities or logical elements, which are given a unique name. Software entities include things such as data objects, instructions, subprograms, or modules. A hierarchy or natural ordering between software entities is prompted by modern programming languages. At the lowest level, a program is viewed as a sequence of simple machine instructions. Sequences of related instructions are enclosed in subprograms. At the highest level, there are modules or logical containers grouping simpler software entities. Often, modules are defined as functional blocks but there are no strict principles driving module composition.

PHYSICAL REVIEW E 72, 026107 (2005)

1539-3755/2005/72(2)/026107(8)/$23.00 ©2005 The American Physical Society

It is useful to depict the complex structures defined in computer programs by means of a graph [16], where nodes represent software entities and links represent syntactical relationships between modules, subprograms, and instructions (see Fig. 1). In this paper, we will focus on a particularly interesting subset of software entities: the collection of modules and their static interactions (also called software architecture). While it is widely acknowledged that software evolution depends on its architecture, very little is known about the cause and effect relationships between design practices and evolution outcomes [17]. In order to understand the relation between structure and artificial evolution, we have envisaged a network model of software architecture, hereafter called the software map or software network. Here, we show how the evolution of computer programs can be understood by recovering and analyzing their software networks at different stages of development.

Following [16,18], Fig. 2 shows the text of an incomplete C++ [19] computer program [see Fig. 2(a)] and its corresponding software map [see Fig. 2(b)]. The program text reads from left to right and top to bottom. The software network $\Omega = (V, L)$ of this C++ program is recovered by means of a very simple lexical analysis. First, we identify the vector of all module names (also class in C++) given by $W = \{w_i\}$ = (point, chessmen, point, move, point, point, pawn, chessmen, move). Name ordering is important when recovering module dependences (see below). Names $w_i$ that appear in the head of a module declaration provide the set of network nodes $V = \{v_i\}$. These names (hereafter called module definitions) are easily identified because they are preceded by the C++ keyword class. The remaining names are called module references. In this example, we have four ($N = |V| = 4$) unique module definitions: point ($w_1$), chessmen ($w_2$), move ($w_4$), and pawn ($w_7$). This defines a mapping from names $w_k \in W$ to network nodes $v_i \in V$ in the software network.

The design of any nontrivial function involves the interaction of at least two modules [20]. Static module interactions can be derived from the relative positions of names in $W$. Let us assume that $w_k \in \{w_1, w_2, w_4, w_7\}$ is a module definition associated with node $v_i$ and $w_l \notin \{w_1, w_2, w_4, w_7\}$ is a module reference associated with node $v_j$. A directed link $v_i \to v_j \in L$ signals a dependence from module definition $w_k$ to module reference $w_l$. Link directionality reflects name ordering in the C++ program, that is, $k < l$. There are two types of module dependences: association (also “has a” relationship) or inheritance (or “is a” relationship). The purpose of these dependences is to establish a logical organization of the system. However, our analysis is centered on the study of topological patterns and does not take into account detailed relationship semantics. In an association, referenced node $v_j$ is nested in the C++ block of module $v_i$. This block is always bracketed by the symbols { and }. In an inheritance, referenced node $v_j$ always follows the C++ sequence : public after the referencing node $v_i$ [see Fig. 2(a)]. Repeated links are not considered in the following analysis.
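The lexical analysis described above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' tool: the C++ snippet is a hypothetical stand-in for Fig. 2(a), written only so that its module names appear in the order given by $W$, and the function name software_map and its regular expressions are assumptions that ignore nested braces, templates, and comments.

```python
import re

# Hypothetical stand-in for the chess program of Fig. 2(a); only the order of
# the module names (point, chessmen, point, move, ...) follows the text.
SOURCE = """
class point { int x; int y; };
class chessmen { point position; };
class move { point from; point to; };
class pawn : public chessmen { move last; };
"""

def software_map(source):
    """Recover (nodes, name vector W, links) from simple, flat class declarations."""
    # Module definitions: names preceded by the keyword `class`.
    nodes = list(dict.fromkeys(re.findall(r"\bclass\s+(\w+)", source)))
    # W: every occurrence of a module name, in reading order.
    names = [tok for tok in re.findall(r"\w+", source) if tok in nodes]
    links = set()  # a set, so repeated links are counted once
    decl = re.compile(r"\bclass\s+(\w+)\s*(?::\s*public\s+(\w+))?\s*\{([^{}]*)\}")
    for match in decl.finditer(source):
        name, base, body = match.groups()
        if base in nodes:
            # "is a": the referenced name follows `: public`.
            links.add((name, base, "inheritance"))
        for tok in re.findall(r"\w+", body):
            if tok in nodes and tok != name:
                # "has a": the referenced name is nested in the { } block.
                links.add((name, tok, "association"))
    return nodes, names, sorted(links)

nodes, names, links = software_map(SOURCE)
for source_module, target_module, kind in links:
    print(f"{source_module} -> {target_module} ({kind})")
```

On this toy input the sketch yields the four module definitions of the example and four directed links, with the two references from move to point collapsed into a single link.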

Software maps capture the topology of complex software systems. In particular, these maps provide a quantitative approach to the evolution of technology. They are actually evolving entities and somehow inhabit an intermediate zone between computing machines and neural structures. We have shown software networks to be scale free and small world [16,18,21]. Software networks can be described from a statistical physics perspective.

III. SOFTWARE MOTIFS

In this section, we extend our previous topological studies by analyzing software networks at the level of network subgraphs, or subsets of connected nodes in a network. The statistics of subgraphs provides important information about network structure. It has been claimed that overrepresented subgraphs (i.e., motifs) signal key building blocks of networks [5]. This might be the case for regulatory networks, where specific subgraphs (i.e., feed-forward loops) perform information processing functions [22]. A particular class of subgraphs, cycles, has received considerable interest. Cyclical dependences in software maps imply that a module is related to itself, which may be acceptable, unacceptable, or required [23]. Ambiguity in the functional meaning of cycles suggests that subgraphs in software graphs are not strictly related to well-defined functions. The ubiquity of subgraphs in software networks seems to be a consequence of top-down mechanisms of software organization and not a consequence of selective pressures.

Following the method outlined in the previous section, we have recovered and analyzed a large dataset of software maps from 83 reverse-engineered C++ software systems. A given graph can be characterized by a degree sequence. For the whole graph, each node $i$ is characterized by the in-degree list $\{K_i\}$ and the out-degree list $\{R_i\}$ with $i = 1, \ldots, N$. The lists would be completed by the so-called mutual edges $\{M_i\}$, i.e., cases where there is a pair of edges in both directions between two nodes. For each subgraph $\Omega_i$ of size $n$ (here $n = 3, 4$), another degree sequence would be provided by two new lists, now $\{k_j\}$ and $\{r_j\}$ for the in- and out-degrees, respectively. For example, for the left subgraph in Fig. 1, we would have $\{k_j\} = \{0, 1, 1, 2\}$ and $\{r_j\} = \{2, 1, 1, 0\}$.

FIG. 1. Examples of common network motifs with n=4 elements found in software graphs. Here each node is a class and arrows indicate static dependences among classes (see text).

FIG. 2. (a) A piece of C++ code from a chess-playing program and (b) the corresponding software map or network model displaying the collection of modules and their logical dependences. The only information required to recover the software map is the set of module names [highlighted in bold in (a)] and their relative locations in the C++ program (see text). Notice how nodes are labeled with their names and links are decorated with relationship type.
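As a small worked example of the degree lists just introduced, the sketch below computes in-degree, out-degree, and mutual-edge counts for a directed edge list. Two things are assumptions here: the helper name degree_lists, and the convention that mutual edges are counted separately from single in- and out-links. The four-node test graph is the bi-parallel pattern, chosen only because its lists match the values $\{0, 1, 1, 2\}$ and $\{2, 1, 1, 0\}$ quoted in the text.

```python
from collections import Counter

def degree_lists(edges):
    """Return {node: (in_degree, out_degree, mutual_degree)} for a directed edge list."""
    edge_set = set(edges)  # repeated links are not considered
    nodes = sorted({x for edge in edge_set for x in edge})
    # A mutual edge is a pair of links running in both directions.
    mutual = {(a, b) for (a, b) in edge_set if (b, a) in edge_set}
    single = edge_set - mutual
    in_deg = Counter(b for _, b in single)
    out_deg = Counter(a for a, _ in single)
    mut_deg = Counter(a for a, _ in mutual)
    return {v: (in_deg[v], out_deg[v], mut_deg[v]) for v in nodes}

# Bi-parallel pattern: 1 -> 2, 1 -> 3, 2 -> 4, 3 -> 4.
biparallel = [(1, 2), (1, 3), (2, 4), (3, 4)]
for node, (k, r, m) in degree_lists(biparallel).items():
    print(node, "in:", k, "out:", r, "mutual:", m)
```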

Network motifs are defined in terms of subgraphs which appear much more often than expected from pure chance. Specifically, they occur with a high frequency compared with that expected from an ensemble of randomized graphs with identical degree structure [5]. The random networks are generated by means of the switching rule. For every pair of links $i \to j$ and $u \to v$ in the original software network, we add the pair $i \to v$ and $u \to j$ in the randomized network. This rule keeps intact the in- and out-degree sequences but destroys degree-degree correlations. The statistical significance of a given subgraph $\Omega_i$ is described by its Z score [5], defined as

$$Z(\Omega_i) = \frac{N_{\mathrm{real}}(\Omega_i) - \langle N_{\mathrm{rand}}(\Omega_i) \rangle}{\sigma(N_{\mathrm{rand}}(\Omega_i))}. \qquad (1)$$

Here $N_{\mathrm{real}}(\Omega_i)$ is the number of times the subgraph appears in the network, whereas $\langle N_{\mathrm{rand}}(\Omega_i) \rangle$ and $\sigma(N_{\mathrm{rand}}(\Omega_i))$ refer to the mean and standard deviation (SD) of its appearances in the randomized ensemble, respectively. In order to be significant, it is required that $|Z(\Omega_i)| > 2$. When $Z(\Omega_i) > 2$ [$Z(\Omega_i) < -2$] the motif (antimotif) is considered to be more (less) common than expected at random. In Fig. 3 the results from our analysis are shown for some typical software networks. A handful of these subgraphs appear to be present in all software systems analyzed and also in both electronic circuits and biological networks involving computation. This is the case of Bi-parallel (S904), Bi-fan (S204), the feed-forward loop (S38), and its close variants such as S2186 and S408. Such a common point might be easily interpreted in functional terms: similar subgraphs are abundant because they are selected (or chosen) to perform a given function or task. As shown below, no evidence from statistical patterns supports such a view.

FIG. 3. Network motifs with n=3,4 elements found in software graphs. The numbers of nodes and edges for each network are shown. The most frequent motifs were classified in distinct rows for each type of system: medium software systems, large software systems, gene regulatory nets, neural networks, and digital electronic circuits. For each motif, we display the number of occurrences in the network (Nreal), the number of occurrences (Nrand ± SD) in a set of 100 randomized network versions, and a qualitative measure of its statistical significance as given by the Z score (see text). Medium and large software networks share a large number of motifs but we found larger variability in the medium data set. A remarkable difference is motif 2190 (the last motif in the second row), which appeared only in the context of large software systems.
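The randomization and the Z score of Eq. (1) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the procedure used to produce Fig. 3: only the feed-forward loop is counted, the function names are hypothetical, and the number of attempted swaps per randomized graph (ten times the number of links) is an arbitrary choice. The ensemble size of 100 follows the caption of Fig. 3.

```python
import random
from collections import Counter
from statistics import mean, pstdev

def count_ffl(edges):
    """Count feed-forward loops a -> b, b -> c, a -> c."""
    succ = {}
    for a, b in edges:
        succ.setdefault(a, set()).add(b)
    total = 0
    for a, outs in succ.items():
        for b in outs:
            for c in succ.get(b, ()):
                if c != a and c in outs:
                    total += 1
    return total

def switch_randomize(edges, n_swaps, rng):
    """Degree-preserving switching: (i->j, u->v) becomes (i->v, u->j)."""
    edges = list(edges)
    present = set(edges)
    for _ in range(n_swaps):
        p, q = rng.sample(range(len(edges)), 2)
        (i, j), (u, v) = edges[p], edges[q]
        if i == v or u == j or (i, v) in present or (u, j) in present:
            continue  # the swap would create a self-loop or a repeated link
        present.difference_update([(i, j), (u, v)])
        present.update([(i, v), (u, j)])
        edges[p], edges[q] = (i, v), (u, j)
    return edges

def degree_sequences(edges):
    """Out-degree and in-degree counters, used to check that swaps preserve them."""
    return Counter(a for a, _ in edges), Counter(b for _, b in edges)

def z_score(edges, n_random=100, seed=0):
    """Z score of the feed-forward loop against a switched ensemble."""
    rng = random.Random(seed)
    counts = [count_ffl(switch_randomize(edges, 10 * len(edges), rng))
              for _ in range(n_random)]
    sd = pstdev(counts)
    return (count_ffl(edges) - mean(counts)) / sd if sd > 0 else float("nan")

# Small hypothetical directed graph, used only to exercise the functions.
toy = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4), (4, 0), (1, 4), (3, 1)]
print("feed-forward loops:", count_ffl(toy), "Z score:", z_score(toy))
```

Each accepted swap leaves every node with the same number of incoming and outgoing links, which is why the null model retains the degree structure while scrambling everything else.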

Assuming sparse graphs ($\langle K \rangle \ll N$), the probability of a given subgraph $\Omega_i$ can be estimated. Following Itzkovitz et al. [24], we can see how this is calculated using the first subgraph in Fig. 1. Here we have $\{K_j\} = \{2, 1, 1, 0\}$ and $\{R_j\} = \{0, 1, 1, 2\}$. The idea is to compute the different probabilities associated with each directed edge linking all pairs of nodes. For example, the probability of having a directed link from node 1 to node 2 (for $K_1 R_2 \ll N \langle K \rangle$) is approximately

$$P(1 \to 2) = \frac{K_1 R_2}{N \langle K \rangle}, \qquad (2)$$

which can be interpreted as follows [24]: we perform $K_1$ attempts for the first node to connect to the target node with a probability $R_2 / N \langle K \rangle$. Similarly, we would have

$$P(1 \to 3) = \frac{(K_1 - 1) R_3}{N \langle K \rangle}, \qquad (3)$$

with the same approach being used for all edges. The average number of appearances of $\Omega_i$ is finally computed by averaging. Itzkovitz et al. [24] have shown that the average number of appearances $\langle G \rangle$ of a given subgraph is given by a product of moments of different orders of the in-degree, out-degree, and mutual degree distributions:

$$\langle G \rangle \sim N^{n - g_a - g_m} \langle K \rangle^{-g_a} \langle M \rangle^{-g_m} \prod_{j=1}^{n} \left\langle \binom{K_i}{k_j} \binom{R_i}{r_j} \binom{M_i}{m_j} \right\rangle_i, \qquad (4)$$

where $g_a$ and $g_m$ are the numbers of single and mutual edges. The approximation assumes uncorrelated, sparse networks ($\langle K \rangle \ll N$). Both conditions are met by software maps [25]. These mean-field quantities can be used as a null-model estimate of the number of motifs and, eventually, to detect significant deviations from randomness. Since different motifs are found in different systems [5], they can actually allow us to identify the basic functional blocks for a given class of networks.
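A quick numerical check helps to see why the single-edge estimate of Eq. (2) is a sensible mean-field quantity: summed over all ordered pairs of nodes, it returns the total number of links. The sketch below assumes, following the "attempts" interpretation quoted above, that $K$ counts the links leaving the source node and $R$ the links arriving at the target node. The function name and the toy degree lists are hypothetical.

```python
def edge_probability(out_links, in_links, i, j):
    """Mean-field estimate K_i * R_j / (N * <K>) of a directed link i -> j."""
    n = len(out_links)
    mean_k = sum(out_links) / n
    return out_links[i] * in_links[j] / (n * mean_k)

# Hypothetical sparse graph: 20 nodes, 30 links, degrees of 1 or 2 only.
out_links = [2, 1] * 10
in_links = [1, 2] * 10

# Summing over all ordered pairs recovers the number of links.
expected_links = sum(edge_probability(out_links, in_links, i, j)
                     for i in range(20) for j in range(20))
print(edge_probability(out_links, in_links, 0, 1), expected_links)
```

The check only works when the two lists describe the same number of links, and the estimate is meaningful as a probability only in the sparse regime $K_i R_j \ll N \langle K \rangle$.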

By exploring our collection of software graphs, we determined $\langle G \rangle$ for real nets (indicated as $N_{\mathrm{real}}$ in Fig. 3) and compared them to $N_{\mathrm{rand}}$. Here software maps with a size $N > 10$ have been analyzed. Two groups have been chosen, involving medium-sized graphs ($N < 300$) and large graphs ($N > 300$). The previous set is compared with results from other networks involving computational tasks. Here previous results for both gene and neural networks are also shown for comparison (data from [5]). The reason for using biological networks as a reference system is twofold. First, the chosen systems are known to perform computational tasks or can be described by means of an equivalent computational circuit. Second, it has been conjectured that both natural and artificial networks might share some commonalities relating to the mechanisms that shape their evolution [12]. Common features might reflect common functional traits, but also (as shown below) common rules of graph evolution with no special functional meaning.

In order to explore the question of how relevant the overall network structure is in conditioning the frequency of given subgraphs, we should consider the global structure of the network. The first approximation is to consider the degree of heterogeneity as provided by the distribution of links. Software systems have a well-defined scale-free in-degree distribution

$$P_i(k) = \frac{\gamma_i - 1}{k_0^{1 - \gamma_i}} (k + k_0)^{-\gamma_i}, \qquad (5)$$

with $\gamma_i > 2$. A mean value $\langle \gamma_i \rangle = 2.09 \pm 0.05$ has been obtained by averaging over all the systems studied here. The distribution of scaling exponents is strongly peaked around $\gamma_i = 2$. The out-degree distribution $P_o(k)$ is much steeper and seems better described by a broad-scale distribution, not far from the exponential limit. This is actually the opposite of the situation considered in [24], but it is not difficult to show that the theoretical treatment is essentially symmetric.

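For the regime considered here, it was shown that $\langle G \rangle$ follows a scaling law

$$\langle G \rangle \sim N^{\alpha}. \qquad (6)$$

Specifically, for a given pair $(n, g)$ and a given scaling exponent $\gamma_i$, we have

$$\langle G \rangle \sim N^{n - g + s - \gamma_i + 1}, \qquad (7)$$

where $s$ is the maximum in-degree for our case. This scaling is actually valid for $2 < \gamma_i < s + 1$.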

Four examples of the observed scaling laws are shown in Fig. 4 for different software motifs. Using Eq. (3) the expected number of times a given motif appears would scale as $\langle G \rangle \sim N^{s + 1 - \gamma_i}$ and using the scaling exponent $\gamma_i \approx 2.09 \pm 0.06$ a scaling law $\langle G \rangle \sim N^{\alpha}$ would be predicted for uncorrelated, sparse graphs. Here we use our set of systems¹, whose size will be indicated as $n_i$ ($i = 1, \ldots, 83$; number of nodes of each graph). For convenience, we order the systems by increasing size (i.e., $n_i \le n_{i+1}$). If $G(n_i)$ is the number of times a given subgraph appears in the $i$th system of size $n_i$, we should expect a scaling relation $G(n_i) \sim n_i^{\alpha}$. In order to reduce the noise level we will use the cumulative distribution

$$G_{\mathrm{cum}}(N) = \sum_{n_i \le N} G(n_i). \qquad (8)$$

The cumulative distribution should scale as $G_{\mathrm{cum}} \sim N^{\alpha_c}$ with $\alpha_c = \alpha + 1$. As shown in Fig. 4, the predicted scaling is recovered from real data, thus indicating that the average trends are consistent with the expectation from random scale-free networks. It confirms the validity of the prediction of Itzkovitz et al. [24] and its agreement with a set of real networks. This agreement is an interesting result, particularly if

¹ Although a total of 83 systems have been used, the presence of a specific subgraph is size dependent. Not all systems exhibit all subgraphs: for small software maps some subgraphs are absent.

we remember that this is a system designed to perform given functions. The fact that we obtain the scaling law expected for the random, scale-free graph reveals that the observed scaling in motif abundances is a consequence of top-down constraints derived from graph evolution.
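The cumulative procedure of Eq. (8), followed by a least-squares fit on a log-log scale, can be sketched as below. The data are synthetic and purely illustrative: 83 hypothetical system sizes and noise-free counts that follow $G(n_i) = n_i^{\alpha}$ with $\alpha = 0.9$. The function names are assumptions, and system sizes are taken to be distinct.

```python
import math

def loglog_slope(xs, ys):
    """Least-squares slope of log(y) against log(x)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mx, my = sum(lx) / len(lx), sum(ly) / len(ly)
    sxx = sum((a - mx) ** 2 for a in lx)
    sxy = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    return sxy / sxx

def cumulative_counts(sizes, counts):
    """G_cum(N): sum of G(n_i) over all systems with n_i <= N."""
    result, running = [], 0.0
    for n, g in sorted(zip(sizes, counts)):
        running += g
        result.append((n, running))
    return result

# Synthetic example: 83 evenly spaced sizes, exact power-law counts.
alpha = 0.9
sizes = [10 * i for i in range(1, 84)]
counts = [n ** alpha for n in sizes]

cum = cumulative_counts(sizes, counts)
alpha_c = loglog_slope([n for n, _ in cum], [g for _, g in cum])
print("raw exponent:", loglog_slope(sizes, counts), "cumulative exponent:", alpha_c)
```

For a finite set of systems the fitted cumulative exponent only approaches $\alpha + 1$ from below, because the smallest sizes contribute correction terms, so some deviation from the ideal value is expected even without noise.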

IV. DUPLICATION-BASED EVOLUTION

The topology of software architecture emerges from designed evolution. On top of the process, there must be a basic building plan towards a final function or set of functions. The engineer foresees the outcome of his or her work. But there are a number of strong constraints, no less important, operating throughout the software building process. On the one hand, modular structures are shaped through parallel paths of evolution. Different blocks will be involved in more specific subfunctions. On the other hand, increased complexity leads to conflicts between different subparts. This is reflected, for example, at the topological level: small software maps tend to display tree structure, whereas larger systems typically display much more complex patterns [16]. The common overall structure detected in software graphs (in terms of the degree distribution and other average properties) suggests that the final topological patterns might be strongly constrained.

We conjecture that the abundance of subgraphs in software networks relates to universal mechanisms of network growth underlying their evolution. Real software maps tend to display motif generalizations, or subgraphs having a structure comprising many replicas of the four motifs observed here. These structures are highly redundant. This suggests a very simple duplication-based mechanism of subgraph generation. New modules depend upon other modules in order to provide useful functionalities, and it seems reasonable to assume that similar modules will share a large number of module dependences. In a related software engineering study [26], structural similarities in C++ software at the module (class) level have been analyzed. That study found quantitative evidence of structural duplications. However, it did not provide any model explaining the origin of duplications.

Figure 5 shows a detailed example illustrating how top-down duplication works in software development. Imagine we want to add a new software module representing the queen in the previous chess-playing program [see Fig. 2(a)]. First, we will add a new module declaration, which is conveniently named queen [see Fig. 5(a)]. Because a queen is a type of chessman, it seems reasonable to make this module depend upon the same modules referenced by similar modules, which in this case is the pawn. By using the pawn module and its neighborhood as a template, we add an inheritance relationship from queen to chessmen [see Fig. 5(b)]. Duplication is completed with the addition of a collaboration relationship from queen to move [see Fig. 5(c)]. Comparison between the final network [see Fig. 5(c)] and the initial network (see Fig. 2) reveals a new biparallel subgraph and twice the number of bifan subgraphs.

FIG. 4. Scaling in the number of appearances of a given motif against network size. Here four common motifs (each indicated) have been considered over the sample set [24]. Here we have S904 with n=4, g=4, and s=2; S472 with n=4, g=5, and s=2; S206 with n=4, g=5, and s=2; and S2186 with n=4, g=4, and s=3. The predicted exponents (using the average scaling exponent for the in-degree distribution, $\langle \gamma_i \rangle \approx 2.1$) would be $\alpha(\mathrm{S904}) = 0.9$, $\alpha(\mathrm{S472}) = \alpha(\mathrm{S206}) = 0.1$, and $\alpha(\mathrm{S2186}) = 2.1$, respectively. Using the cumulative number of graphs, $G_{\mathrm{cum}}$ (see text), we obtain $\alpha_c(\mathrm{S904}) = 1.86 \pm 0.16$, $\alpha_c(\mathrm{S472}) = 0.97 \pm 0.07$, $\alpha_c(\mathrm{S206}) = 1.18 \pm 0.17$, and $\alpha_c(\mathrm{S2186}) = 3.12 \pm 0.11$, in good agreement with the predicted values. The fit was made using least squares on a log-log scale.

FIG. 5. From (a) to (c), an illustration of the duplication mechanism in software map evolution. Time flows from top to bottom. Here, a new module queen is introduced by cloning the links of the similar module pawn. Every stage displays the evolving C++ program (right) and its corresponding software network (left) reconstructed by the method described in Sec. II. New text is enclosed in a box. Note how duplication of links in the software map parallels duplication of code in the C++ program.

An example of this process taking place in real software development is shown in Fig. 6. Here a given subsystem inside a growing software graph is displayed at different development stages. Duplication of nodes seems to be at work, as well as further removal of many links associated with a given hub. From (c) to (d) a duplication of the hub involving many incoming links has taken place together with some further node addition. From (d) through (i), it is evident how a large number of new classes were added by copying the pattern of single nodes connected to two central hubs. There is extensive rewiring in some stages, such as in (f), where the lower hub loses a large fraction of in-links. Moreover, there is also the addition of new connections between existing nodes [see (h)→(i)]. The whole sequence spans 1 year of development. The main observation from this example (which is a typical one) is that node duplication plus rewiring, particularly link removal, is widespread. This is also the case in the evolution of cellular networks [6].

Examples like the previous one suggest that duplication-divergence growth is the cause of the observed subgraph abundances in software maps. This hypothesis can be tested by comparing the distribution of subgraphs in real networks with those obtained with a stochastic model of network growth based on asymmetric duplication-divergence rules, previously described in [6]. First, an initial random or backbone network of $m_0 < N$ nodes is created. This random graph is generated by the addition of nodes with degree $k_0 = 2$, every link pointing to a random target node [27]. This backbone possesses a treelike structure, as occurs with software maps at the beginning of their evolution. Starting from this backbone, we apply the following rules at each iteration of the model.

(i) Duplication. A randomly chosen target node $v$ is cloned, and the new node $w$ attaches to all the neighbors of the target node.

(ii) Divergence. For each pair of original and redundant links, remove one of them with a fixed divergence probability.

(iii) Cross linking. In addition, the target and new node are linked ($w \to v$) with a fixed cross-linking probability. This rule is important in order to generate triads (or 3-subgraphs).
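The three rules above can be sketched as a short simulation. This is a minimal sketch under stated assumptions rather than the model of [6] as implemented by the authors: the parameter names p_remove and p_cross are hypothetical labels for the divergence and cross-linking probabilities, the clone copies both incoming and outgoing links of its target, and divergence always removes the clone's copy of a link (one reading of "asymmetric").

```python
import random

def duplication_model(n_nodes, m0, p_remove, p_cross, seed=0):
    """Grow a directed graph by duplication, divergence, and cross linking."""
    rng = random.Random(seed)
    edges = set()
    # Backbone: each new node sends up to k0 = 2 links to random earlier nodes.
    for v in range(1, m0):
        for target in rng.sample(range(v), min(2, v)):
            edges.add((v, target))
    for w in range(m0, n_nodes):
        v = rng.randrange(w)  # randomly chosen target node to clone
        out_neighbors = [b for (a, b) in edges if a == v]
        in_neighbors = [a for (a, b) in edges if b == v]
        # Duplication plus divergence: each copied link survives
        # with probability 1 - p_remove.
        for b in out_neighbors:
            if rng.random() >= p_remove:
                edges.add((w, b))
        for a in in_neighbors:
            if rng.random() >= p_remove:
                edges.add((a, w))
        # Cross linking: connect the clone to its target, w -> v.
        if rng.random() < p_cross:
            edges.add((w, v))
    return edges

graph = duplication_model(n_nodes=60, m0=5, p_remove=0.5, p_cross=0.2, seed=1)
print(len(graph), "links among 60 nodes")
```

With p_remove = 1 and p_cross = 0 no copied link survives and only the backbone remains, which gives a simple way to check the implementation.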

In spite of the simple set of rules implicit in the duplication model, the frequencies of subgraphs obtained from our in silico system are remarkably close to those seen in their real counterparts. In Fig. 7, we have compared the concentration of 4-subgraphs expressed in various software networks and the concentration of 4-subgraphs predicted with the duplication-based evolution model. These plots were obtained with the following method. We generate 400 graphs, 100 for each of four different software networks: Blender, Filezilla, GTK, and Exult [28]. Each synthetic graph has the same number of links, L, and number of nodes, N, as measured in the corresponding software map and no further constraints are imposed. The parameter space is sampled uniformly. Once the synthetic networks are obtained, we perform a 4-subgraph census by counting the number of appearances of each 4-subgraph $\Omega_i$ in the real network and in the synthetic networks. Notice that we do not test for statistical significance (as in the motif analysis). Instead, our comparison test is based solely on raw subgraph counts. In order to compare the two systems, the raw number of subgraphs of size 4 is computed and the concentration C of subgraphs evaluated. Here, the concentration is simply the number of appearances of the 4-subgraph over the total number of 4-subgraphs found.

In Fig. 7, each point represents the pair ($C_{\mathrm{observed}}$, $C_{\mathrm{predicted}}$) of observed and predicted concentrations for a given 4-subgraph $\Omega_i$. Specifically, we display the set of

FIG. 6. A real instance of software network growth from a well-defined subsystem of Prorally [16] showing duplication. Evolution goes from top to bottom and left to right. Only the largest connected component is displayed here. The figure shows how the target hub in (c) has been duplicated in (d) (both nodes highlighted with a dotted box). Many duplicated nodes involve less connected targets [see (g) and (h)].

FIG. 7. Comparison of observed and predicted (from a duplication model) 4-motif concentrations for (a) Blender, (b) Filezilla, (c) GTK, and (d) Exult (here concentrations are rescaled by $10^{-3}$). The exponents for the least-squares fit are (a) 0.94±0.12, (b) 0.92±0.13, (c) 0.96±0.11, and (d) 1.14±0.12, respectively.

S. VALVERDE AND R. V. SOLÉ PHYSICAL REVIEW E 72, 026107 2005


pairs and the power law fit of $C_{pred}$ against $C_{obs}$. Despite fluctuations, the simple duplication model presented here predicts reasonably well the concentration of common software network motifs: the value of the exponent is reasonably close to 1 in all cases. This is remarkable, given the oversimplification considered here and given the limited constraints imposed on the selected model graphs to be compared with the real ones. The error bars grow as less common subgraphs are used. If we restrict ourselves to $C > 10^{-3}$, the exponent becomes much closer to 1. Specifically, we now obtain (a) 0.96±0.11, (b) 0.97±0.10, (c) 0.98±0.12, and (d) 1.06±0.18, respectively. Consistent with previous work [8], less common subgraphs are typically denser (have more links). In Fig. 8 an example of this correlation is shown for Exult. Using a frequency-rank plot of 4-subgraphs, we can see that subgraphs with high frequencies have few links, whereas higher ranks (small frequencies) are associated with dense subgraphs.
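A least-squares power-law fit of predicted against observed concentrations is commonly done as a straight-line fit in log-log space. The sketch below assumes that approach (the source does not state how the fit was computed), and the function name is hypothetical.

```python
import math


def power_law_fit(c_obs, c_pred):
    """Fit c_pred = prefactor * c_obs**exponent by least squares on logarithms.

    Returns (exponent, prefactor). All concentrations must be positive.
    """
    xs = [math.log(x) for x in c_obs]
    ys = [math.log(y) for y in c_pred]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    exponent = sxy / sxx
    prefactor = math.exp(mean_y - exponent * mean_x)
    return exponent, prefactor
```

An exponent close to 1 means that predicted and observed concentrations are proportional, which is the criterion used in the text. Restricting the input lists to subgraphs above a concentration threshold reproduces the second, filtered fit.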

V. DISCUSSION

In this paper we have analyzed the statistical patterns of network motifs in a large set of software diagrams. Software maps have been previously shown to be scale free and display small-world behavior [16,18,21], but no previous analyses focused on the small-scale architecture. The main goal of our study is to explore the relevance of graph evolution in relation to true functionality. Our study actually suggests that dynamical rules, with little relation to underlying functional constraints, largely determine the frequency of motifs in software graphs.

By using recent theoretical and numerical methods to measure and characterize network motifs, we have found the following.

(i) A number of network motifs are obtained, the most common being shared with other natural systems involving computational traits, such as genetic and neural networks.

(ii) The number of appearances of a given network motif scales as GNn−g+s−i+1, in agreement with previous calculations for random graphs with scale-free degree distributions. This result is supported by previous observations of the uncorrelated character of software maps.

(iii) Evidence from software evolution suggests that duplication and rewiring, as occurs with some cellular networks, might play a key role in shaping software maps. A previous model of network growth by duplication and diversification has been shown to fit the frequencies of appearance of network motifs rather well.

Previous studies have proposed the idea that network motifs seem to define the minimal, meaningful building blocks of network complexity. Perhaps not surprisingly, we often find them as the basic structures associated with specific functional traits, from computation to pattern formation. The former is exemplified by feed-forward loops, a three-element motif found in genetic regulatory systems [22,29]. The latter is actually a particularly relevant example. However, since the statistical distribution of network motifs involves dealing with large numbers of different subgraphs, the question of how motifs in general might reflect functional traits requires the formulation of appropriate null models of graph evolution. Such models must ignore any functional trait in order to test the possibility that the global properties of network structure (such as graph heterogeneity) might strongly influence what we should expect.

The model chosen here has been a duplication-rewiring one [6]. These models have been shown to generate heterogeneous graphs with many properties close to relevant biological systems such as protein-protein interaction maps. Network heterogeneity is largely due to effective preferential attachment. Additionally, the rules of duplication strongly bias the types of motifs to be formed towards some special subsets. The final consequence is that the patterns of network motifs generated by the duplication model might be able to explain in statistical terms the observed abundances of motifs, with no further requirement of functional constraints. The fact that biological systems, also involved in performing computations, have common motifs might support this view. Although sharing common motifs seems to call for common functionalities, it is important to remember that biological structures are largely generated through tinkering [10,11]. Protein interaction networks grow by gene duplication, and neural networks also experience increases of cell numbers together with wide synaptic changes. Perhaps the common traits are a by-product of the common tinkered evolution based on extensive reuse and copy of available structures.

One final comment concerns the common subgraphs also shared by digital circuits. They are not obtained, strictly speaking, through a process of duplication and rewiring. Although the way complex circuits are built does include some amount of reuse,² considerations involving low cost in links are of fundamental importance. In spite of such constraints, it has been shown that electronic circuits have small-world structure and are also highly heterogeneous [30]. Previous work seems to indicate that optimal design towards efficient

² As circuit complexity increases (both in terms of number of components and computational tasks) it becomes more difficult to design from scratch (choosing sets of small gates and building optimal, low-cost circuits). Predefined gates involving well-known (and sometimes complex) input-output functions are widely used and assembled together. In that sense, some amount of reuse is at work.

FIG. 8. Frequency-rank distribution of network subgraphs in a software network. Here the most frequent subgraph has rank 1, the second has rank 2, etc. The frequency P(r) of a subgraph with rank r decays rapidly with subgraph rank. An interesting feature is that the most common subgraphs are sparser than less common ones, which are denser.


communication at low cost can generate scale-free, heterogeneous architectures [31,32]. Such a result suggests again that network heterogeneity might pervade motif abundances [33].

ACKNOWLEDGMENTS

We thank Maggie Fitzgerald, Frankie Dunn, and Eddie Scrap for useful input. We also thank Shalev Itzkovitz for a careful reading and comments on an earlier version of the manuscript. The analysis of network motifs has been done using available free software from Uri Alon’s Lab (see http://www.weizmann.ac.il/mcb/UriAlon/index.html). This work has been supported by Grant No. FIS2004-05422 and by the EU within the 6th Framework Programme under Contract No. 001907, “Dynamically Evolving, Large Scale Information Systems” (DELIS).

[1] S. N. Dorogovtsev and J. F. F. Mendes, Evolution of Networks: From Biological Nets to the Internet and WWW (Oxford University Press, New York, 2003).
[2] R. Albert and A. L. Barabási, Rev. Mod. Phys. 74, 47 (2002).
[3] M. E. J. Newman, SIAM Rev. 45, 167 (2003).
[4] S. Bornholdt and G. Schuster, Handbook of Graphs and Networks (Wiley-VCH, Berlin, 2002).
[5] R. Milo, S. Shen-Orr, S. Itzkovitz, N. Kashtan, D. Chklovskii, and U. Alon, Science 298, 824 (2002).
[6] R. V. Solé, R. Pastor-Satorras, E. D. Smith, and T. Kepler, Adv. Complex Syst. 5, 43 (2002); A. Vazquez, A. Flammini, A. Maritan, and A. Vespignani, Complexus 1, 38 (2003); R. Pastor-Satorras, E. D. Smith, and R. V. Solé, J. Theor. Biol. 222, 199 (2003); J. Kim, P. L. Krapivsky, B. Kahng, and S. Redner, Phys. Rev. E 66, 055101 (2002); K.-I. Goh, B. Kahng, and D. Kim, e-print q-bio.MN/0312009, v2; W. Banzhaf and P. Dwight Kuo, J. Biol. Phys. Chem. 4, 85 (2004).
[7] R. V. Solé and P. Fernandez (unpublished). See also R. Guimera, M. Sales-Pardo, and L. A. N. Amaral, Phys. Rev. E 70, 025101 (2004).
[8] A. Vázquez et al., Proc. Natl. Acad. Sci. U.S.A. 101, 1794 (2004).
[9] B. Hayes, Am. Sci. 89, 204 (2001).
[10] F. Jacob, Science 196, 1161 (1976).
[11] D. Duboule and A. S. Wilkins, Trends Genet. 14, 54 (1998).
[12] R. V. Solé, R. Ferrer, J. M. Montoya, and S. Valverde, Complexity 8, 20 (2002).
[13] U. Alon, Science 301, 1866 (2003).
[14] A. V. Aho, Science 303, 27 (2004).
[15] A. V. Aho, R. Sethi, and J. D. Ullman, Compilers: Principles, Techniques and Tools (Addison-Wesley Longman, Boston, 1986).
[16] S. Valverde, R. Ferrer-Cancho, and R. V. Solé, Europhys. Lett. 60, 512 (2002).
[17] C. F. Kemerer and S. Slaughter, IEEE Trans. Software Eng. 25, 493 (1999).
[18] S. Valverde and R. V. Solé (unpublished).
[19] B. Stroustrup, The C++ Programming Language (Addison-Wesley, Reading, MA, 1986).
[20] E. Gamma, R. Helm, R. Johnson, and J. Vlissides, Design Patterns: Elements of Reusable Object-Oriented Software (Addison-Wesley, New York, 1994).
[21] C. R. Myers, Phys. Rev. E 68, 046116 (2003).
[22] S. Shen-Orr, R. Milo, S. Mangan, and U. Alon, Nat. Genet. 31, 64 (2002).
[23] J. Lakos, Large Scale C++ Software Design (Addison-Wesley, New York, 1996).
[24] S. Itzkovitz, R. Milo, N. Kashtan, G. Ziv, and U. Alon, Phys. Rev. E 68, 026127 (2003).
[25] All software maps analyzed here (and others studied by other authors) have been shown to be sparse. Correlations have also been analyzed in R. V. Solé and S. Valverde, in Complex Networks, edited by E. Ben-Naim, H. Frauenfelder, and Z. Toroczkai, Lecture Notes in Physics (Springer, Berlin, 2004), pp. 169–190. Using statistical measures derived from information theory, it was shown that software maps are considerably uncorrelated.
[26] F. Fioravanti, G. Migliarese, and P. Nesi, in Proceedings of the 23rd International Conference on Software Engineering (ICSE’01), IEEE, May 12–19, Toronto, 2001, edited by Hausi A. Müller (IEEE, New York, 2001).
[27] D. S. Callaway, J. E. Hopcroft, J. M. Kleinberg, M. E. J. Newman, and S. H. Strogatz, Phys. Rev. E 64, 041902 (2001).
[28] The source code is available at the following web sites: http://www.blender.org (Blender), http://filezilla.sourceforge.net (Filezilla), http://www.gtk.org (GTK), and http://exult.sourceforge.net (Exult).
[29] S. Mangan and U. Alon, Proc. Natl. Acad. Sci. U.S.A. 100, 11980 (2003).
[30] R. Ferrer, C. Janssen, and R. V. Solé, Phys. Rev. E 64, 046119 (2001).
[31] R. Ferrer and R. V. Solé, in Statistical Physics of Complex Networks, edited by R. Pastor-Satorras, M. Rubi, and A. Diaz-Guilera, Lecture Notes in Physics (Springer, Berlin, 2003), pp. 114–125.
[32] R. V. Solé and S. Valverde, in Complex Networks, edited by E. Ben-Naim, H. Frauenfelder, and Z. Toroczkai, Lecture Notes in Physics (Springer, Berlin, 2004), pp. 169–190.
[33] H. B. Fraser, A. E. Hirsch, L. M. Steinmetz, C. Scharfe, and M. W. Feldman, Science 296, 750 (2002).


4.8 Logarithmic Growth Dynamics in Software Networks


Europhys. Lett., 72 (5), pp. 858–864 (2005)
DOI: 10.1209/epl/i2005-10314-9

EUROPHYSICS LETTERS 1 December 2005

Logarithmic growth dynamics in software networks

S. Valverde(1) and R. V. Solé(1,2)

(1) ICREA-Complex Systems Lab, Universitat Pompeu Fabra, Dr. Aiguader 80, 08003 Barcelona, Spain
(2) Santa Fe Institute, 1399 Hyde Park Road, Santa Fe, NM 87501, USA

received 16 August 2005; accepted in final form 5 October 2005
published online 3 November 2005

PACS. 89.75.-k – Complex systems.

PACS. 89.65.-s – Social and economic systems.

PACS. 05.10.-a – Computational methods in statistical physics and nonlinear dynamics.

Abstract. – In a recent paper, Krapivsky and Redner (Phys. Rev. E, 71 (2005) 036118) proposed a new growing network model with new nodes being attached to a randomly selected node, as well as to all ancestors of the target node. The model leads to a sparse graph with an average degree growing logarithmically with the system size. Here we present compelling evidence for software networks being the result of a similar class of growing dynamics. The predicted pattern of network growth, as well as the stationary in- and out-degree distributions, are consistent with the model. Our results confirm the view of large-scale software topology being generated through duplication-rewiring mechanisms. Implications of these findings are outlined.

Introduction. – The structure of many natural and artificial systems can be depicted with networks. Empirical studies on these networks have revealed that many of them display a heterogeneous degree distribution $p(k) \approx k^{-\gamma}$, where few nodes (hubs) have a large number of connections while the majority of nodes have one or two links [1]. The existence of hubs has been related to multiplicative effects affecting network evolution [2]. Such topological patterns have been explained by a number of mechanisms, including preferential attachment rules [3] and network models based on simple rules of node duplication [4]. A very simple approach is given by the growing network model with copying (GNC) [5]. The network grows by introducing a single node at a time. This new node links to m randomly selected target node(s) with probability p as well as to all ancestor nodes of each target, with probability q (see fig. 1). The discrete dynamics follows a rate equation [5],

$$L(N+1) = L(N) + \frac{m}{N}\left\langle \sum_{\mu} \left(p + q j_{\mu}\right) \right\rangle, \tag{1}$$

where L and N are the number of links and nodes, respectively. The second term on the right-hand side describes the copying process, where the average number of links added is given by $p + q j_{\mu}$. The index μ refers to the node μ, to be selected uniformly from among the N elements. Assuming a continuum approximation, the number of links is driven by the following differential equation:

$$\frac{dL}{dN} = mp + mq\,\frac{L}{N}. \tag{2}$$

Fig. 1 – (A) Illustration of the copying rule used in the network growth model. Each node is labeled with a number indicating its age (number one is the oldest). In the figure, new node $v_6$ attaches to target node $v_4$ with probability p. This new node inherits every link from the target node (dashed links), with probability q. (B) Synthetic network obtained with the GNC model with N = 100, m = 1, p = 1 and q = 1. (C) Synthetic network obtained with the GNC model with N = 100, m = 4, p = 0.25 and q = 0.25. These networks have a scale-free in-degree distribution and an exponential out-degree distribution.

© EDP Sciences. Article published by EDP Sciences and available at http://www.edpsciences.org/epl or http://dx.doi.org/10.1209/epl/i2005-10314-9
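The copying rule of the GNC model, as described above and in fig. 1, can be illustrated with a short simulation. The sketch below is not the authors' implementation. It assumes that the "ancestors" of a target are its direct out-neighbours, that each candidate link is drawn independently, and that fewer than m targets are used while the network is still smaller than m. The function names are hypothetical.

```python
import random


def grow_gnc(n_nodes, m=1, p=1.0, q=1.0, seed=0):
    """Grow a directed network with the copying rule.

    Returns a list `out` where out[v] is the set of nodes that v links to.
    Node 0 is the oldest node and has no outgoing links.
    """
    rng = random.Random(seed)
    out = [set()]
    for new in range(1, n_nodes):
        links = set()
        n_targets = min(m, new)  # assumption: cannot pick more targets than exist
        for target in rng.sample(range(new), n_targets):
            if rng.random() < p:          # link to the target itself
                links.add(target)
            for ancestor in out[target]:  # copy each of the target's links
                if rng.random() < q:
                    links.add(ancestor)
        out.append(links)
    return out


def count_links(out):
    """Total number of links L in the network."""
    return sum(len(targets) for targets in out)
```

With `m=1, p=1, q=1` the product mq equals 1, which is the marginal regime discussed in the text, so `count_links(grow_gnc(N)) / N` should grow slowly with N. With `q=0` no copying takes place and the network has at most one outgoing link per node.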

The asymptotic growth of the average total number of links depends on the extent of copying defined by the product mq. In particular, logarithmic growth is recovered when mq = 1, and $L(N) = mpN \log N$. This corresponds to a marginal situation separating a domain of linear growth (mq < 1) from a domain of exponential growth (1 < mq < 2). Interestingly, for mq = 1 the GNC model predicts a power law in-degree distribution $P_i(k) \approx k^{-\gamma_i}$ with exponent $\gamma_i = 2$ and an exponential out-degree distribution $P_o(k)$, independently of copying parameters. Actually, the derivation of ref. [5] for the in-degree distribution can be generalized for arbitrary q and p values, leading to a scaling law $P_i(k) \approx k^{-2}$ for the parameter domain of interest. In ref. [5] the authors showed that the GNC model seems to consistently explain the patterns displayed by citation networks. Here, we show that a GNC model is also consistent with the evolution of software designs, which also display the predicted logarithmic growth.
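As a brief check of the marginal case, the following sketch integrates eq. (2) for mq = 1. The initial condition $L(N_0) = L_0$ is an assumption introduced here for illustration; the leading behaviour $mpN\log N$ quoted above follows for large N.

```latex
\begin{align}
\frac{dL}{dN} &= mp + \frac{L}{N}
  && \text{(eq. (2) with } mq = 1\text{)} \\
\frac{d}{dN}\!\left(\frac{L}{N}\right)
  &= \frac{1}{N}\frac{dL}{dN} - \frac{L}{N^{2}} = \frac{mp}{N} \\
\frac{L}{N} &= mp\,\log\frac{N}{N_0} + \frac{L_0}{N_0}
  && \text{(using } L(N_0) = L_0\text{)} \\
L(N) &= mp\,N\log\frac{N}{N_0} + \frac{L_0}{N_0}\,N \;\sim\; mp\,N\log N .
\end{align}
```

The average degree L/N therefore grows logarithmically with N. Substituting a linear growth law $N(t) = N_0 + at$ into the last line gives the time-dependent form used later in eq. (5).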

Software networks. – One of the most important technological networks, together with the Internet and the power grid, is represented by a broad class of software maps. Software actually pervades technological complexity, since the control and stability of transport systems, the Internet and the power grid largely rely on software-based systems. In spite of the multiplicity and diversity of objectives and functionalities addressed by software projects, we have pointed out the existence of strong universals in their internal organization [6]. Computer programs are often decomposed into a collection of text files written in a given programming language. In this paper, we will study computer programs written in C or C++ [7]. Program structure can be recovered from the analysis of program files by means of a directed network representation.


In a software network, software entities (files, classes, functions or instructions) map onto nodes, and links represent syntactical dependencies [6]. Class graphs (also called “logical dependency graphs” [8]) are a particular class of software networks that have been shown to be small-world and scale-free networks with an exponent γ ≈ 2.5 [6, 9, 10]. Interestingly, the frequencies of common motifs displayed in class graphs can be predicted by a very simple duplication-based model of network growth [11]. This result indicates that the topology of technological designs, in spite of being planned and problem-dependent, might actually emerge from common, distributed rules of tinkering [12]. In the following, we provide further evidence for the importance of duplication processes in the evolution of software networks.

Here, we study a new class of software networks. We use the so-called “include graph” (or “physical dependency graph” in [8]) G = (V, E), where $v_i \in V$ is a program file and a directed link $(v_i, v_j) \in E$ indicates a (compile-time) dependency between file $v_i$ and $v_j$. In C and C++, such dependencies are encoded with the keyword “#include” followed by the name of the referenced source file [8]. In order to recover the include graph, we have implemented a network reconstruction algorithm that analyses the contents of all files in the software project looking for this reserved keyword. Every time this keyword is found in a file $v_i$, the name of the referenced file $v_j$ is decoded and a new link $(v_i, v_j)$ is added. No other information is considered by the network reconstruction algorithm. Notice that the include network is unweighted because it makes no difference to include the same file twice.
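The reconstruction step described above can be sketched in a few lines. This is an illustrative assumption of how such a scanner might look, not the authors' algorithm: it matches `#include` directives with a regular expression, keeps the included name exactly as written (no path resolution, no handling of comments or conditional compilation), and the function name is hypothetical.

```python
import re

# Matches lines such as:  #include "foo.h"   or   # include <stdio.h>
INCLUDE_RE = re.compile(r'^\s*#\s*include\s*[<"]([^>"]+)[>"]', re.MULTILINE)


def include_graph(sources):
    """Build the include graph from a mapping {file name: file contents}.

    Returns a set of directed links (v_i, v_j), meaning that v_i includes v_j.
    Using a set makes the network unweighted: repeated includes count once.
    """
    links = set()
    for name, text in sources.items():
        for included in INCLUDE_RE.findall(text):
            links.add((name, included))
    return links
```

In practice `sources` would be filled by reading every file of one project version from disk; the node set V is then the set of file names that appear in the project.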

In this paper, we investigate the structure and evolution of software maps by looking at their topological structure and the time series of aggregate topological measures, such as number of nodes N(t), number of links L(t) or average degree k(t) = L(t)/N(t). It is worth mentioning that the number of nodes in an include graph coincides with the number of files in the software project, which is often used as a measure of project size.

Software maps typically display asymmetries in their in- and out-degree distributions [9,10], although the origins of such asymmetry remained unclear. Notice how the out-degree and in-degree distributions of real include networks are quite similar to the corresponding distributions obtained with the GNC model (see previous section). The in-degree and out-degree distributions for the largest component of two different systems (see fig. 2A) are shown in figs. 2B and C, where we have used the cumulative distributions $P_>(k) = \int_k^{\infty} P(k')\,dk'$. In both cases, in-degree distributions display scaling $P_i(k) \approx k^{-\gamma_i}$, where the estimated exponent is consistent with the prediction from the GNC model, whereas out-degree distributions are single-scaled (here the average value for the systems analysed is $\langle\gamma_i\rangle = 2.08 \pm 0.04$ [13]). As shown in the next section, these stationary distributions result from a logarithmic growth dynamics consistent with the GNC model.
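The cumulative distributions used in fig. 2 can be computed directly from a list of node degrees. The sketch below uses the discrete counterpart of the integral, i.e. the fraction of nodes with degree at least k; the function names are hypothetical. Note that if $P(k) \sim k^{-\gamma}$ then the cumulative distribution decays as $k^{-(\gamma-1)}$, so an exponent fitted on the cumulative curve must be increased by one.

```python
from collections import Counter


def in_degrees(links, nodes):
    """In-degree of every node, given directed links (v_i, v_j)."""
    counts = Counter(target for _, target in links)
    return [counts[v] for v in nodes]


def cumulative_distribution(degrees):
    """Map each observed degree k to the fraction of nodes with degree >= k."""
    n = len(degrees)
    counts = Counter(degrees)
    remaining = n
    result = {}
    for k in sorted(counts):
        result[k] = remaining / n
        remaining -= counts[k]
    return result
```

For example, a cumulative in-degree exponent of 0.97 corresponds to $\gamma_i \approx 1.97$, close to the value 2 predicted by the GNC model.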

Software evolution. – Although an extensive literature on software evolution exists (see, for example, [14, 15]), few quantitative predictions have been presented so far. Most studies are actually descriptive, and untested conjectures about the nature of the constraints acting on the growth of software structures abound. It is not clear, for example, if the large-scale patterns are due to external constraints, path-dependent processes or specific functionalities. In order to answer these questions, we have compared real software evolution with models of network growth, where software size is measured as the number of nodes in the corresponding include graph. In this context, the assumptions of the GNC model are consistent with observations claiming that code cloning is a common practice in software development [15]. Indeed, comparison between real include graphs and those generated with the GNC model suggests that the extent of copying performed during software evolution is a key parameter that explains the overall pattern of software growth. Such a situation has also been found in class diagrams [11].


[Figure 2: panel (A), network drawing; panels (B) and (C), log-log plots of the cumulative distribution versus degree k.]

Fig. 2 – (A) The largest connected component of the XFree86 include network at 15/05/1994 (with N = 393) displays scale-free behavior (see text). In (B), the cumulative distributions $P_{i>}(k)$ and $P_{o>}(k)$ are shown for a more recent version of the XFree86 include network with N = 1299 (not shown here). The power law fit of the in-degree distribution yields $P_i(k) \sim k^{-\gamma_i^c - 1}$ with $\gamma_i^c = 0.97 \pm 0.01$, while the out-degree distribution is exponential. In (C) we can notice similar features for the in-degree and out-degree distributions of the Aztec include network at 29/3/2003. For this system, the power law fit of the in-degree distribution yields an exponent $\gamma_i^c = 1.22 \pm 0.03$.

The growth dynamics found in include graphs is logarithmic (see fig. 3A), thus indicating that we are close to the mq = 1 regime. Indeed, the sparseness seen in software maps is likely to result from a compromise between having enough dependencies to provide diversity and complexity (which require more links) and evolvability and flexibility (requiring fewer connections). Here we have uneven, but detailed information of the process of software building. In this context, the development of different software projects displays specific patterns of growth. Specifically, the number of nodes N grows with time following a case-dependent functional form $N = \Phi(t)$. Using $dL/dt = (dL/dN)(d\Phi/dt)$, we have, from eq. (2),

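$$\frac{dL}{dt} = \left[mp + mq\,\frac{L}{\Phi(t)}\right]\dot{\Phi}(t) \tag{3}$$

with a general solution

$$L(t) = e^{mq\int \dot{\Phi}\,\Phi^{-1}dt}\left[mp\int e^{-mq\int \dot{\Phi}\,\Phi^{-1}dt}\,\dot{\Phi}\,dt + \Gamma\right], \tag{4}$$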

where Γ is a constant. Using a linear growth law (which is not uncommon in software development), i.e. $N(t) = N_0 + at$, and assuming mq = 1, we have

$$L(t) = (N_0 + at)\left[mp\,\log\left(\frac{N_0 + at}{N_0}\right) + \frac{L_0}{N_0}\right]. \tag{5}$$

However, typical time series of L(t) in real software evolution are subject to fluctuations (see fig. 3A). In order to reduce the impact of fluctuations, we use the cumulative average degree $K(t) = \int_0^t (L/N)\,dt$ instead. Assuming the number of nodes grows linearly in time, we obtain

$$K(t) = \frac{mp\,(N_0 + at)}{a}\left[\log\left(\frac{N_0 + at}{N_0}\right) - 1\right] + \frac{L_0}{N_0}\,t + \frac{mp\,N_0}{a}. \tag{6}$$
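The closed forms in eqs. (5) and (6) are easy to cross-check numerically. The sketch below evaluates both expressions and compares eq. (6) with a direct trapezoidal integration of L/N. The function names are hypothetical, and the parameter values in the usage note are the XFree86 fitting parameters quoted in the caption of fig. 3.

```python
import math


def links(t, n0, a, mp, l0):
    """Eq. (5): number of links for linear node growth and mq = 1."""
    n = n0 + a * t
    return n * (mp * math.log(n / n0) + l0 / n0)


def cumulative_degree(t, n0, a, mp, l0):
    """Eq. (6): closed form of the integral of L/N from 0 to t."""
    n = n0 + a * t
    return (mp * n / a) * (math.log(n / n0) - 1.0) + (l0 / n0) * t + mp * n0 / a


def cumulative_degree_numeric(t, n0, a, mp, l0, steps=10000):
    """Trapezoidal integration of L/N, used only as a cross-check of eq. (6)."""
    h = t / steps
    total = 0.0
    for i in range(steps + 1):
        s = i * h
        k = links(s, n0, a, mp, l0) / (n0 + a * s)
        total += k if 0 < i < steps else 0.5 * k
    return total * h
```

For instance, with `n0=622.17, a=0.0086, mp=2.20, l0=1419.8` and `t=1e5` hours, `cumulative_degree` and `cumulative_degree_numeric` should agree closely, and `links(0, ...)` returns the initial number of links.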


[Figure 3: panel (A) plots N(t) and L(t) against time (hours), with the events $t_1$ and $t_2$ marked; panel (B) plots the cumulative k(t) against time (hours); inset (C) is a double logarithmic plot.]

Fig. 3 – (A) The top curve shows the comparison between the time evolution of the number of links L(t) in XFree86 between 16/05/1994 and 01/06/2005 (points) and the prediction of eq. (5) (dashed line). In the bottom curve we compare the time evolution of system size N(t) and its linear fitting $N(t) = N_0 + at$ (dashed line). We observe an anomalous growth pattern followed by a discontinuity (here indicated as $t_1$ and $t_2$) in L(t). Notice how $t_2$ signals a discontinuity both in L(t) and N(t), while discontinuity $t_1$ only takes place in L(t). (B) Comparison between time evolution of the cumulative average degree in XFree86 during the same time period as in (A) and the analytic prediction of eq. (6). (C) The inset shows the same data as in (B) but in a double logarithmic plot. The fitting parameters are: $N_0 = 622.17 \pm 10.92$, $a = 0.0086 \pm 0.0002$, $L_0 = 1419.8 \pm 4.1$, and $mp = 2.20 \pm 0.01$. Time is measured in hours.

The above expressions can be employed to estimate the parameters $L_0$ and mp describing the shape of the logarithmic growth of the number of links L(t) and the parameters $N_0$ and a controlling the linear growth of the number of nodes N(t). We used the following fitting procedure. For each software project, we have recovered a temporal sequence $\{G_t = (V_t, E_t) \mid 0 \le t \le T\}$ of include networks corresponding to different versions of the software project. Time is measured in elapsed hours since the first observed project version (which may or may not coincide with the beginning of the project). This temporal sequence describes the evolution of the software project under study. From this sequence, we compute the evolution of the number of nodes $n_0, n_1, n_2, ..., n_T$, the evolution of the number of links $l_0, l_1, l_2, ..., l_T$ and the evolution of the average degree $k_0, k_1, ..., k_i = l_i/n_i, ..., k_T$. In general, available data is a partial set of records of development histories and often misses the initial project versions corresponding to the early evolution. The first observed version therefore does not coincide with the true start of the project, which explains why the initial observations for $n_0$ and $l_0$ are higher than expected. However, we have rescaled time so that the first datapoint corresponds to zero. We have collected partial(1) evolution registers for seven different projects (relevant time period is in parentheses): XFree86 (16/5/94–1/6/05), Postgresql (1/1/95–1/12/04), DCPlusPlus (1/12/01–15/12/04), TortoiseCVS (15/1/01–1/6/05), Aztec (22/3/01–14/4/03), Emule (6/7/02–26/7/05) and VirtualDub (15/8/00–10/7/05) [13]. The full database comprises 557 include networks (see table I).

(1) Actually, these datasets constitute a coarse sampling of the underlying process of software change. Collecting software evolution data at the finest level of resolution requires a monitoring system that automatically tracks all changes made by programmers. Instead, it is often the programmer who decides when a software register is created. The issue of fine-grained sampling is an open research question in empirical software engineering that deserves more attention. These limitations preclude us from a more direct testing of the GNC model.

Table I – Predictions of eq. (6) for different systems.

Project        a                  N0                mp            L0                T
XFree86        0.0086 ± 0.0001    622.17 ± 10.92    2.20 ± 0.01   1419.80 ± 4.09    243
Postgresql     0.0066 ± 0.0002    601.42 ± 11.35    1.78 ± 0.05   243.89 ± 8.46     31
DCPlusPlus     0.004 ± 0.0001     101.51 ± 2.42     0.70 ± 0.03   338.96 ± 1.30     74
TortoiseCVS    0.0057 ± 0.0001    97.57 ± 2.62      1.59 ± 0.02   105.76 ± 1.58     107
Aztec          0.026 ± 0.002      205.12 ± 22.17    0.97 ± 0.03   622.61 ± 4.77     14
Emule          0.016 ± 0.0006     98.01 ± 6.37      1.65 ± 0.11   223.80 ± 9.34     54
VirtualDub     0.0079 ± 0.0004    167.04 ± 12.44    1.34 ± 0.05   381.50 ± 5.16     35

Then, we proceed as follows. First, for each software project, its time series for the number of nodes is fitted under the assumption of linear growth, i.e. $N(t) = N_0 + at$, thus yielding $N_0$ (initial number of nodes) and a (rate of addition of new files). In table I, we can appreciate that the majority of projects grow at a rate a of the order of $10^{-3}$ files/hour, while two medium-size projects (Aztec and Emule) actually grow an order of magnitude faster. Next, we compute the time series of cumulative average degree K(t) by numerically integrating the sequence of $k_t$ values. This new sequence is fitted with eq. (6) in order to estimate the parameters $L_0$ (initial number of links) and the product mp controlling the extent of duplication.
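The fitting procedure described above can be sketched as three small steps. The code below is an illustration under stated assumptions, not the authors' fitting code: it uses ordinary least squares for the linear growth law, a trapezoidal rule for the numerical integration, and the observation that eq. (6) is linear in mp and $L_0$ once $N_0$ and a are fixed, so those two parameters follow from a two-parameter linear least-squares fit. All function names are hypothetical.

```python
import math


def fit_linear_growth(ts, ns):
    """Least-squares fit of N(t) = N0 + a*t. Returns (N0, a)."""
    n = len(ts)
    mean_t = sum(ts) / n
    mean_n = sum(ns) / n
    a = (sum((t - mean_t) * (x - mean_n) for t, x in zip(ts, ns))
         / sum((t - mean_t) ** 2 for t in ts))
    return mean_n - a * mean_t, a


def cumulative_average_degree(ts, ks):
    """Running trapezoidal integral K(t) of the average degree k(t) = L(t)/N(t)."""
    big_k = [0.0]
    for i in range(1, len(ts)):
        big_k.append(big_k[-1] + 0.5 * (ks[i] + ks[i - 1]) * (ts[i] - ts[i - 1]))
    return big_k


def fit_eq6(ts, big_k, n0, a):
    """Fit eq. (6), written as K = mp*f(t) + L0*g(t). Returns (mp, L0)."""
    fs, gs = [], []
    for t in ts:
        n = n0 + a * t
        fs.append((n / a) * (math.log(n / n0) - 1.0) + n0 / a)
        gs.append(t / n0)
    sff = sum(f * f for f in fs)
    sgg = sum(g * g for g in gs)
    sfg = sum(f * g for f, g in zip(fs, gs))
    sfk = sum(f * k for f, k in zip(fs, big_k))
    sgk = sum(g * k for g, k in zip(gs, big_k))
    det = sff * sgg - sfg * sfg  # determinant of the 2x2 normal equations
    return (sfk * sgg - sgk * sfg) / det, (sgk * sff - sfk * sfg) / det
```

Given the observed series `ts`, `ns` and `ls`, one would call `fit_linear_growth(ts, ns)`, then `cumulative_average_degree(ts, [l / n for l, n in zip(ls, ns)])`, and finally `fit_eq6` with the fitted $N_0$ and a. This sketch returns point estimates only; the uncertainties reported in table I would require an additional error analysis.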

In fig. 3B we show the result of the previous fitting procedure to the time series of cumulative average degree K(t) in XFree86, a popular and freely redistributable open source implementation of the X Window System [13]. As shown in the figure, the agreement between theory and data is very good. We have validated the same logarithmic growth pattern in the evolution of other software systems (see table I). In particular, we provide a prediction for the average number of links to target nodes, mp, which is found to be small. This is again expected from the sparse graphs that are generated through the growth process.

Together with the overall trends, we also see deviations from the logarithmic growth followed by reset events. In fig. 3A we can appreciate a pattern of discontinuous software growth in the number of links L(t) for XFree86. The time interval delimited by $t_1$ and $t_2$ is the signature of a well-known major redesign process that enabled 3D rendering capabilities in XFree86. This new feature of XFree86 was called Direct Rendering Infrastructure (DRI). Development of DRI is clearly visible in the time series of L(t). At $t_1$ (i.e., August 1998) the design of DRI was officially initiated and the event $t_2$ (i.e., July 1999) corresponds to the first public release of the DRI technology (i.e., DRI 1.0) [16]. A careful look at the time series L(t) shows that before the discontinuities (indicated by $t_1$ and $t_2$), some type of precursor patterns were detectable.

The above example suggests how deviations from the logarithmic growth pattern can predict future episodes of costly internal reorganization (so-called refactorings [17]). In XFree86, the integration of DRI was a costly redesign process characterized by an exponential growth pattern in the number of links L(t). This accelerated growth pattern starts at $t_1$ and finishes in a clearly visible discontinuity (indicated here by $t_2$) that signals a heavy removal of links. After $t_2$ we observe a pattern of fast recovery eventually returning to the logarithmic trend described by eq. (5) (dashed lines in fig. 3A). This type of reset pattern has also been found in economic fluctuations in the stock market [18]. This trend needs to be explained and might actually result from conflicting constraints leading to some class of marginal equilibrium state. This is actually in agreement with the patterns of activity change displayed by the community of software developers (unpublished results), which also exhibit scale-free fluctuations.

∗ ∗ ∗

We thank our colleague J. F. Sebastian for useful suggestions. This work has been supported by grant BFM2001-2154, by the EU within the 6th Framework Program under contract 001907 (DELIS), and by the Santa Fe Institute.


REFERENCES

[1] Dorogovtsev S. N. and Mendes J. F. F., Evolution of Networks: From Biological Nets to the Internet and WWW (Oxford University Press, New York) 2003.

[2] Barabasi A.-L. and Albert R., Science, 286 (1999) 509.

[3] Caldarelli G., Capocci A., De Los Rios P. and Munoz M. A., cond-mat/0207366 v2 (2002).

[4] Sole R. V., Pastor-Satorras R., Smith E. D. and Kepler T., Adv. Complex Syst., 5 (2002) 43; Vazquez A., Flammini A., Maritan A. and Vespignani A., Complexus, 1 (2003) 38; Pastor-Satorras R., Smith E. D. and Sole R. V., J. Theor. Biol., 222 (2003) 199; Kim J., Krapivsky P. L., Kahng B. and Redner S., Phys. Rev. E, 66 (2002) 055101.

[5] Krapivsky P. L. and Redner S., Phys. Rev. E, 71 (2005) 036118.

[6] Valverde S., Ferrer-Cancho R. and Sole R. V., Europhys. Lett., 60 (2002) 512.

[7] Stroustrup B., The C++ Programming Language (Addison-Wesley) 1985.

[8] Lakos J., Large Scale C++ Software Design (Addison-Wesley, New York) 1996.

[9] Valverde S. and Sole R. V., Santa Fe Inst. Working Paper, SFI/03-07-044 (2003).

[10] Myers C. R., Phys. Rev. E, 68 (2003) 046116.

[11] Valverde S. and Sole R. V., Phys. Rev. E, 72 (2005) 026107.

[12] Sole R. V., Ferrer R., Montoya J. M. and Valverde S., Complexity, 8 (2002) 20.

[13] XFree86 (http://www.xfree86.org); Postgresql (http://www.postgresql.org); DCPlusPlus (http://dcplusplus.sourceforge.net); TortoiseCVS (http://www.tortoisecvs.org); Aztec (http://aztec.sf.net); Emule (http://www.emule-project.net); VirtualDub (http://www.virtualdub.org).

[14] Belady L. A. and Lehman M. M., IBM Systems J., 15 (1976) 225.

[15] Godfrey M. and Tu Q., Proceedings of the 2001 International Workshop on Principles of Software Evolution (IWPSE-01), Vienna (2001).

[16] Hammel M. J., Linux Magazine, December (2001).

[17] Fowler M., Beck K., Brant J. and Opdyke W., Refactoring (Addison-Wesley, Boston) 1999.

[18] Sornette D. and Johansen A., Physica A, 245 (1997) 411.


Chapter 5

Summary of main results

As the conclusion of this thesis, we present here the main achievements obtained by following the research agenda outlined in the first chapter.

1. We have compared artificial design and natural evolution on the basis of statistical regularities observed in the topology of their products (Sole et al., 2002). We have developed network measurements based on Information Theory (Sole and Valverde, 2004). The application of these metrics to artificial and natural networks indicates the existence of strong constraints that limit the space of possible networks.

Related papers:

• R. V. Sole, R. Ferrer-Cancho, J. M. Montoya and S. Valverde, “Selection, Tinkering and Emergence in Complex Networks”, Complexity 8, 20-33 (2002).

• R. V. Sole and S. Valverde, “Information theory of complex networks: on evolution and architectural constraints”, In: Networks: Structure, Dynamics and Function, Lecture Notes in Physics, Springer-Verlag, 169-190 (2004).

2. Our lattice models of information transport exhibit nontrivial scaling properties at the onset of criticality, which reproduce some observed features of the real Internet, such as the self-similarity of time series fluctuations. In addition, we observe maximum information transfer and efficiency at the critical point separating the congested and free-flow phases (Sole and Valverde, 2001). Our results strongly indicate that the superposition of a very large number of highly heterogeneous sources (i.e., with infinite variance) is not the only explanation for Internet self-similarity. Indeed, self-similar traffic can be reached spontaneously when activity is actively regulated towards the critical point (Valverde and Sole, 2002), independently of topological features.

Related papers:


• R. V. Sole and S. Valverde, “Information transfer and phase transitions in a model of Internet Traffic”, Physica A 289, 595-605 (2001).

• S. Valverde and R. V. Sole, “Self-organized critical traffic in parallel computer networks”, Physica A 312, 636-648 (2002).

3. We have extended lattice models of network traffic to a realistic Internet model with spatially heterogeneous topology. This enhanced model allowed us to explore the impact of topology on routing dynamics. Congestion seems to be an inevitable result of users’ behavior coupled to the network dynamics, and its effects should be minimized by choosing appropriate routing algorithms. We have reported the existence of a critical path horizon defining a transition from low to highly efficient traffic (Valverde and Sole, 2004). The probability distribution of Internet end-to-end performance is recovered at the critical path horizon. This transition is actually a consequence of the Internet’s small-world architecture being exploited by the routing algorithm. In addition, an analysis of fluctuations at the critical path horizon strongly supports the endogenous origin of Internet dynamics.

Related papers:

• S. Valverde and R. V. Sole, “Internet’s Critical Path Horizon”, European Physics Journal 38(2), 245-252 (2004).

4. This is the first time that software structure has been shown to be small-world and scale-free with exponent γ ∼ 2.5 (Valverde, Ferrer-Cancho and Sole, 2002). In order to assess the universality of this feature, we have collected the largest dataset employed in a network study to date (Valverde and Sole, 2003). Indeed, the exponent is universal and its value is largely independent of functional and organizational details. Software structure is typically sparse and its components are very well connected through short chains of relationships, probably a consequence of conscious and optimal design that seeks to minimize the interactions between software components.

Related papers:

• S. Valverde, R. Ferrer-Cancho and R. V. Sole, “Scale free networks from optimal design”, Europhysics Letters 60, 512-517 (2002).

• S. Valverde and R. V. Sole, “Hierarchical Small Worlds in software architecture”, submitted to IEEE Transactions on Software Engineering (2005). Also Santa Fe Institute working paper SFI/03-07-044.

5. Our studies indicate that network motifs act like fingerprints of development processes. Our recent study of software motifs (Valverde and Sole, 2005) has shown that both motif frequencies and heterogeneous software topology are largely a consequence of simple building rules rooted in the tinkering behavior of human developers. As a consequence, we do not require the assumption of conscious and optimal design in order to explain the scale-free behavior of software.

Related papers:

• S. Valverde and R. V. Sole, “Network motifs in computational graphs: a case study in software architecture”, Physical Review E 72, 026107 (2005).

6. Based on our findings, we have successfully modeled the evolution of software architecture. The scale-free invariant observed in software systems has allowed us to predict the growth of software systems, where the number of links is constrained to follow a logarithmic trend (Valverde and Sole, 2005b). We have shown that deviations from this trend anticipate costly changes (i.e., refactorings). The model also explains the typical in-degree and out-degree asymmetry observed in real software architectures.

Related papers:

• S. Valverde and R. V. Sole, “Logarithmic growth dynamics in software networks”, Europhysics Letters, 72 (5), pp. 858-864 (2005).
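To make the notion of a scale-free exponent (item 4 above) more concrete, the sketch below shows one common way in which an exponent such as γ ∼ 2.5 could be estimated from a sample. It uses the standard continuous maximum-likelihood estimator, γ ≈ 1 + n / Σ ln(x_i / x_min), which is not necessarily the procedure followed in the papers listed above. The function names, the synthetic sample, the seed and the choice x_min = 1 are assumptions made purely for illustration, and real degree sequences are discrete, so this continuous form is only an approximation.

```python
import math
import random

def sample_power_law(n, gamma, x_min, rng):
    """Draw n values from a continuous power law p(x) ~ x^(-gamma), x >= x_min."""
    # Inverse-transform sampling: x = x_min * (1 - u)^(-1 / (gamma - 1)).
    return [x_min * (1.0 - rng.random()) ** (-1.0 / (gamma - 1.0))
            for _ in range(n)]

def estimate_exponent(values, x_min):
    """Continuous maximum-likelihood estimate of gamma for values >= x_min."""
    tail = [x for x in values if x >= x_min]
    log_sum = sum(math.log(x / x_min) for x in tail)
    return 1.0 + len(tail) / log_sum

# Hypothetical synthetic "degree" sample with a known exponent of 2.5.
rng = random.Random(42)
degrees = sample_power_law(20000, 2.5, 1.0, rng)
gamma_hat = estimate_exponent(degrees, 1.0)
print(round(gamma_hat, 2))
```

For a sample of this size the estimate should fall close to the exponent used to generate the data; with empirical networks the choice of x_min strongly affects the result and has to be justified separately.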


Chapter 6

Glossary

Average Path Length. The mean length of the shortest paths between all pairs of nodes in a network.

Betweenness. The number of shortest paths that the focal node lies on.

Bipartite Network. A network with two distinct types of nodes.

Bug. A computer bug is an error, flaw, mistake, failure or fault in a computer program that prevents it from working correctly or produces an incorrect result. Bugs arise from mistakes and errors, made by people, in either a program’s code or its design.

Class. In object-oriented programming, a class encapsulates data objects as well as the methods which manipulate the data; such methods are sometimes described as “class behaviour”.

Closeness. The mean shortest path between a focal node and all other nodes in the network.

Component. A group of nodes that is mutually interconnected.

Code. See Source Code.

Degree/Connectivity. The number of edges that connect the focal node to other nodes.

Degree Distribution. The frequency distribution of the individual node degree for an entire network.

Diameter. See Average Path Length.

Directed Graph. Nodes in a directed graph are connected by an asymmetric relationship, such as predation.


Distance. See Average Path Length.

Edge. Interacting nodes are connected by edges; every edge is described with an unordered pair of two nodes.

Encapsulation. In computer programming, the process of combining elements to create a new software entity. For instance, a subroutine is a type of encapsulation because it combines a series of machine instructions.

Graph Theory. A branch of mathematics dealing primarily with the statistical description of static networks.

Host. A node that acts as a source and sink of computer traffic.

Hub. A node having many connections.

Latency. The time it takes for a packet to get from one designated node to another in a network.

Link. Nodes in a directed graph are connected by links; every link is described with an ordered pair of two nodes. The ordering indicates the direction of the asymmetric relationship, such as information flow.

Long-tailed Distribution. Any degree distribution that decreases more slowly than exponential over a portion of the range.

Method. In computer programming, a method is a synonym for action, procedure, function or subroutine. In object-oriented programming, it is a named sequence of instructions responding to certain messages.

Motif. A small subgraph within a network.

Node. An individual element within a network.

Object-oriented Programming. A computer programming paradigm where the computer program is composed of a collection of individual units, or objects, as opposed to a traditional view in which a program is a list of instructions to the computer. Each object is capable of receiving messages, processing data, and sending messages to other objects.

Packet. The minimum unit of information transmitted over the Internet.

Power Spectrum. Analysis technique describing a signal as a plot of its power against frequency.

Poisson Degree Distribution. A network formed by randomly connecting a fixed number of nodes has a Poisson degree distribution. Such a symmetrical distribution is characterized by a modal hump at the mean degree with exponentially decreasing tails.

Power-law Degree Distribution. A network with a degree distribution described by a long-tailed distribution; also called scale-free distributions because there is no modal hump.

Preferential Attachment Model. The formation of a network by preferentially connecting new nodes to heavily connected nodes.

Probability Distribution. In mathematics, a probability distribution assigns to every interval of the real numbers a probability, so that the probability axioms are satisfied.

Programmer. In computing, a programmer is someone who does computer programming and develops computer software. A programmer may be considered a software engineer or software developer.

Programming Language. This is a standardized communication technique for expressing instructions to a computer. It is a set of syntactic and semantic rules used to define computer programs. Examples of programming languages are C, C++, BASIC, Java or FORTRAN.

Network. A mathematical object having a set of nodes and a set of node pairs, called edges.

Reuse. A computer programming paradigm in which one writes small bits of code to accomplish a common task. The same code can then be reused in a later project, saving the programmer time and energy.

Router. A node that stores and forwards packets to their destinations.

Shortest Path. The path that traverses the minimum number of edges between two nodes.

Small-World. A network displays small-world behaviour if its average path length is small relative to its size and its clustering coefficient is higher than random expectation.

Source Code. Any series of statements written in some human-readable computer programming language.

Subroutine. In computer science, a subroutine (function, procedure or subprogram) is a sequence of instructions which performs a specific task, as part of a larger program. Subroutines can be “called”, thus allowing programs to access the subroutine repeatedly without the subroutine’s code having been written more than once.

Undirected Graph. Nodes in an undirected graph are connected by a symmetric relationship, such as physical interactions.

Waiting Time. See Latency.
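Several of the graph measures defined in this glossary can be illustrated on a toy example. The sketch below computes the degree distribution, the closeness of a node (using the glossary's definition, i.e. the mean shortest path from the focal node to all others) and the average path length of a small undirected graph by breadth-first search. The graph, the node labels and the function names are hypothetical, and the code assumes a connected graph.

```python
from collections import deque

def bfs_distances(adj, source):
    """Shortest-path lengths (in edges) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def closeness(adj, node):
    """Mean shortest path between the focal node and all other nodes."""
    dist = bfs_distances(adj, node)
    others = [d for n, d in dist.items() if n != node]
    return sum(others) / len(others)

def average_path_length(adj):
    """Mean shortest path over all pairs of nodes (connected graph assumed)."""
    return sum(closeness(adj, n) for n in adj) / len(adj)

def degree_distribution(adj):
    """Map each degree k to the number of nodes having that degree."""
    counts = {}
    for n in adj:
        k = len(adj[n])
        counts[k] = counts.get(k, 0) + 1
    return counts

# Hypothetical toy graph: the path a-b-c-d plus an extra node e attached to b.
adj = {
    "a": ["b"],
    "b": ["a", "c", "e"],
    "c": ["b", "d"],
    "d": ["c"],
    "e": ["b"],
}

print(degree_distribution(adj))
print(closeness(adj, "b"))
print(average_path_length(adj))
```

Here node b, the best connected node, also has the smallest mean distance to the rest of the graph, which is the sense in which hubs shorten paths in the networks discussed in this thesis.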


Bibliography

P. Alberch. The logic of monsters: Evidence for internal constraint in development and evolution. Geobios, 19:21–57, 1989.

M. Argollo de Menezes and A.-L. Barabasi. Fluctuations in network dynamics. Phys. Rev. Lett., 92(2), 2004.

A.-L. Barabasi and R. Albert. Emergence of Scaling in Random Networks. Science, 286:509–512, 1999.

S. N. Dorogovtsev and J. F. F. Mendes. Evolution of Networks. Oxford University Press, New York, 2003.

R. Ferrer i Cancho, C. Janssen, and R. V. Sole. The topology of technology graphs: small world pattern in electronic circuits. Phys. Rev. E, 64:046119, 2001.

K. Fukuda, H. Takayasu, and M. Takayasu. Spatial and temporal behavior of congestion in the internet. Fractals, 7(1):23–31, 1999.

B.C. Goodwin. How the Leopard Changed Its Spots: the Evolution of Complexity. Charles Scribner’s Sons, New York, 1994.

S. J. Gould. The Structure of Evolutionary Theory. Belknap, Harvard, 2003.

H. Inose. Communication networks. Sci. Am., 3:117–128, 1972.

F. Jacob. Evolution and tinkering. Science, 196:1161–1166, 1976.

S. Kauffman. The Origins of Order: Self-Organization and Selection in Evolution. Oxford Univ. Press, New York, 1993.

S. Kauffman and S. Levin. Towards a general theory of adaptive walks on rugged landscapes. J. Theoret. Biol.

P. Krapivsky and S. Redner. Network growth by copying. Phys. Rev. E, 71:036118, 2005.

G. R. McGhee. Theoretical Morphology: The Concept and Its Application. Columbia Univ. Press, New York, 1999.


K. J. Niklas. Plant Allometry: The Scaling of Form and Process. Chicago Univ. Press, Chicago, 1994.

K. J. Niklas. Adaptive walks through fitness landscapes for early vascular land plants. American Journal of Botany, 84(1):16–25, 1997.

R. Pastor-Satorras and A. Vespignani. Evolution and Structure of the Internet: A Statistical Physics Approach. Cambridge Univ. Press, Cambridge, 2004.

R. Pastor-Satorras, E. D. Smith, and R. V. Sole. Evolving protein interaction networks through gene duplication. J. Theor. Biol., 222:199–210, 2003.

I. Rodriguez-Iturbe and A. Rinaldo. Fractal River Basins: Chance and Self-Organization. Cambridge University Press, New York, 1997.

R. V. Sole and S. Valverde. Information theory of complex networks: On evolution and architectural constraints. In E. Ben-Naim, H. Frauenfelder, and Z. Toroczkai, editors, Networks: Structure, Dynamics and Function, Lecture Notes in Physics. Springer-Verlag, Berlin, 2004.

R. V. Sole and S. Valverde. Information transfer and phase transitions in a model of internet traffic. Physica A, 289:595–605, 2001.

R. V. Sole, R. Ferrer i Cancho, J. M. Montoya, and S. Valverde. Selection, tinkering and emergence in complex networks. Complexity, 8:20–33, 2002a.

R. V. Sole, R. Pastor-Satorras, E. Smith, and T. Kepler. A model of large-scale proteome evolution. Adv. Complex Systems, 5:43–54, 2002b.

T. Standage. The Victorian Internet: The Remarkable Story of the Telegraph and the Nineteenth Century’s On-Line Pioneers. Walker Publishing Company, New York, 1998.

S. Valverde and R. V. Sole. Internet’s critical path horizon. European Physics Journal B, 38(2):245, 2004.

S. Valverde and R. V. Sole. Self-organized critical traffic in parallel computer networks. Physica A, 312:636–648, 2002.

S. Valverde and R. V. Sole. Network motifs in computational graphs: A case study in software architecture. Phys. Rev. E, 72:026107, 2005.

S. Valverde and R. V. Sole. Logarithmic growth dynamics in software networks. Europhys. Lett., 72(5):858–864, 2005.

S. Valverde, R. Ferrer i Cancho, and R. V. Sole. Scale free networks from optimal design. Europhys. Lett., 60:512–517, 2002.


D. Watts and S. H. Strogatz. Collective dynamics of ’small-world’ networks. Nature, 393:440–442, 1998.

W. Willinger, M. S. Taqqu, R. Sherman, and D. V. Wilson. Self-similarity through high-variability: statistical analysis of ethernet lan traffic at the source level. IEEE/ACM Trans. on Networking, 25(4):100–113, 1995.

S.-H. Yook, H. Jeong, and A.-L. Barabasi. Modeling the internet’s large-scale topology. Proc. Natl. Acad. Sci. USA, 99:13382, 2002.