
KTH School of Computer Science and Communication (CSC)
TRITA-NA-P0513 · CBN · ISSN 0348-2952

Computational Biology and Neurocomputing

Massively parallel simulation of brain-scale neuronal network models

Mikael Djurfeldt, Christopher Johansson, Örjan Ekeberg, Martin Rehn, Mikael Lundqvist, and Anders Lansner


Report number: TRITA-NA-P0513, CBN
Publication date: December 2005
E-mail of author: [email protected]

Reports can be ordered from:

Computational Biology and Neurocomputing
School of Computer Science and Communication (CSC)
Royal Institute of Technology (KTH)
SE-100 44 Stockholm
SWEDEN

telefax: +46 8 790 09 30
http://www.csc.kth.se/


Massively parallel simulation of

brain-scale neuronal network models

Mikael Djurfeldt, Christopher Johansson, Örjan Ekeberg,

Martin Rehn, Mikael Lundqvist, and Anders Lansner

KTH – School of Computer Science and Communication

and

Stockholm University

SE-100 44 Stockholm

SWEDEN

December 23, 2005

Abstract

Biologically detailed computational models of large-scale neuronal networks have now become feasible due to the development of increasingly powerful massively parallel supercomputers. We report here on the methodology involved in simulation of very large neuronal networks. Using conductance-based multicompartmental model neurons based on Hodgkin-Huxley formalism, we simulate a neuronal network model of layers II/III of the neocortex. These simulations, the largest of this type ever performed, were made on the Blue Gene/L supercomputer and comprised up to 8 million neurons and 4 billion synapses. Such model sizes correspond to the cortex of a small mammal. After a series of optimization steps, performance measurements show linear scaling behavior both on the Blue Gene/L supercomputer and on a more conventional cluster computer. Results from the simulation of a model based on a more abstract formalism, and of considerably larger size, also show linear scaling behavior on both computer architectures.


Contents

1 Introduction
  1.1 Computational modeling in neuroscience
  1.2 Could computer science learn from neuroscience?
  1.3 Level of abstraction
  1.4 Why do we need large-scale models?
  1.5 Parallel neuronal network simulators

2 Cluster computers

3 The SPLIT simulator
  3.1 Original design
  3.2 Optimizing SPLIT for cluster computers
    3.2.1 Stricter adherence to the MPI standard
    3.2.2 Identifying scaling problems—memory and time complexity O(ns)
    3.2.3 Distributing creation of synapses and setting of their parameters to slave processes
    3.2.4 The call-back mechanism—a conceptually simple way to parallelize user code
    3.2.5 Hiding parallelism—the connection set algebra
    3.2.6 The cell map abstraction
  3.3 Optimizing SPLIT for Blue Gene/L
    3.3.1 Porting SPLIT to Blue Gene/L
    3.3.2 A scaling problem in model setup
    3.3.3 Adding a barrier call to the SPLIT API
    3.3.4 A scaling problem during simulation
    3.3.5 Removing data structures with memory complexity O(nc)
    3.3.6 Implementation of a new protocol for specification of data logging
    3.3.7 Implementation of a distributed data logging mechanism for binary data

4 A scalable network model of the neocortex
  4.1 Results

5 An abstract model of neocortex
  5.1 Bayesian Confidence Propagating Neural Network
    5.1.1 Three implementations of the training phase
    5.1.2 Implementation details
  5.2 Results
    5.2.1 Running times for a mouse sized network
    5.2.2 Scaling of Hypercolumns
    5.2.3 Scaling of Connections
    5.2.4 A network model of record size

6 Discussion
  6.1 Trade-off between numbers and complexity
  6.2 Number of free parameters
  6.3 Cost of complex cell models in large scale simulations
  6.4 Mapping the model onto the Blue Gene/L torus

7 Conclusions

8 Acknowledgments


1 Introduction

Biologically detailed computational models of large-scale neuronal networks have now become feasible due to the development of increasingly powerful massively parallel supercomputers. In this report, we describe the methodology involved in simulations of a neuronal network model of layers II/III of the neocortex. The model uses Hodgkin-Huxley-style formalism and comprises millions of conductance-based multi-compartmental neurons and billions of synapses. Such model sizes correspond to the cortex of a small mammal. We also describe techniques involved in simulating even larger cortex-like models based on a more abstract formalism.

After some background and discussion of concepts relevant to the methodology of large and full scale neuronal network simulation, section 2 describes the kind of supercomputer architectures we have used for our simulations, while section 3 shows how we have adapted our software tool to these computers and to large problem sizes. Section 4 briefly describes the scalable cortical attractor memory network model used in our performance tests and shows examples of scaling results obtained. Section 5 describes and shows results for the more abstract cortex-like model. The report is concluded with a discussion in section 6 and conclusions in section 7.

1.1 Computational modeling in neuroscience

Computational modeling has come to neuroscience to stay. As modeling techniques become increasingly integrated with experimental neuroscience research, we will see more knowledge extracted from existing experimental data. Quantitative models help to explain experimental observations and seemingly unrelated phenomena. They assist in generating experimentally testable hypotheses and in selecting informative experiments.

Supercomputers are becoming ever more powerful. In this report we show that it is possible to simulate a substantial part of a small mammal's neocortex with fairly biologically detailed compartmental neuron models. This type of model is essential when linking cognitive functions to their underlying neurophysiological correlates. Such models are becoming an important aid in achieving a better understanding of normal function as well as psychiatric and neurological disorders.

1.2 Could computer science learn from neuroscience?

The nature of the neocortex is inherently parallel, since it is based upon a large number of small computing elements. Furthermore, neural systems, and neocortex in particular, are good examples of very advanced and robust, massively parallel, systems that perform much better on a great number of tasks than today's algorithms. This suggests that models of neural systems, especially their more abstract versions, are interesting to study also from a computer science perspective.

1.3 Level of abstraction

When modeling any physical system, there is a choice of which state variables and which parameters to include. In this sense, a model can focus on particular aspects of reality and model those aspects with a certain level of detail. For example, should we model the currents through individual species of ion channels in the cell membrane, or is a more abstract model, where we only consider the average state of activity in the cell, sufficient? The choice is mainly governed by the type of questions we want to be able to address.

In computational neuroscience, the term "biologically detailed" typically refers to models which employ conductance-based multi-compartmental model neurons with a number of different ion channel types represented. Abstract network models, on the other hand, are often based on connectionist type graded output units. In this case, a single unit in the model may even represent multiple neurons, like a cluster of cells in a cortical minicolumn.

When we consider the methodological problems involved in simulating very large networks, especially problems of parallelization, there are many similarities between the detailed and abstract modeling approaches. In this report, we therefore study both types of models and characterize some of the options for designing simulators with good scaling performance on cluster computers with a large number of nodes.

1.4 Why do we need large-scale models?

Real vertebrate neuronal networks typically comprise millions of neurons and billions of synaptic connections. They have a complex and intricate structure including a modular and layered layout at several levels, e.g., cortical minicolumns, macrocolumns, and areas.


It can be useful to model a single module or microcircuit in isolation, for example in relation to a correspondingly reduced experimental in vitro preparation such as a cortical slice. But such a module is, by definition, a component in a much larger network. In a real nervous system it is embedded in a mosaic of similar modules and receives afferent input from multiple other sources. We may therefore need to consider an entire network of modules.

However, computational studies have, partly due to limited computing resources, often focused on the cellular and small network level, e.g., the emergence and dynamics of local receptive fields in the primary visual cortex. Modeling at the network level poses specific challenges, but it is clear that global and dynamic network interactions must be taken into account in order to understand the functioning of a neuronal system.

Natural dynamics

It is indeed hopeless to understand a global brain network from modeling a local network of some hundred cells, like a cortical minicolumn, or from dramatically sub-sampling the global network by letting one model neuron represent an entire cortical column or area. One reason is that using small or sub-sampled network models leads to unnatural dynamics.

The network model is composed of model neurons tuned to match their real counterparts. These model neurons need sufficient synaptic input current to become activated. In a small network, model neurons are bound to have very few presynaptic neurons. Thus, it is necessary to exaggerate either connection probabilities or synaptic conductances, and most of the time both.

This results in a network with few and strong signals circulating, in stark contrast to the real cortical network, where many weak signals interact. Such differences tend to significantly distort the network dynamics. For example, artificial synchronization can easily arise, which is a problem especially since synchronization is one of the more important phenomena one might want to study. By modeling the full network with a one-to-one correspondence between real and model neurons, such problems are avoided.

Computo-pharmacology

In biologically detailed large-scale neuronal network models, the network effects of pharmacological manipulations can be studied in great detail. Does the addition of a drug alter the firing patterns of a particular neuron? What happens to the activity of a large ensemble of neurons? It might even be possible to study functional properties, such as memory function, and the enhancement of these. Other types of manipulations are possible as well, such as the addition or removal of different cell types, or the rearrangement of connectivity patterns.

From ion channels to phenomenology

Large-scale network models will help to bridge the gap between neuronal and synaptic phenomena and the phenomenology of global brain states and dynamics as observed in extracellular recordings, LFP, EEG, MEG, voltage-sensitive dyes, BOLD, etc. But large-scale models demand tremendous amounts of computing power.

In summary, computational neuroscience will benefit greatly from the current development of new affordable massively parallel computers, likely to enter the market in the next few years. This development constitutes an enabling technology for the modeling of complex and large-scale neuronal networks.

1.5 Parallel neuronal network simulators

The development of parallel simulation in computational neuroscience has been relatively slow. Today there are a few publicly available simulators, but they are far from as general, flexible, and documented as commonly used simulators like Neuron (Hines and Carnevale, 1997) and Genesis (Bower and Beeman, 1998). For Genesis there is PGenesis, and the development of a parallel version of Neuron has started. In addition, there exist simulators like NCS (Frye, 2005), NEST (Morrison et al., 2005), and our own parallelizing simulator SPLIT (Hammarlund and Ekeberg, 1998). However, they are in many ways still at the experimental and developmental stage.

In the following we will describe a parallel simulation study using the SPLIT simulator and one using a more recently developed special purpose simulator. But we start with a description of the computer architectures on which these simulators have been developed.

2 Cluster computers

The simulations in this report were mainly performed on two supercomputer installations: 1. the Dell Xeon cluster Lenngren at PDC, KTH, Stockholm, a rather conventional cluster computer containing 442 dual-processor nodes; 2. the Blue Gene/L supercomputer (hereafter denoted BG/L) at the Deep Computing Capacity on Demand Center, IBM, Rochester, USA.


BG/L (Gara et al., 2005) represents a new breed of cluster computers where the number of processors, instead of the computational performance of individual processors, is the key to higher performance. The power consumption of a processor, and thus the heat generated, increases quadratically with clock frequency. By using a lower frequency, the amount of heat generated decreases dramatically. Therefore, CPU chips can be mounted more densely and need less cooling equipment. A system built this way can be made to consume less material and less power for an equal computational performance.

While a BG/L system can contain up to 65536 processing nodes, with 2 processors per node, the largest simulations described in this report were performed on one BG/L rack containing 1024 nodes. Our simulations on Lenngren ran on a maximum of 256 nodes. The theoretical peak performance of a processor core in BG/L, running at 700 MHz, is 2.8 Gflops/s. This presumes using special compiler support for the double floating point unit in the PowerPC 440 core. Since we chose not to explore that in this project, we should instead count on a theoretical peak performance of 1.4 Gflops/s. This is to be compared to 6.8 Gflops/s for one processor in the Lenngren cluster. A node in the BG/L cluster has 512 MiB of memory compared to 8 GiB for a node in the Lenngren cluster. A set of BG/L nodes can be configured to run in co-processor mode, where one of the two processors in a node runs application code in most of the available memory (≈ 501 MiB) while the other processor manages communication, or in virtual node mode, where both processors run application code with half of the memory (≈ 249 MiB) allocated to each. Thus, for our purposes, a Lenngren node has 4.9 times more computing power than a BG/L node in virtual node mode and 16 times more memory.
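These ratios follow directly from the figures just quoted; as a check (both node types have two processors, so the factor of two cancels in the performance ratio):

\frac{2 \times 6.8\ \text{Gflops/s}}{2 \times 1.4\ \text{Gflops/s}} \approx 4.9, \qquad \frac{8\ \text{GiB}}{512\ \text{MiB}} = 16 .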

When running our code, we did not see any significant difference in performance between a set of nodes running in co-processor mode and half of that set running in virtual node mode. We therefore only used co-processor mode in cases where we needed more memory per node.

BG/L has three different inter-node communication networks: a 3D torus network for point-to-point communication, a tree-structured network for collective communication, and a barrier network. The nodes in the Lenngren cluster are connected with InfiniBand, which is a two-layered switch fabric.

3 The SPLIT simulator

3.1 Original design

The SPLIT simulator (Hammarlund and Ekeberg, 1998) was developed in the mid-90s with the aim to explore how to efficiently use the resources of various parallel computer architectures for neural simulations. Since different computer architectures benefit from different kinds of optimizations, the code uses programming techniques which isolate such optimizations and hardware-specific features from the rest of the program, and from the neural network model specification in particular. SPLIT has also served as a platform for experiments with communication algorithms.

SPLIT comes in the form of a C++ library which is linked into the user program. The SPLIT API is provided by an object of the class split, which is the only means of communicating with the library. The user program specifies the model using method calls on the split object. The user program is serial, and can be linked with a serial or parallel version of the library. Parallelism is thus completely hidden from the user. In the parallel case, the serial user program runs in a master process which communicates, through mechanisms internal to the SPLIT library, with a set of slave processes. On clusters, SPLIT uses PVM or MPI.

The library exploits data locality for better cache-based performance. In order to gain performance on vector architectures, state variables are stored as sequences. It uses techniques such as adjacency lists for compact representation of projections and AER (Address Event Representation; Bailey and Hammerstrom (1988)) for spike events.

Perhaps the most interesting concept in SPLIT is its asynchronous design: on a parallel architecture, each slave process has its own simulation clock which runs asynchronously with the other slaves. Any pair of slaves only needs to communicate at intervals determined by the smallest axonal delay in connections crossing from one slave to the other.

The cells in a neural model can be distributed arbitrarily over the set of slaves. This gives great freedom in optimizing communication so that densely connected neurons reside on the same CPU and so that axonal delays between pairs of slaves are maximized. The asynchronous design, where a slave process does not need to communicate with every other slave at each time step, then gives two benefits: 1. by communicating less often, the communication overhead is reduced; 2. by allowing slave processes to run out of phase, the amount of time waiting for communication is decreased.
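To make the role of the axonal delays concrete, the small sketch below shows how a communication interval for a pair of slaves could be derived from the smallest axonal delay among the connections crossing between them. The data structure and function are our own illustration, not part of the SPLIT API.

#include <algorithm>
#include <vector>

// One connection crossing from a neuron on slave A to a neuron on slave B
// (hypothetical helper type, for illustration only).
struct CrossingConnection {
  double axonal_delay;   // seconds
};

// A spike emitted on one slave cannot influence the other slave earlier than
// the smallest crossing delay, so spike data can be exchanged in batches
// covering floor(min_delay / dt) time steps instead of every step.
int communication_interval(const std::vector<CrossingConnection>& crossing,
                           double dt)
{
  if (crossing.empty())
    return 1;                                   // degenerate case: no constraint known
  double min_delay = crossing.front().axonal_delay;
  for (const CrossingConnection& c : crossing)
    min_delay = std::min(min_delay, c.axonal_delay);
  return std::max(1, static_cast<int>(min_delay / dt));
}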


3.2 Optimizing SPLIT for cluster computers

The recent work on large-scale models, described in this report, has exposed a set of bottlenecks in the original SPLIT library. This is not unexpected, since current model sizes were quite unthinkable at the time when the SPLIT library was conceived. In addition, a set of bugs has been discovered and fixed. We report below on the major changes to the code.

3.2.1 Stricter adherence to the MPI standard

The SPLIT library makes extensive use of the point-to-point communication operations MPI_Send and MPI_Recv. On the architectures on which the original SPLIT library was developed, smaller messages were buffered and the MPI_Send call did not block. The original author used this implementation-dependent assumption in a few instances in the code. A typical case is one where a many-to-many communication is performed by first posting all sends and then all receives. On several modern architectures, this assumption does not hold for the message sizes used in SPLIT. Therefore, and in order to remove uncertainties regarding code portability, we have changed the code in all instances so that it strictly adheres to the MPI standard. Thus, MPI_Send is not assumed to return before a corresponding MPI_Recv has been posted. In one case, where the use of MPI_Alltoall was precluded for reasons having to do with code and class structure, this meant implementing the communication scheme shown in Figure 1.

If the communication network is not overloaded, the new scheme still has time complexity O(p), where p is the number of processes. While there certainly are schemes with better time complexity (Tam and Wang, 2000; Gross and Yellen, 1998), the simplicity of the chosen one is better from the perspective of code maintainability, but it may become problematic for setup time when the number of processes approaches hundreds of thousands.
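As a concrete illustration of the scheme in Figure 1 (shown below), the following is a minimal sketch using plain MPI_Send and MPI_Recv. It follows the rank-ordered phases of the new scheme; the integer payload and tag are made up for the example, and the process's own contribution is copied locally rather than sent to itself.

#include <mpi.h>
#include <vector>

// Deadlock-free exchange that does not rely on MPI_Send buffering small
// messages: lower-ranked partners are received from before we send to them,
// so every send is eventually matched by a posted receive.
void exchange_setup_data(const std::vector<int>& send_buf,
                         std::vector<std::vector<int>>& recv_bufs,
                         MPI_Comm comm)
{
  int rank, nprocs;
  MPI_Comm_rank(comm, &rank);
  MPI_Comm_size(comm, &nprocs);
  const int count = static_cast<int>(send_buf.size());
  recv_bufs.assign(nprocs, std::vector<int>(count));

  for (int i = 0; i < rank; ++i)               // receive from lower ranks
    MPI_Recv(recv_bufs[i].data(), count, MPI_INT, i, 0, comm, MPI_STATUS_IGNORE);
  for (int i = rank + 1; i < nprocs; ++i)      // send to higher ranks
    MPI_Send(send_buf.data(), count, MPI_INT, i, 0, comm);
  for (int i = 0; i < rank; ++i)               // send to lower ranks
    MPI_Send(send_buf.data(), count, MPI_INT, i, 0, comm);
  for (int i = rank + 1; i < nprocs; ++i)      // receive from higher ranks
    MPI_Recv(recv_bufs[i].data(), count, MPI_INT, i, 0, comm, MPI_STATUS_IGNORE);

  recv_bufs[rank] = send_buf;                  // own contribution: local copy
}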

3.2.2 Identifying scaling problems—memory and time complexity O(ns)

In the original design, synapses are created and their parameters set from the master process. Associated with this is a set of data structures with size proportional to the number of synapses, ns, giving memory complexity O(ns). In models where ns approaches hundreds of millions, the required amount of memory comes close to what is available on a single processing node on many architectures—we have a memory scaling problem.

// Original code:

for (i = 0; i < n_processes; ++i)
  // send to process i

for (i = 0; i < n_processes; ++i)
  // receive from process i

// New code:

for (i = 0; i < local_rank; ++i)
  // receive from process i

for (i = local_rank; i < n_processes; ++i)
  // send to process i

for (i = 0; i < local_rank; ++i)
  // send to process i

for (i = local_rank; i < n_processes; ++i)
  // receive from process i

Figure 1: Old and new version of the simple communication scheme for all-to-all communication during setup.

When models include billions of synapses, it becomes necessary to remove all such data structures. This includes data structures for storage of synaptic parameters before distribution to the slaves, and various kinds of transformation tables used to quickly map distributed vector entities, for example synaptic conductance, from an index on the master node (global index) to the corresponding process rank and index within that process (local index); see the discussion below.

Remember also that, since the original SPLIT library confines all model specification to the master process, model specification is serial. We therefore also have a scaling problem in the time domain, since the time complexity for synapse setup becomes O(ns). This was clearly noticeable in our simulations, where synapse setup time soon began to dominate setup time for larger models.

3.2.3 Distributing creation of synapses and setting of their parameters to slave processes

An obvious solution to the memory and time complexity problems described above is to distribute the creation of synapses and the setting of their parameters to the slave processes: setup code would gain speedup from parallelization, and data structures proportional to the number of synapses would only need to cover synapses relevant to the neurons represented on a particular slave. This means that we need a conceptually simple way to run user code in the slave processes. Below, we will describe a first attempt at supporting this. Then we will describe a way to shield the user code from issues of parallelization which arise due to this change. In our tests, these changes reduced model setup time from hours to minutes for the larger models. Most importantly, they changed the time complexity of setup from O(ns) to O(ns/p), where ns is the number of synapses and p the number of slave processes.

3.2.4 The call-back mechanism—a conceptually simple way to parallelize user code

In the original design, the user program, running in the master process, specifies all parameters and initiates all stages of simulation through predefined functions in the SPLIT API. In the new design, a call-back mechanism introduces the possibility of executing arbitrary code at the slaves, by allowing the user to define functions which can be called in all slave processes at the command of the user program. Using such functions, callbacks, the user can freely mix serial and parallel sections in the main program.

In order not to deal directly with slave processes, a neural network model is conceived as consisting of a number of parts, each governing the update of the state variables of a disjoint subset of neurons in the model. Such a part is represented as a simulation unit, which, in the parallel case, is handled by one slave process.

A simulation unit object, split_unit, provides a subset of the SPLIT API normally provided by the split object. A callback receives a pointer to the relevant split_unit object as argument. The user program can also supply an arbitrary amount of data in the callback call, which is then distributed by the call-back mechanism to all slaves.
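To give a flavor of how this looks from the user's point of view, here is a hypothetical sketch; the split_unit stand-in, the method names, and the registration call in the closing comment are illustrative and do not reproduce the actual SPLIT signatures.

#include <string>

// Stand-in for the per-slave simulation unit described in the text; the real
// split_unit class exposes a subset of the SPLIT API, which is not shown here.
struct split_unit {
  void set_synapse_parameter(const std::string& /*name*/, double /*value*/) {}
};

// Data supplied by the serial user program and distributed by the call-back
// mechanism to all slaves.
struct UserData {
  double excitatory_gmax;
};

// The callback itself: runs in every slave process and only touches the
// neurons and synapses governed by that slave's simulation unit.
void setup_local_synapses(split_unit& unit, const UserData& data)
{
  unit.set_synapse_parameter("g_max", data.excitatory_gmax);   // parallel section
}

// In the serial main program (master process), something like:
//   UserData data = {1.2e-9};
//   split->callback(setup_local_synapses, data);   // executed in all slaves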

3.2.5 Hiding parallelism—the connection set algebra

According to SPLIT's design, parallelism should be hidden from the user, both conceptually and practically. A user program can be linked arbitrarily with either a serial or a parallel version of the SPLIT library. How can we, when synapse setup is distributed, maintain a clean separation between issues of the model itself and issues of parallelism, so that the user code only needs to be concerned with the model? In particular, we don't want the user code to deal with how cells in the model are distributed over the set of slave processes.

The solution to this problem involves the connection set abstraction, together with the simulation unit abstraction and call-back mechanism described above. The synaptic connectivity is defined by the user in a callback, in a way which hides parallelism by using expressions from a connection set algebra. Since this is a novel construction, it will now be described in some detail.

A connection set object in essence represents an infinite connection matrix. In practice, it is used to represent a type of connection structure. The connection set object obtains the sets of pre- and postsynaptic neurons it governs through the simulation unit object. Conceptually, this acts as cutting out the relevant piece of the infinitely large connection matrix. Subsequently, the connection set object can work as an iterator over existing connections and can be used repeatedly in the user program to create synapses and set their parameters.

A basic set of connection sets is provided with the SPLIT library. This includes full connectivity, diagonal structure, uniformly random connectivity (the existence of a connection determined by a fixed probability), and Gaussian structure. The user can implement his own connection structures by inheritance, but a much simpler and more expressive method is provided in the connection set algebra, which can be conceived as a set of operators on infinite connection matrices. As a simple example, consider the case where we want a 70% probability that a neuron connects to any other neuron in a population, but no connection to itself. This can now be expressed as:

random_uniform (0.7) - diag ()

where random_uniform and diag are connection set objects.

With the connection set abstraction, the user can focus on the connectivity structure itself, while details such as which subset of neurons the code is working with are completely hidden. Note also how the simulation unit abstraction and call-back mechanism aid in hiding parallelism: the user does not need to deal with how many simulation units there are, or how neurons are distributed between them. These novel mechanisms have low overhead and are cache efficient since they do not depend on large tables in main memory.
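The following self-contained sketch shows one way such an algebra can be realized with a small class hierarchy and operator overloading. The class names, the hash-based random draw, and the use of shared pointers are our own illustrative choices, not SPLIT's implementation; only the final expression mirrors the text.

#include <memory>

// A connection set conceptually answers one question about the (infinite)
// connection matrix: does the connection (pre, post) exist?
struct ConnectionSet {
  virtual ~ConnectionSet() = default;
  virtual bool exists(long pre, long post) const = 0;
};

// Uniformly random connectivity with a fixed connection probability p.  The
// draw is a deterministic hash of (pre, post) so that the same pair always
// gives the same answer, independent of which slave asks or in which order.
struct RandomUniform : ConnectionSet {
  double p;
  explicit RandomUniform(double p_) : p(p_) {}
  bool exists(long pre, long post) const override {
    unsigned long long h = 1469598103934665603ull;              // FNV-1a hash
    h ^= static_cast<unsigned long long>(pre);  h *= 1099511628211ull;
    h ^= static_cast<unsigned long long>(post); h *= 1099511628211ull;
    return static_cast<double>(h % 1000000) / 1000000.0 < p;
  }
};

// Diagonal structure: a neuron connected to itself.
struct Diag : ConnectionSet {
  bool exists(long pre, long post) const override { return pre == post; }
};

// Set difference: connections in a that are not in b.
struct Difference : ConnectionSet {
  std::shared_ptr<ConnectionSet> a, b;
  Difference(std::shared_ptr<ConnectionSet> a_, std::shared_ptr<ConnectionSet> b_)
    : a(std::move(a_)), b(std::move(b_)) {}
  bool exists(long pre, long post) const override {
    return a->exists(pre, post) && !b->exists(pre, post);
  }
};

inline std::shared_ptr<ConnectionSet> operator-(std::shared_ptr<ConnectionSet> a,
                                                std::shared_ptr<ConnectionSet> b) {
  return std::make_shared<Difference>(std::move(a), std::move(b));
}

inline std::shared_ptr<ConnectionSet> random_uniform(double p) {
  return std::make_shared<RandomUniform>(p);
}
inline std::shared_ptr<ConnectionSet> diag() { return std::make_shared<Diag>(); }

// Usage, mirroring the expression in the text:
//   auto no_self_connections = random_uniform(0.7) - diag();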


3.2.6 The cell map abstraction

To specify synapses, communicate spike events, etc., neurons need to be identified. An axon should connect a certain presynaptic neuron to a certain postsynaptic neuron. A spike is destined for a certain neuron. As mentioned above, SPLIT uses global and local indexes to refer to neurons. This design was influenced by two factors: 1. The best way to identify a neuron is task dependent. When specifying synapses, we most likely want to refer to neurons as members of a population in the model being simulated. However, when communicating spikes, it is more practical to refer to neurons as members of population fragments located in particular slave processes. Thus, we can reduce complexity if we are allowed to use multiple ways to address neurons. 2. SPLIT is designed to allow for arbitrary mappings of neurons to slave processes. Thus, we need a granularity at the neuron level when determining to which slave a neuron belongs. The global index of a neuron is used to locate it within the context of the entire population, while its local index locates it within a particular slave. Using global indexes, we can avoid mixing the complexity of the synaptic connection structure with the complexity of the mapping of the model onto the slaves.

As indicated above, in the original code, the master process is responsible for transforming between global and local indexes. When synapse setup is distributed, the slaves need to be made aware of how the cells in the model are distributed over the slave processes. To this end, we have devised an abstraction, cell map, which handles index transformations both in the direction from global to local index and the reverse. During setup, when the user program supplies the mapping from neurons to slaves, the cell map object takes this information and distributes itself to the slaves. In each slave, the local cell map object then builds the tables it needs for efficient index transformations. The ultimate goal is that the cell map object will handle all functions associated with the mapping from neurons to slaves. This has not yet been fully implemented.
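A minimal sketch of the index bookkeeping involved is given below; the table layout is an illustrative choice, and where the real cell map distributes itself so that each slave builds only the tables it needs, this sketch simply keeps the whole mapping in one object.

#include <unordered_map>
#include <utility>
#include <vector>

// Translates between a neuron's global index (its position within the entire
// population) and its (slave rank, local index) pair.
class CellMap {
public:
  void add(long global_index, int rank, long local_index) {
    to_local_[global_index] = std::make_pair(rank, local_index);
    if (static_cast<int>(to_global_.size()) <= rank)
      to_global_.resize(rank + 1);
    if (static_cast<long>(to_global_[rank].size()) <= local_index)
      to_global_[rank].resize(local_index + 1, -1);
    to_global_[rank][local_index] = global_index;
  }

  // global -> (rank, local): used, e.g., when creating synapses from a
  // population-level specification.
  std::pair<int, long> to_local(long global_index) const {
    return to_local_.at(global_index);
  }

  // (rank, local) -> global: used, e.g., when interpreting spikes received
  // from a particular slave.
  long to_global(int rank, long local_index) const {
    return to_global_[rank][local_index];
  }

private:
  std::unordered_map<long, std::pair<int, long>> to_local_;
  std::vector<std::vector<long>> to_global_;
};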

3.3 Optimizing SPLIT for Blue Gene/L

In this section, we will describe our experiences with the BG/L architecture and the changes we made to SPLIT in order to achieve good scaling behavior.

Before running on the BG/L architecture, our experience was based on running the updated version of SPLIT, described above, on cluster computers at PDC, KTH. Although scaling results on the Lenngren cluster were slightly unpredictable, likely due to the communication network being shared with other applications in a queue system, they appeared good in that setup time seemed approximately constant for runs on different numbers of nodes, while speedup for simulation time was roughly linear.

BG/L represented a challenge in the sense that we now had to start imagining simulations using tens of thousands of processors, while our previous practical experience was limited to less than 200. The other main consideration was the smaller amount of memory available to each processor.

3.3.1 Porting SPLIT to Blue Gene/L

The port of SPLIT to BG/L was trivial. It basically consisted of finding the correct arguments for the GNU configure script (compiler flags, library locations, etc.). This is noteworthy, considering that BG/L is a novel computer architecture, and it has two major reasons: 1. The software stack on the BG/L nodes is based on GNU software, for example GLIBC, and provides a Linux-like run-time environment. 2. The BG/L-specific inter-node communication hardware is abstracted using the MPI API.

3.3.2 A scaling problem in model setup

The first scaling results for a model with 330000 neurons and 161 million synapses showed that setup time scaled linearly with the number of processors used on BG/L (see upper trace in Figure 2). This was unexpected, since our results on Lenngren so far had indicated a constant setup time. We then performed runs on Lenngren with up to 384 processors and could indeed confirm a linear dependence on the number of MPI processes (Figure 2, third trace). Because SPLIT slave processes run asynchronously, it was difficult to pin down exactly which code region was responsible, but initial debugging indicated that the setup of cell parameters was the culprit. Since this part of model setup is computationally intensive and involves much communication between master and slaves, and had already been identified as a future scaling problem due to its time complexity O(nc), it was natural to distribute cell setup as we had previously done with synapse setup. This resulted in the second trace in Figure 2. The linear scaling still remained, however.

3.3.3 Adding a barrier call to the SPLIT API

In order to isolate the problem, we added a barrier call to the SPLIT API, which synchronizes processes using MPI_Barrier:

split->barrier ()


[Figure: setup time in seconds versus number of MPI tasks; traces: BG/L, BG/L dist cell param, Dell cluster, BG/L comm scheme.]

Figure 2: Scaling of setup time for a model with 330000 neurons. Upper trace: initial result on BG/L. 2nd trace: scaling after distributing setup of cell parameters to the slave processes. 3rd trace: initial result on Dell cluster. 4th trace: final result, after correcting a problem in the communication scheme implementation. Setup time now decreases with increasing number of processors.

The implementation involved extending the internal communication protocols in SPLIT, but the effort was well motivated by its utility for debugging. By using the barrier call to synchronize processes between each call to the SPLIT API, the scaling problem could be located to one particular member function (map). This led us to discover that the implementation of the communication scheme for connecting spike communication objects, a simplified version of which is shown in Figure 1, did not work as intended, but introduced unintended dependencies between sender-receiver pairs. After correcting this, we obtained the scaling curve shown in the fourth trace of Figure 2.
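The debugging technique can be sketched as follows; the timing macro is our own illustration, and only split->barrier() corresponds to the API call described above.

#include <mpi.h>
#include <cstdio>

// Hypothetical debugging helper: synchronizing all processes before and after
// a SPLIT API call makes the measured wall-clock time attributable to that
// call alone, instead of being smeared out over the asynchronous slaves.
#define TIMED_SPLIT_CALL(split, call)                              \
  do {                                                             \
    (split)->barrier();                                            \
    double t0_ = MPI_Wtime();                                      \
    (split)->call;                                                 \
    (split)->barrier();                                            \
    std::printf(#call ": %.3f s\n", MPI_Wtime() - t0_);            \
  } while (0)

// Usage (map was the member function where the problem was located):
//   TIMED_SPLIT_CALL(split, map());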

3.3.4 A scaling problem during simulation

The upper trace in Figure 3 shows timings for one second of simulated time in a model with 330000 neurons and 161 million synapses. Note that the addition of more processors does not lead to decreased simulation time for simulations on more than 767 processors. In order to pin down this problem, we linked SPLIT with an MPI trace library provided by IBM. In addition to giving detailed logs of communication events, this library provides various kinds of statistics of communication during a simulation. In a histogram over the sizes of communication packets, and the blocking time associated with those messages, we could observe that packets of size 80 bytes were over-represented in number and responsible for a large fraction of simulation time. With this clue, we could identify the cause of the scaling problem: the normal state of asynchronous operation of the slaves was repeatedly interrupted by synchronization caused by the code which collects membrane potential data for logging to file at the master process. By decreasing the frequency of communication of such logging data, we could verify that this was indeed the problem (lower trace in Figure 3).

3.3.5 Removing data structures with memory complexity O(nc)

With the above scaling problems resolved, we could now run models consisting of up to 4 million neurons. At this point, we encountered a memory scaling problem. The SPLIT library still contained many data structures in the master process proportional to the number of neurons, nc. This presented a barrier for the absolute size of model we could handle.


[Figure: simulation time in seconds versus number of MPI tasks; traces: Initial, No central logging.]

Figure 3: Scaling of simulation time for a model with 330000 neurons. Upper trace: initial result; no speedup above 767 MPI tasks. Lower trace: final result, after decreasing the frequency of communication associated with central logging of membrane potentials.

Some of these structures were related to index transformations, as described above, and could be eliminated, but this was only part of the problem, as we will see in section 3.3.6.

3.3.6 Implementation of a new protocol for specification of data logging

A thorough review of data structures in the master process with regard to memory complexity showed that the largest amount of memory was consumed by the mechanisms for specifying which model state variables should be logged to file. The reason was that this mechanism operated at the level of individual state variables. We therefore implemented a novel mechanism where logged variables can be specified at the level of sets of variables. This change, which included extensions to the internal communication protocol in SPLIT, together with the changes described in the previous section, substantially improved memory scaling. Without further changes, the library is now capable of handling models consisting of tens of millions of cells and tens of billions of synapses.

3.3.7 Implementation of a distributed data logging mechanism for binary data

Since we are interested in synthesizing various measures of activity within the cortical sheet, like EEG, we need an efficient way to log large amounts of data to file. For example, one of our experiments, where we simulate 20 seconds of activity of a model with 82500 neurons, generates 10 billion data points.

There are three problems associated with this: 1. Since all previous mechanisms for data logging were based on the master process collecting data from the slaves, the concentration of data at the master process, and the serialization that this implies, would lead to overloading of the master, and time would be lost due to slave processes waiting for communication. 2. The high rate of data to be collected from the slaves would disturb the normal asynchronous state of the slaves, as described above. 3. The previous mechanisms for data logging used a text file format, which would be too voluminous.

A new data logging mechanism was therefore implemented, which writes binary data to the BG/L parallel file system at each slave. Data can be logged for an arbitrary number of state variables for subsets of neurons and to files specified by the user program. File names are suffixed with the slave rank. We have written tools to locate and extract specific traces of data points from these files, and to convert them for further processing in tools like MATLAB.
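A minimal sketch of the per-slave binary logging idea, using plain C stdio; the class, the record layout (time stamp followed by the values), and the exact naming are illustrative, although the convention of suffixing the user-chosen file name with the slave rank follows the text.

#include <cstdio>
#include <string>
#include <vector>

// Each slave writes its own binary log file, named <base>.<rank>, so no data
// is funneled through the master and the slaves stay asynchronous.
class DistributedLogger {
public:
  DistributedLogger(const std::string& base, int rank) {
    file_ = std::fopen((base + "." + std::to_string(rank)).c_str(), "wb");
  }
  ~DistributedLogger() { if (file_) std::fclose(file_); }

  // Append one record: simulation time followed by the logged state variables
  // (e.g. membrane potentials) of the locally simulated subset of neurons.
  void log(double time, const std::vector<float>& values) {
    if (!file_) return;
    std::fwrite(&time, sizeof(double), 1, file_);
    std::fwrite(values.data(), sizeof(float), values.size(), file_);
  }

private:
  std::FILE* file_ = nullptr;
};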


4 A scalable network model of the neocortex

The model described in this section is the result of an integration of functional constraints given by a theoretical view of the neocortex as an associative attractor memory network (a top-down strategy; see Churchland and Sejnowski (1992), ch. 1) and empirical constraints given by cortical anatomy and physiology (bottom-up).

The view of the cortex as an attractor network has its origin more than fifty years back in Hebb's theories of cell assemblies (see, e.g., Fuster (1995) for a review). It has been mathematically instantiated in the form of the Willshaw-Palm (Willshaw and Longuet-Higgins, 1970; Palm, 1982) and Little-Hopfield models (Hopfield, 1982) and has subsequently been elaborated on and analyzed in great detail (Amit, 1989; Hertz et al., 1991).

The olfactory cortex (Haberly and Bower, 1989) as well as the hippocampal CA3 field (Treves and Rolls, 1994) have previously been perceived and modeled as prototypical neuronal auto-associative attractor memory networks. More recently, sustained activity in an attractor memory of a similar kind has been proposed to underlie prefrontal working memory, although in this case the attractor state itself, and not the connectivity matrix, is assumed to hold the memory (Compte et al., 2000).

Our view of cortical associative memory has been expressed in the form of an abstract neural network model (Lansner et al., 2003; Sandberg et al., 2002, 2003). An implementation of this model is described in section 5 of this report. The computational units in the model correspond to the anatomical minicolumns suggested by Mountcastle (1978) and described by Peters and Sethares (1991), and to the orientation minicolumns in the primary visual cortex.

In a previous series of papers we used the ideas from the abstract model in a biologically detailed model of a memory network architecture in layers II/III of the neocortex (Fransen and Lansner, 1995, 1998; Lansner and Fransen, 1992). We found that the model could perform critical cell assembly operations such as pattern retrieval, completion, and rivalry.

However, the abstract framework also suggests modularity in terms of a hypercolumnar organization of a kind similar to that described by Hubel and Wiesel (Hubel and Wiesel, 1977) for the primary visual cortex, that is, bundles of about 100-200 minicolumns forming one hypercolumn. In the present model, we have continued the development of the biologically detailed model by adding a hypercolumnar structure.

Figure 4: Cartoon of a network with 12 hypercolumns of 8 minicolumns each. Each hypercolumn has 8 circularly arranged minicolumns, each comprising 30 densely connected model pyramidal cells. The small disc in each minicolumn represents 2 RSNP cells that receive input from distant pyramidal cells and inhibit local ones. The large disc at the center of each hypercolumn represents a population of basket cells. The negative feedback circuitry between pyramidal and basket cells is indicated. An excitatory long-range minicolumn-to-minicolumn connection originates in pyramidal cells in the presynaptic minicolumn and targets pyramidal cells in the postsynaptic minicolumn. An inhibitory minicolumn-to-minicolumn connection originates in the same way but targets the RSNP cells in the postsynaptic minicolumn, which, in turn, provide di-synaptic inhibition onto the pyramidal cells in the receiving minicolumn.

We have also added a second type of interneuron, and we model the hypercolumn at full scale, meaning that there is a one-to-one correspondence between model neurons and real neurons of the selected types in layers II/III of the minicolumn.

As in the abstract model, each hypercolumn operates like a soft winner-take-all module in which activity is normalized among the minicolumns, such that the sum of activity is kept approximately constant. We propose that the lateral inhibition mediated by basket cells may achieve this normalization in cortex. Normalization models of a similar kind have recently been proposed for the primary visual cortex (Blakeslee and McCourt, 2004).

Since our model is built using the connection set algebra tool described in section 3.2.5, it is easily scalable.


Figure 5: Spatial layout of minicolumns in a network with 9 hypercolumns (centers marked with a cross), each comprising 100 minicolumns. Each small disc represents one minicolumn. Minicolumns with the same color belong to the same hypercolumn.

That is, we can easily adjust structure parameters such as the number of minicolumns per hypercolumn, or the total number of hypercolumns in the network. Our model contains scaling equations for connection probabilities which ensure that different cell types see similar currents in an attractor state for different model sizes and structures. For natural structure parameters, connection probabilities correspond to empirical estimates of cortical connectivity (see Lundqvist et al. (2006) for details). Figure 4 shows, in an abstract fashion, the structure of a version of our network model consisting of 12 hypercolumns with 8 minicolumns each (all simulations in this report use 100 minicolumns per hypercolumn except where noted).

The cell models used in the simulations described here are conductance-based and multi-compartmental, of intermediate complexity. They are to a large extent similar to those developed previously (Fransen and Lansner, 1995, 1998). The cells included are layers II/III pyramidal cells and two different types of inhibitory interneurons, assumed to correspond to horizontally projecting basket cells and vertically projecting double bouquet and/or bipolar cells (Douglas and Martin, 2004; Kawaguchi and Kubota, 1993; Markram et al., 2004). Following Kawaguchi (1995), we will here refer to the latter type as regular spiking non-pyramidal or RSNP cells.

The pyramidal cells are modeled with six compartments and the inhibitory interneurons with three.

[Figure: membrane potential (V) versus time (s).]

Figure 6: Model pyramidal neuron switching from DOWN to UP state and back. The mean membrane potential becomes elevated at t = 2.5 s due to network activity when the neuron starts participating in an active attractor state. The attractor state was activated through simulated electrical stimulation of other member neurons. (The varying height of action potentials is an artefact due to a large plotting time step.)

Voltage-dependent ion channels for Na+, K+, and Ca2+, as well as Ca2+-dependent K+ channels, are modeled using Hodgkin-Huxley formalism.

The excitatory synapses between pyramidal cells and between pyramidal and RSNP cells are of a mixed kainate/AMPA and NMDA type, while the pyramidal to basket cell synapse only has a kainate/AMPA component and the inhibitory synapses are of GABA-A type.

The network connectivity is set up to store a number of memory patterns (attractor states) such as would have resulted from long-term plasticity using a Hebbian learning rule (Sandberg et al., 2002). The synaptic plasticity actually included in the model is restricted to fast synaptic depression of pyramidal-to-pyramidal synapses.

Axonal delays are determined by using a geometric model of the cortical sheet, grown using cellular automata. Figure 5 shows the spatial layout of minicolumns in a network with 9 hypercolumns. For further details see Lundqvist et al. (2006).

The raster plot in Figure 7 shows a typical example of network dynamics in our simulations. The network spends a few hundreds of milliseconds in each attractor state. For simplicity, the memory patterns stored in this network are orthogonal: every minicolumn only belongs to a single pattern.


[Figure: raster plot, cell index (RSNP, pyramid, basket) versus time (s).]

Figure 7: Raster plot for a network with 9 hypercolumns and 8 minicolumns per hypercolumn. The lower portion of the raster shows activity in all RSNP cells, the mid portion shows pyramidal cells, and the upper portion basket cells. The long-range pyramidal-pyramidal and pyramidal-RSNP synapses store orthogonal memories. The network can be seen jumping spontaneously between these memory states.

Since patterns compete within each hypercolumn, this gives rise to the structure of regular bands in the figure: only one minicolumn wins in each hypercolumn. Depending on the level of background drive, attractor states can either be attained spontaneously (as in the figure) or by afferent input to neuronal members of the memory pattern. The average spike frequency for pyramidal cells participating in an active memory state is around 6 Hz, compared to a baseline firing rate of 0.2 Hz. Competition between attractors is resolved within 30-40 ms.

A striking feature of our model is that this dynamics, which corresponds well to data on UP states (see Figure 6) seen in vivo (Anderson et al., 2001) and in slice preparations (Cossart et al., 2003), is preserved over all model sizes and structures we have yet simulated. That is, even for the largest simulation described in this report, a global memory pattern can be correctly recalled within tens of milliseconds.

4.1 Results

When considering how to use supercomputing power to boost the performance of a neuronal network simulator, there are two main directions in which to trim the software: on one hand, we can use the increased capacity for computing to run larger models; on the other, we can use it to compute faster. In this work we have primarily focused on the first possibility. After the adaptations of our software tool SPLIT described in section 3, we obtained generally good scaling performance, particularly for large models where computation time at the nodes dominates overhead due to communication.

Figure 8 shows a scaling diagram for a version of the model described in section 4 containing 4 million cells and 2 billion synapses. Real pyramidal cells have on the order of 10000 synapses. The number of synapses in our model is lower than this due to the fact that we only consider connections within one layer of a local area, and that the inter-minicolumn connectivity contains a small number of memory patterns due to their orthogonality. Note, though, that the number of active synapses is the same as in a network with a more natural connection matrix.

Setup and simulation wall-clock times for 1 second of simulated activity are shown for various numbers of slave processes. The model comprises 1225 hypercolumns with 100 minicolumns per hypercolumn. Note how, even for a model of this size, setup time decreases with increasing number of processes, in accordance with the earlier results described in section 3.3.3.

Note also that simulation time is roughly inversely proportional to the number of processes. This is shown more clearly in the speedup diagram in Figure 9.


[Figure: time in seconds versus number of MPI tasks; traces: Setup, Simulate, Total.]

Figure 8: Scaling diagram for the model with 4 million cells and 2 billion synapses. Wall-clock setup and simulation time for 1 second of simulated activity.

[Figure: speedup compared to 511 tasks, T(511)/T(P), versus number of MPI tasks P; traces: ideal, actual.]

Figure 9: Speedup for the model with 4 million cells and 2 billion synapses.


# procs.    Setup    Simulation
1024        2272     9489
2048        1582     4749

Table 1: Setup and simulation times in seconds for a model with 8 million cells and 4 billion synapses on 1024 and 2048 BG/L processors.

For practical reasons, the speedup in Figure 9 is normalized relative to the data point for 511 processes rather than 1. The lower trace shows that scaling is nearly linear.

The largest model we have simulated so far is a network consisting of 2401 hypercolumns (see Table 1 for setup and simulation times on 1024 and 2048 processors of one full BG/L rack). This model comprises 8 million neurons and 4 billion synapses.

Even for the largest models, the brain-like dynamics we have observed in the network (see section 4 and Lundqvist et al. (2006)) is preserved, although some phenomena show quantitative changes which we report on separately (Djurfeldt et al., 2006).

5 An abstract model of neocortex

Here we present a connectionist abstract model of cortex, and its implementation, which provides functional constraints for the biologically detailed model described in section 4. Both the computational and memory requirements for this model are smaller than for its more biologically detailed counterpart. Instances of this model that are a few orders of magnitude larger than the biologically detailed version can run in close to real time. This type of connectionist model is of interest for exploring, testing, and verifying theories on how the neocortex processes information in cases which are difficult to investigate with a computationally more demanding model.

As described in section 4, the abstract model (Johansson and Lansner, 2004a,b) on which the biologically detailed model is based makes the assumption that a cortical minicolumn is the smallest functional unit in cortex. Each such minicolumn is modeled by a single unit in the abstract network. One hundred such units are grouped into a hypercolumn. The functional role of the hypercolumn is to normalize the activity in its constitutive units to one. The average neuron in the mammalian brain has approximately 8000 synapses, and a cortical minicolumn has on average 100 neurons (Johansson and Lansner, 2004b). With the assumption that five synapses are required to link two minicolumns together, the average minicolumn potentially has 120000 incoming connections (Johansson and Lansner, 2004a).

In this report we study networks of randomly connected units. This type of connectivity is very general and is often used in attractor neural networks. In turn, the dynamics of attractor neural networks are hypothesized to be a first and very rough approximation of the cortical dynamics (Rolls and Treves, 1998; Palm, 1982).

In the abstract model, the hypercolumn has a computational grain size that maps very well onto the computational resources in a cluster computer. The hypercolumn has a constant number of units and connections, and in turn constant computational requirements.

5.1 Bayesian Confidence Propagating Neural Network

The units and synapses of the abstract model are implemented with a Bayesian Confidence Propagating Neural Network (BCPNN) (Lansner and Ekeberg, 1989; Lansner and Holst, 1996; Sandberg et al., 2002). This type of neural network can be used to implement both feed-forward classifiers and attractor memory. In the latter case, it is similar to a Hopfield network, the main difference being the local structures imposed by the hypercolumns. It has a Hebbian type of learning rule, which is local in both space and time (only the activity of the pre- and postsynaptic units at one particular moment is needed to update the weights) and can therefore be efficiently parallelized. Further, the network can be used with both unary-coded activity (spiking activity) and real-valued activity, o ∈ (0, 1). In the current implementation we use spiking activity, which allows for efficient address event (AER) communication (Bailey and Hammerstrom, 1988; Mortara and Vittoz, 1994; Deiss et al., 1999; Mattia and Giudice, 2000). The network has N units grouped into H hypercolumns with U_h units in each. Here, h is the index of a particular hypercolumn and Q_h is the set of all units belonging to hypercolumn h.

The computations of the network can be divided into two parts: training (Figures 10-12) and retrieval (equations 1-3). In the training phase the weights, w_ij, and biases, β_j, are computed. In the retrieval phase the units' potentials and activities, o_i, are updated. The training phase is presented for three different levels of computational complexity. Real-time performance for the computations is defined as one update of both phases in 10 ms. Next, we first discuss retrieval and then, in section 5.1.1, the training phase.

In the retrieval phase a process called relaxation is run, in which the potential and activity are updated. When using the network as an autoassociative memory, the activity is initialized to a noisy or partial version of one of the stored patterns. The relaxation process has two steps: first the potential, m, is updated (eq. 2) with the current support, s (eq. 1); secondly, the new activity is computed from the potential by randomly setting a unit active according to the softmax distribution function in eq. 3. The parameters τ_m = 10 and G = 3 are fixed throughout.

    s_j = \log(\beta_j) + \sum_{h=1}^{H} \log\!\left( \sum_{k \in Q_h} w_{kj}\, o_k \right)                    (1)

    \tau_m \frac{dm_j}{dt} = s_j - m_j                                                                          (2)

    o_j \leftarrow \frac{e^{G m_j}}{\sum_{k \in Q_h} e^{G m_k}}, \qquad j \in Q_h, \quad \text{for each } h = 1, \ldots, H    (3)
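To make the relaxation step concrete, the following is a minimal serial sketch in C of equations 1-3. The dense weight layout, the forward-Euler step, and all names are illustrative assumptions; the actual implementation uses sparse adjacency lists, delayed updates, and AER-based communication as described in section 5.1.1.

    /* One relaxation step (eqs. 1-3) for a network of H hypercolumns
       with U units each.  Dense weights are used only for clarity.   */
    #include <math.h>
    #include <stdlib.h>

    void relax_step(int H, int U,
                    const double *w,       /* weights, w[k*N + j], N = H*U   */
                    const double *beta,    /* biases beta[j]                 */
                    const double *o,       /* current activity o[k], 0 or 1  */
                    double *m,             /* unit potentials m[j]           */
                    double *o_new,         /* activity after this step       */
                    double dt, double tau_m, double G)
    {
        const int N = H * U;

        /* Eqs. 1-2: support and potential update for every unit j. */
        for (int j = 0; j < N; ++j) {
            double s = log(beta[j]);
            for (int h = 0; h < H; ++h) {
                double acc = 0.0;
                for (int u = 0; u < U; ++u) {
                    int k = h * U + u;
                    acc += w[k * N + j] * o[k];      /* sum_{k in Q_h} w_kj o_k */
                }
                s += log(acc > 0.0 ? acc : 1e-12);   /* guard against log(0)    */
            }
            m[j] += dt / tau_m * (s - m[j]);         /* tau_m dm_j/dt = s_j - m_j */
        }

        /* Eq. 3: in each hypercolumn, draw one active unit from softmax(G*m). */
        for (int h = 0; h < H; ++h) {
            double Z = 0.0;
            for (int u = 0; u < U; ++u)
                Z += exp(G * m[h * U + u]);
            double r = ((double)rand() / RAND_MAX) * Z;
            double cum = 0.0;
            int winner = h * U + U - 1;
            for (int u = 0; u < U; ++u) {
                cum += exp(G * m[h * U + u]);
                if (cum >= r) { winner = h * U + u; break; }
            }
            for (int u = 0; u < U; ++u)
                o_new[h * U + u] = 0.0;
            o_new[winner] = 1.0;
        }
    }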

5.1.1 Three implementations of the training phase

Here we present three different implementations of the training phase, each with a different computational complexity. Up to three state variables, called the Z-, E-, and P-trace variables, are used to represent a unit. A connection can have up to two state variables: a P- and an E-trace. In the simplest implementation, with the lowest complexity, the units' P-trace variables are computed by leaky integrators directly from the input as in Figure 10. Here, S is the input and P is the P-trace state variable. This implementation uses delayed update of both unit and connection variables. Further, these trace variables are efficiently computed in the logarithmic domain (Johansson and Lansner, 2006a,b). The sparse connectivity is implemented with adjacency lists. We call this a P-type implementation. A connection needs 10 bytes of memory: 4 bytes for storing P_ij, 4 bytes for the index of the presynaptic unit, and 2 bytes for the time stamp associated with the delayed update.
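A compact way to picture the 10-byte connection record and the delayed, log-domain decay is sketched below in C. The field names, the packing pragma, and the per-step decay constant are illustrative assumptions, not taken from the report's source code.

    #include <stdint.h>

    #pragma pack(push, 1)
    typedef struct {
        float    logPij;     /* 4 bytes: P-trace of the connection (log domain) */
        uint32_t pre;        /* 4 bytes: index of the presynaptic unit          */
        uint16_t last_step;  /* 2 bytes: time stamp for the delayed update      */
    } p_connection;          /* 10 bytes in total                               */
    #pragma pack(pop)

    /* Delayed update: when the connection is next visited at step t, the
       exponential decay accumulated over the silent interval is applied in
       one go; in the log domain this is a linear subtraction, with
       decay_per_step roughly dt / tau_P.                                  */
    static void catch_up(p_connection *c, uint16_t t, float decay_per_step)
    {
        c->logPij   -= decay_per_step * (float)(t - c->last_step);
        c->last_step = t;
    }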

In a slightly more complex implementation, called Z-type, the units' state variables, Z and P, are computed with two leaky integrators as in Figure 11. Here, only the state variables of the connections use delayed update. A connection in the Z-type network requires as much memory as in the P-type.

In the most complex implementation, called E-type, two leaky integrators compute the state variables of the connections and three leaky integrators compute the state variables of the units, as shown in Figure 12. Here, it is not possible to use delayed update for any of the state variables, and hence the training time increases considerably.

    \tau_P \frac{dP_i}{dt} = S_i - P_i, \qquad
    \tau_P \frac{dP_j}{dt} = S_j - P_j, \qquad
    \tau_P \frac{dP_{ij}}{dt} = S_i S_j - P_{ij}

    w_{ij} = \frac{P_{ij}}{P_i P_j}, \qquad \beta_j = \log P_j

(S_i and S_j denote the presynaptic and postsynaptic activity, respectively.)

Figure 10: The minimal set of equations needed to implement an associative BCPNN, called P-type.
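As an illustration only, the Figure 10 equations discretized with a simple forward-Euler step are shown below in C. The time step dt, tau_p, and the function name are assumptions; the report's implementation additionally works in the logarithmic domain with delayed updates.

    #include <math.h>

    /* One Euler step of the P-type traces for a single connection i -> j. */
    void p_type_step(double Si, double Sj,
                     double *Pi, double *Pj, double *Pij,
                     double *wij, double *betaj,
                     double dt, double tau_p)
    {
        *Pi  += dt / tau_p * (Si      - *Pi);   /* tau_P dP_i/dt  = S_i     - P_i  */
        *Pj  += dt / tau_p * (Sj      - *Pj);   /* tau_P dP_j/dt  = S_j     - P_j  */
        *Pij += dt / tau_p * (Si * Sj - *Pij);  /* tau_P dP_ij/dt = S_i S_j - P_ij */

        *wij   = *Pij / (*Pi * *Pj);            /* w_ij   = P_ij / (P_i P_j) */
        *betaj = log(*Pj);                      /* beta_j = log P_j          */
    }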

    \tau_Z \frac{dZ_i}{dt} = S_i - Z_i, \qquad
    \tau_Z \frac{dZ_j}{dt} = S_j - Z_j

    \tau_P \frac{dP_i}{dt} = Z_i - P_i, \qquad
    \tau_P \frac{dP_j}{dt} = Z_j - P_j, \qquad
    \tau_P \frac{dP_{ij}}{dt} = Z_i Z_j - P_{ij}

    w_{ij} = \frac{P_{ij}}{P_i P_j}, \qquad \beta_j = \log P_j

(S_i and S_j denote the presynaptic and postsynaptic activity, respectively.)

Figure 11: The equations used to compute the weights, w_ij, and biases, beta_j, for the Z-type network.

In each iteration during training all connections are updated. A connection needs 12 bytes of memory; 4 bytes for storing P_ij, 4 bytes for storing E_ij, 4 bytes for the index of the presynaptic unit, and 2 bytes for the time stamp.

5.1.2 Implementation details

The abstract model was implemented with a focus on efficiency. Four techniques were used to achieve good performance: delayed updating of the units and connections, computation of the leaky integrators in the logarithmic domain, adjacency lists for effective indexing of the units in the sparse connectivity matrix, and AER for effective communication. We used the programming language C with calls to two standard routines in the MPI application programming interface: MPI_Bcast() and MPI_Allgather().
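A hedged sketch of how such AER-style activity exchange with MPI_Allgather() might look is given below. The buffer layout, the names, and the one-winner-per-hypercolumn convention are our assumptions for illustration only.

    #include <mpi.h>

    /* Each process owns hc_per_rank hypercolumns.  After local retrieval it
       contributes the index of the single active (spiking) unit in each of
       them; MPI_Allgather() then gives every process the complete activity
       pattern as a list of winner indices -- one int per hypercolumn
       instead of a full real-valued activity vector.                      */
    void exchange_activity(int *local_winners, int hc_per_rank,
                           int *all_winners, MPI_Comm comm)
    {
        MPI_Allgather(local_winners, hc_per_rank, MPI_INT,
                      all_winners,   hc_per_rank, MPI_INT, comm);
    }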

Further, a fixed-point arithmetic implementation of the BCPNN has been developed. It is primarily intended for hardware implementations, but it can also be advantageous in other applications since it reduces the memory requirements by 30%.


    \tau_Z \frac{dZ_i}{dt} = S_i - Z_i, \qquad
    \tau_Z \frac{dZ_j}{dt} = S_j - Z_j

    \tau_E \frac{dE_i}{dt} = Z_i - E_i, \qquad
    \tau_E \frac{dE_j}{dt} = Z_j - E_j, \qquad
    \tau_E \frac{dE_{ij}}{dt} = Z_i Z_j - E_{ij}

    \tau_P \frac{dP_i}{dt} = E_i - P_i, \qquad
    \tau_P \frac{dP_j}{dt} = E_j - P_j, \qquad
    \tau_P \frac{dP_{ij}}{dt} = E_{ij} - P_{ij}

    w_{ij} = \frac{P_{ij}}{P_i P_j}, \qquad \beta_j = \log P_j

(S_i and S_j denote the presynaptic and postsynaptic activity, respectively.)

Figure 12: The equations of the network with the highest complexity, called E-type.


5.2 Results

Since a node on BG/L only has 1/16 of the memory of a node on Lenngren, the scaling experiments on BG/L were set up to use 16 times more nodes than the corresponding experiments on Lenngren. This means that the programs filled the memory on every node on both clusters. Scaling was studied under the condition that the problem size was increased together with the number of processors, which is referred to as scaled speedup. If the program parallelizes perfectly, the scaled speedup measure is constant with increasing number of processors.
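Stated as a formula (our notation, not from the report), with n units of work per processor and p processors, perfect scaled speedup means that the measured time per iteration stays flat:

    T_{\mathrm{iter}}(n \cdot p,\; p) \approx T_{\mathrm{iter}}(n,\; 1) \quad \text{for all } p,

so in the plots below a horizontal curve indicates ideal scaling, while any upward slope indicates parallel overhead.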

We did not use the speedup measure in which the running time for a problem of constant size is plotted as the number of processors is increased, the reason being that the application studied has a relatively large, fixed part that runs on each node. When the problem size is scaled over more than an order of magnitude, this part influences the running times for the smaller networks.

Neither did we plot the scaling of the parallel program relative to an optimal implementation on a single processor. This is because the problem is very memory intensive and cannot be run on a single processor: we need a cluster computer not only for the processing power but also for the available memory resources.

          Train     Retrieval   Total
P-type    0.0130    0.0196      0.0326
Z-type    0.0270    0.0825      0.109
E-type    2.29      0.0538      2.344

Table 2: The iteration times (in seconds) for the three network complexities on 128 processors on Lenngren.

          Train     Retrieval   Total
P-type    0.007     0.0184      0.0254
Z-type    0.018     0.0736      0.0916
E-type    0.718     0.0415      0.756

Table 3: The iteration times (in seconds) for the three network complexities on 2048 processors (16 · 128) on BG/L.


The problem size was varied in two ways. First, the number of units was increased, with each unit having a constant number of connections. Second, the number of connections per unit was increased, holding the number of units constant. Below, we present results for both types of scaling.

5.2.1 Running times for a mouse-sized network

Here we run a network with 2048 hypercolumns and 1.2 · 10^5 connections per unit. This network is 30% larger than a mouse-cortex-equivalent network (Johansson and Lansner, 2004a,b).

Table 2 shows the iteration times on 128 processors on Lenngren. Sixteen hypercolumns were allocated on each processor. Running the Z-type network was 3.3 times slower, and the E-type network 72 times slower, than the P-type network.

Table 3 shows the iteration times on 2048 processors on BG/L. One hypercolumn was allocated on each processor. Running the Z-type network was 3.6 times slower, and the E-type network 30 times slower, than the P-type network.

The more complex the units and connections were, the better the scaling became, because more computation was done in each iteration. For the simplest network, the computations needed to administrate the sparse connectivity and to compute the resident part of the network on each processor were noticeable.

The retrieval times for the Z-type network were the longest because two logarithms are computed for each weight in the synaptic summation for each unit. For the P-type network no logarithm needs to be computed in the synaptic summation, and for the E-type network one logarithm is computed for each weight.

5.2.2 Scaling of Hypercolumns

Figure 13 shows the scaling on Lenngren when the number of hypercolumns was increased. The number of connections per unit was fixed to 4 · 10^4. The retrieval times scaled linearly, with a factor smaller than the rate by which the number of processors was increased. For the P- and Z-type networks, the training times also scaled linearly, but slightly faster than the rate by which the number of processors was increased. The Z-type network does not have delayed update of the units' state variables and consequently scales more poorly than the P-type network. For the E-type network, the training times scaled no faster than the increase in processing power, because the training phase of the E-type network is much more computationally demanding than for the other two network types.

Figure 14 shows the scaling on BG/L for the same three networks, run on 16 times more processors. The general results were the same, with one exception: the training times for the computationally intensive E-type network increased slightly faster than the processing power. The probable cause is that the execution time of the resident part of the problem became proportionally larger as the problem was divided into a smaller computational grain size.

5.2.3 Scaling of Connections

Figure 15 shows the scaling on Lenngren of a P-type network when the number of connections per unit was increased. The number of hypercolumns was fixed to 2048. Figure 16 shows the corresponding iteration times on BG/L, where 16 times more processors were used than on Lenngren. From these results we conclude that neither the training nor the retrieval times scale faster than the increase in processing power when more connections are added.

5.2.4 A network model of record size

The largest simulation was a P-type network run on 256 nodes on Lenngren. This network had 1.6 million units and 200 billion connections. The iteration time of the training phase was 51 ms and that of the retrieval phase was 61 ms, which is close to real-time performance. The network was used for pattern completion and noise reduction on color images. A testimony to the code's efficiency is that it set the unofficial record in power use on the Lenngren cluster: Lenngren used more power to run this network than it used when running the benchmark programs for the top-500 supercomputer list. The power usage peaked at around 120 kW, which means that the average connection drew 6 · 10^-7 W. Translated into the biological equivalent, this is roughly 10^-8 W per synapse. The corresponding figure for the human brain is 3 · 10^-14 W per synapse, given that the human body consumes energy at a rate of 150 W (Henry, 2005) and that 20% of this energy is used by the 10^11 neurons and 10^15 synapses of the brain (Kandel et al., 2000).
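As a back-of-the-envelope restatement of these numbers (using the 2 · 10^11 connections of this network and the quoted metabolic figures):

    \frac{1.2 \times 10^{5}\,\mathrm{W}}{2 \times 10^{11}\ \text{connections}} = 6 \times 10^{-7}\ \mathrm{W\ per\ connection},
    \qquad
    \frac{0.2 \times 150\,\mathrm{W}}{10^{15}\ \text{synapses}} = 3 \times 10^{-14}\ \mathrm{W\ per\ synapse}.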

[Three plots of time per iteration (s) versus number of processors, each with curves for training and retrieval.]

Figure 13: The iteration times on Lenngren for networks with 64 hypercolumns per processor. The top plot is for P-type, the middle for Z-type, and the lower plot for E-type networks. The number of connections per unit was fixed to 40000.


[Three plots of time per iteration (s) versus number of processors, each with curves for training and retrieval.]

Figure 14: The iteration times on BG/L for networks with four hypercolumns per processor. The top plot is for P-type, the middle for Z-type, and the lower plot for E-type networks. The number of connections per unit was fixed to 40000.

[Plot of time per iteration versus number of processors, with curves for training and retrieval.]

Figure 15: The iteration times on Lenngren for a P-type network with a constant number of 2048 hypercolumns.


6 Discussion

A reason sometimes given for staying away from large-scale neuronal network models is that there would be no point in building a biophysically detailed model of a brain-scale neuronal network, since the model would be as complex as the system it represents and equally hard to understand. This is a grave misconception. On the contrary, if we actually had an exact quantitative model of a human brain in the computer, which, of course, is impossible, it would be extremely beneficial: we would have full access to every nitty-gritty detail of this "brain", which would dramatically speed up progress in understanding this extremely complex structure. Unfortunately, or perhaps fortunately, this is a complete utopia, and we have to make do with much simpler models for the foreseeable future.


[Plot of time per iteration versus number of processors, with curves for training and retrieval.]

Figure 16: The iteration times on BG/L for a P-type network with a constant number of 2048 hypercolumns.


6.1 Trade-off between numbers and complexity

The starting point for network modeling is the component cells and their synaptic interactions, including transmission, conduction delays, and synaptic plasticity. The level of complexity of the model neurons now becomes more of an issue. Given our limited computational resources, there is a trade-off between the complexity of the cell models and the size of the network. It is a painful fact that a high-end PC of today gets unbearably slow as the model network size grows beyond a few hundred cells.

There is some light in the tunnel, however, since one reason for avoiding large-scale or full-scale simulations is now gradually disappearing. Although the reduction of physical dimensions of integrated circuits and increasing clock speeds may be approaching their limits, the number of processors operating in parallel in computers is now starting to increase dramatically. Parallel computing is beginning to reach the consumer market, resulting in a drop in prices, and it is quite likely that we will have desktop boxes with hundreds of parallel processors within the next five years. Since neuronal networks are computationally homogeneous structures, parallel simulation is relatively straightforward. Thus, in the near future, computer power will no longer prevent us from putting together and studying large and even full-scale models of global brain networks.

6.2 Number of free parameters

A real and unavoidable, but disturbing, fact is that the number of free parameters increases as you go from a single cell model to a network with a number of cell types and different types of interactions between them. Even rather simplistic, conductance-based multi-compartmental model neurons comprise tens of equations and hundreds of parameters. The synapses in a big network model are far more numerous, and they too require equations and parameters. Thus, any brain-scale network model would contain on the order of billions of parameters. Even with advanced, automated parameter search it would be hopeless to find a reasonably adequate combination of such massive numbers of parameters. This appears to put unreachable demands on the availability of experimental data to constrain the model. Fortunately, the situation is not that bad. Neuronal networks are typically described as comprising a quite limited number of cell types, with basically the same properties but some variation within the population. The parameters used for one neuron of a certain type are likely to serve for the others as well, possibly with some compact description of a distribution around a mean. This holds for the synaptic interactions as well, though now we need to consider pairs of cell types. Moreover, synaptic conductances are not determined arbitrarily, but are presumably the result of the action of some kind of, so far unknown, learning rule coupling them to historical network activity.

Thus, the good news is that the number of truly free parameters is to some extent independent of the actual number of neurons and synapses used in a model. The same averages, distributions, and learning rules hold for the huge network model as well as the tiny one. Knowledge about the distribution of cell and synaptic properties of course becomes important. On the other hand, the actual details of, e.g., the dendritic arborization of one individual cell are less important than in single cell modeling, since this is a typical thing that varies within the population. What really increases with increasing numbers is the match between model and reality.


6.3 Cost of complex cell models in large-scale simulations

Large-scale network models shift the balance so that synaptic complexity takes over as the limiting factor. Somewhat unintuitively, for really large networks it comes with little extra cost to have complex cell models! Contrary to the case for single cells and small networks, the solution methods used for, e.g., dendritic integration do not add significantly to the cost of computation, since synaptic computation dominates.

Furthermore, perhaps surprisingly, parallel neural simulation is often bound by local computation and not by inter-neuronal communication. A factor of major influence on communication speed is whether neuronal interactions can be represented by spiking events or whether they are graded. It is crucial for scalability that the action potential can be represented as a binary event and that we can use AER (Address Event Representation) based communication (Bailey and Hammerstrom, 1988). As soon as we need to incorporate graded cell-to-cell interaction of some sort, like gap junctions or sub-threshold transmitter release, parallel simulations need to be organized very differently and the neuronal interaction becomes potentially performance limiting. Even so, parallelization of the model will still be beneficial, and good simulator design will be able to provide reasonable simulation times in such cases as well. Given this development, we are likely to see an increasing use of complex network modeling in neuroscience in the coming years.

6.4 Mapping the model onto the Blue Gene/L torus

In BG/L, every node is connected to six neighbors through the 3D torus network. Communication is particularly efficient between nodes a small number of hops apart. Since SPLIT makes extensive use of pairwise communication, it should be possible to gain efficiency from this, but we have not yet explored how to map our model onto the torus. As described above, the SPLIT library already has support for specifying the mapping from neurons to slave processes. The solution should be fairly straightforward since our cortex model has an inherent 2D geometry where most data is exchanged between 2D neighbors. The problem thus reduces to folding a 2D sheet onto the 3D torus.
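Purely as an illustration of the folding idea, and not something explored in this work, one hypothetical scheme is to cut the 2D sheet into slabs and assign one slab per torus Z plane, relying on the torus wraparound to keep slab boundaries close. All names and the divisibility assumptions below are ours.

    /* Hypothetical sketch: map a point (x, y) of an NX x NY sheet onto a
       PX x PY x PZ torus.  The sheet is cut into PZ slabs along y; within
       a slab, x maps to the torus X dimension and y to Y.  Assumes NX and
       NY are multiples of PX and PY*PZ, respectively.                     */
    typedef struct { int X, Y, Z; } torus_coord;

    torus_coord map_sheet_to_torus(int x, int y, int NX, int NY,
                                   int PX, int PY, int PZ)
    {
        torus_coord c;
        int slab_height = NY / PZ;      /* sheet rows handled per Z plane */
        c.Z = y / slab_height;          /* which slab, i.e. which Z plane */
        c.Y = (y % slab_height) * PY / slab_height;
        c.X = x * PX / NX;
        return c;
    }

Under such a mapping, most 2D neighbors in the sheet end up on the same node or on nodes one or two hops apart, since the boundary between consecutive slabs is bridged by the Y wraparound of the torus.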

7 Conclusions

Our study demonstrates that supercomputing power is now sufficient to simulate neuronal networks with numbers of model neurons and synapses approaching those in the brains of small mammals. Most importantly, this can be done without totally sacrificing the level of biological detail in the neuron and synapse models. The model neurons used in our simulations were multi-compartmental and of a conductance-based, Hodgkin-Huxley type, representing an intermediate level of biological detail.

The largest network model simulated so far comprises eight million neurons and four billion synaptic connections. Considering that we only simulate specific cell types in one cortical layer, this corresponds to 240000 minicolumns, or almost half of the rat neocortex (Johansson and Lansner, 2004a). To our knowledge this is, by far, the largest simulation of this type ever performed. It takes about one and a half hours to simulate one second of dynamic network activity. Simulation time is dominated by local computation, and there is a significant potential gain from further parallelization of the local computation.

We have also shown that an abstract, connectionist-type model of neocortex has linear scaling characteristics, both when the number of units and when the number of connections is increased. Generally, the model scales as ZH, where Z is the number of connections per unit and H is the number of hypercolumns. The largest network simulated had about a million units and 200 billion connections. It takes about ten seconds to simulate one second of network activity, including synaptic plasticity. We also conclude that the simulated network requires five to six orders of magnitude more energy per synapse than its biological counterpart, the human brain.

Four techniques were used to achieve good performance: delayed update, computation of the leaky integrators in the logarithmic domain, adjacency lists for effective indexing of the units in the sparse connectivity matrix, and AER for effective communication. For future explorations of the brain's cognitive functions, this type of abstract model is an important tool.

Finally, we conclude that the simulation tools described here can efficiently utilize the massive parallelism in today's fastest computer, Blue Gene/L.

8 Acknowledgments

We would like to thank Carl G. Tengwall and Erling Weibust, Blue Gene Solutions, IBM Deep Computing EMEA, IBM Svenska AB, and Cindy Mestad, James C. Pischke, Steven M. Westerbeck, and Michael Hennecke for excellent support when running on the Blue Gene/L installation at the Deep Computing Capacity on Demand Center, IBM, Rochester. Many thanks also to PDC, KTH, and SNIC for access to the Dell cluster Lenngren.

References

Amit, D. (1989). Modeling Brain Function: The World of Attractor Neural Networks. Cambridge University Press, New York.

Anderson, J. S., Lampl, I., Gillespie, D. C., and Ferster, D. (2001). Membrane potential and conductance changes underlying length tuning of cells in cat primary visual cortex. J Neurosci, 21(6):2104-12.

Bailey, J. and Hammerstrom, D. (1988). Why VLSI implementations of associative VLCNs require connection multiplexing. In International Conference on Neural Networks, San Diego, U.S.A.

Blakeslee, B. and McCourt, M. E. (2004). A unified theory of brightness contrast and assimilation incorporating oriented multiscale spatial filtering and contrast normalization. Vision Res, 44(21):2483-503.

Bower, J. M. and Beeman, D. (1998). The Book of GENESIS: Exploring Realistic Neural Models with the GEneral NEural SImulation System. Springer-Verlag, New York, 2nd edition. ISBN 0-387-94938-0.

Churchland, P. S. and Sejnowski, T. J. (1992). The Computational Brain. The MIT Press, Cambridge, Massachusetts.

Compte, A., Brunel, N., Goldman-Rakic, P. S., and Wang, X. J. (2000). Synaptic mechanisms and network dynamics underlying spatial working memory in a cortical network model. Cereb Cortex, 10(9):910-23.

Cossart, R., Aronov, D., and Yuste, R. (2003). Attractor dynamics of network up states in the neocortex. Nature, 423(6937):283-8.

Deiss, S. R., Douglas, R. J., and Whatley, A. M. (1999). A pulse-coded communication infrastructure for neuromorphic systems. In Maass, W. and Bishop, C. M., editors, Pulsed Neural Networks, pages 157-178. The MIT Press, Cambridge, Massachusetts.

Djurfeldt, M., Rehn, M., Lundqvist, M., and Lansner, A. (2006). Attractor dynamics in a modular network model of neocortex. In preparation.

Douglas, R. J. and Martin, K. A. (2004). Neuronal circuits of the neocortex. Annu Rev Neurosci, 27:419-51.

Fransen, E. and Lansner, A. (1995). Low spiking rates in a population of mutually exciting pyramidal cells. Network: Computation in Neural Systems, 6(2):271-288.

Fransen, E. and Lansner, A. (1998). A model of cortical associative memory based on a horizontal network of connected columns. Network: Computation in Neural Systems, 9:235-264.

Frye, J. (2005). http://brain.cse.unr.edu/ncsdocs/.

Fuster, J. M. (1995). Memory in the Cerebral Cortex. The MIT Press, Cambridge, Massachusetts.

Gara, A., Blumrich, M. A., Chen, D., Chiu, G. L.-T., Coteus, P., Giampapa, M. E., Haring, R. A., Heidelberger, P., Hoenicke, D., Kopcsay, G. V., Liebsch, T. A., Ohmacht, M., Steinmacher-Burow, B. D., Takken, T., and Vranas, P. (2005). Overview of the Blue Gene/L system architecture. IBM Journal of Research and Development, 49(2/3):195-212.

Gross, J. L. and Yellen, J. (1998). Graph Theory and Its Applications. CRC Press, Boca Raton, Florida, 2nd edition.

Haberly, L. B. and Bower, J. M. (1989). Olfactory cortex: model circuit for study of associative memory? Trends Neurosci, 12(7):258-64.

Hammarlund, P. and Ekeberg, O. (1998). Large neural network simulations on multiple hardware platforms. J Comput Neurosci, 5(4):443-59.

Henry, C. (2005). Basal metabolic rate studies in humans: measurement and development of new equations. Public Health Nutr, 8(7A):1133-52.

Hertz, J., Krogh, A., and Palmer, R. (1991). Introduction to the Theory of Neural Computation. Addison-Wesley, Redwood City, CA.

Hines, M. and Carnevale, N. T. (1997). The NEURON simulation environment. Neural Comput., 9:1179-1209.

Hopfield, J. J. (1982). Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences, USA, 79:2554-2558.


Hubel, D. H. and Wiesel, T. N. (1977). Ferrier lecture: Functional architecture of macaque monkey visual cortex. Proc R Soc Lond B Biol Sci, 198(1130):1-59.

Johansson, C. and Lansner, A. (2004a). Towards cortex sized artificial nervous systems. In Negoita, M. G., Howlett, R. J., and Jain, L. C., editors, Knowledge-Based Intelligent Information and Engineering Systems, volume 3213 of Lecture Notes in Artificial Intelligence, pages 959-966, Berlin. Springer.

Johansson, C. and Lansner, A. (2004b). Towards cortex sized attractor ANN. In Ijspeert, A. J., Masayuki, M., and Naoki, W., editors, First International Workshop on Biologically Inspired Approaches to Advanced Information Technology, volume 3141 of Lecture Notes in Computer Science, pages 63-79, Berlin. Springer.

Johansson, C. and Lansner, A. (2006a). A fixed-point arithmetic implementation of an exponentially weighted moving average. Submitted to IEEE Trans. on Neural Networks.

Johansson, C. and Lansner, A. (2006b). Towards cortex sized artificial neural systems. Submitted to Neural Networks.

Kandel, E. R., Schwartz, J. H., and Jessell, T. M., editors (2000). Principles of Neural Science. McGraw-Hill, New York, 4th edition.

Kawaguchi, Y. (1995). Physiological subgroups of nonpyramidal cells with specific morphological characteristics in layer II/III of rat frontal cortex. J. Neurosci., 15:2638-2655.

Kawaguchi, Y. and Kubota, Y. (1993). Correlation of physiological subgroupings of nonpyramidal cells with parvalbumin- and calbindin D28k-immunoreactive neurons in layer V of rat frontal cortex. J Neurophysiol, 70(1):387-96.

Lansner, A. and Ekeberg, O. (1989). A one-layer feedback artificial neural network with a Bayesian learning rule. International Journal of Neural Systems, 1(1):77-87.

Lansner, A. and Fransen, E. (1992). Modeling Hebbian cell assemblies comprised of cortical neurons. Network: Computation in Neural Systems, 3:105-119.

Lansner, A., Fransen, E., and Sandberg, A. (2003). Cell assembly dynamics in detailed and abstract attractor models of cortical associative memory. Theory Biosci, 122:19-36.

Lansner, A. and Holst, A. (1996). A higher order Bayesian neural network with spiking units. International Journal of Neural Systems, 7(2):115-128.

Lundqvist, M., Rehn, M., Djurfeldt, M., and Lansner, A. (2006). Attractor dynamics in a modular network model of neocortex. Network: Computation in Neural Systems. Submitted.

Markram, H., Toledo-Rodriguez, M., Wang, Y., Gupta, A., Silberberg, G., and Wu, C. (2004). Interneurons of the neocortical inhibitory system. Nat Rev Neurosci, 5(10):793-807.

Mattia, M. and Giudice, P. (2000). Efficient event-driven simulation of large networks of spiking neurons and dynamical synapses. Neural Computation, 12:2305-2329.

Morrison, A., Mehring, C., Geisel, T., Aertsen, A., and Diesmann, M. (2005). Advancing the boundaries of high-connectivity network simulation with distributed computing. Neural Computation, 17:1776-1801.

Mortara, A. and Vittoz, E. (1994). A communication architecture tailored for analog VLSI artificial neural networks: intrinsic performance and limitations. IEEE Transactions on Neural Networks, 5(3):459-466.

Mountcastle, V. B. (1978). An organizing principle for cerebral function: The unit module and the distributed system. In Edelman, G. M. and Mountcastle, V. B., editors, The Mindful Brain. The MIT Press, Cambridge, Massachusetts.

Palm, G. (1982). Neural Assemblies: An Alternative Approach to Artificial Intelligence. Springer.

Peters, A. and Sethares, C. (1991). Organization of pyramidal neurons in area 17 of monkey visual cortex. J Comp Neurol, 306(1):1-23.

Rolls, E. T. and Treves, A. (1998). Neural Networks and Brain Function. Oxford University Press, New York.

Sandberg, A., Lansner, A., Petersson, K. M., and Ekeberg, O. (2002). Bayesian attractor networks with incremental learning. Network: Computation in Neural Systems, 13:179-194.

Sandberg, A., Tegner, J., and Lansner, A. (2003). A working memory model based on fast Hebbian learning. Network, 14(4):789-802.


Tam, A. and Wang, C. (2000). Efficient scheduling of complete exchange on clusters. In 13th International Conference on Parallel and Distributed Computing Systems, Las Vegas, USA.

Treves, A. and Rolls, E. T. (1994). Computational analysis of the role of the hippocampus in memory. Hippocampus, 4(3):374-91.

Willshaw, D. and Longuet-Higgins, H. (1970). Associative memory models. In Meltzer, B. and Michie, D., editors, Machine Learning, volume 5. Edinburgh University Press, Edinburgh, Scotland.
