[Berger, T] Living Information Theory

Transcript of [Berger, T] Living Information Theory

LIVING INFORMATION THEORY
The 2002 Shannon Lecture
by Toby Berger
School of Electrical and Computer Engineering, Cornell University, Ithaca, NY 14853

1 Meanings of the Title

The title, "Living Information Theory," is a triple entendre. First and foremost, it pertains to the information theory of living systems. Second, it symbolizes the fact that our research community has been living information theory for more than five decades, enthralled with the beauty of the subject and intrigued by its many areas of application and potential application. Lastly, it is intended to connote that information theory is decidedly alive, despite sporadic protestations to the contrary. Moreover, there is a thread that ties together all three of these meanings for me. That thread is my strong belief that one way in which information theorists, both new and seasoned, can assure that their subject will remain vitally alive deep into the future is to embrace enthusiastically its applications to the life sciences.

2 Early History of Information Theory in Biology

In the 1950's and early 1960's a cadre of scientists and engineers were adherents of the premise that information theory could serve as a calculus for living systems. That is, they believed information theory could be used to build a solid mathematical foundation for biology, which always had occupied a peculiar middle ground between the hard and the soft sciences. International meetings were organized by Colin Cherry and others to explore this frontier, but by the mid-1960's the effort had dissipated. This may have been due in part to none other than Claude Shannon himself, who in his guest editorial, The Bandwagon, in the March 1956 issue of the IRE Transactions on Information Theory stated:

Information theory has ... perhaps ballooned to an importance beyond its actual accomplishments. Our fellow scientists in many different fields, attracted by the fanfare and by the new avenues opened to scientific analysis, are using these ideas in ... biology, psychology, linguistics, fundamental physics, economics, the theory of the organization, ... Although this wave of popularity is certainly pleasant and exciting for those of us working in the field, it carries at the same time an element of danger. While we feel that information theory is indeed a valuable tool in providing fundamental insights into the nature of communication problems and will continue to grow in importance, it is certainly no panacea for the communication engineer or, a fortiori, for anyone else. Seldom do more than a few of nature's secrets give way at one time.

More devastating was Peter Elias's scathing 1958 editorial in the same journal, Two Famous Papers, which in part read:

The first paper has the generic title Information Theory, Photosynthesis and Religion ... written by an engineer or physicist ... I suggest we stop writing [it], and release a supply of man power to work on ... important problems which need investigation.

The demise of the nascent community that was endeavoring to inject information theory into mainstream biology probably was occasioned less by these "purist" information theory editorials than by the relatively primitive state of quantitative biology at the time. Note in this regard that:

1. The structure of DNA was not determined by Crick and Watson until five years after Shannon published A Mathematical Theory of Communication.

2. It was not possible to measure even a single neural pulse train with millisecond accuracy; contrastingly, today it is possible simultaneously to record accurately in vivo the pulse trains of many neighboring neurons as an aid to developing an information theory of real neural nets.

3. It was not possible to measure time variations in the concentrations of chemicals at sub-millisecond speeds in volumes of submicron dimensions such as those which constitute ion channels in neurons. This remains a stumbling block, but measurement techniques capitalizing on fluorescence and other phenomena are steadily progressing toward this goal.

We offer arguments below to support the premise that matters have progressed to a stage at which biology is positioned to profit meaningfully from an invasion by information theorists. Indeed, during the past decade some biologists have equipped themselves with more than a surface knowledge of information theory and are applying it correctly and fruitfully to selected biological subdisciplines, notable among which are genomics and neuroscience. Since our interest here is in the information theory of sensory perception, we will discuss neuroscience and eschew genomics.

3 Information Within Organisms

At a fundamental level, information in a living organism is instantiated in the time variations of the concentrations of chemical and electrochemical species (ions, molecules and compounds) in the compartments that comprise the organism. Chemical thermodynamics and statistical mechanics tell us that these concentrations are always tending toward a multiphase equilibrium characterized by minimization of the Helmholtz free energy functional. On the other hand, complete equilibrium with the environment never is attained, both because the environment constantly changes and because the organism must exhibit homeostasis in order to remain "alive". A fascinating dynamic prevails in which the organism sacrifices internal energy in order to reduce its uncertainty about the environment, which in turn permits it to locate new sources of energy and find mates with whom to perpetuate the species. This is one of several considerations that strongly militate in favor of looking at an information gain by a living system never in absolute terms but rather always relative to the energy expended to achieve it.

There is, in addition, an intriguing mathematical analogy between the equations that govern multiphase equilibrium in chemical thermodynamics and those which specify points on Shannon's rate-distortion function of an information source with respect to a fidelity criterion [9].

This analogy is not, in this writer's opinion, just a mathematical curiosity but rather is central to fruitfully "bringing information theory to life." We shall not be exploring this analogy further here, however. This is because, although it provides an overarching theoretical framework, it operates on a level which does not readily lead to concrete results apropos our goal of developing an information-theoretically based formulation of sensory perception.

An information theorist venturing into new territory must treat that territory with respect. In particular, one should not assume that, just because the basic concepts and methods developed by Shannon and his disciples have proved so effective in describing the key features of man-made communication systems, they can be applied en masse to render explicable the long-standing mysteries of another discipline. Rather, one must think critically about information-theoretic concepts and methods and then apply only those that genuinely transfer to the new territory. My endeavors in this connection to date have led me to the following two beliefs:

- Judicious application of Shannon's fundamental concepts of entropy, mutual information, channel capacity and rate-distortion is crucial to gaining an elevated understanding of how living systems handle sensory information.

- Living systems have little if any need for the elegant block and convolutional coding theorems and techniques of information theory because, as will be explained below, organisms have found ways to perform their information handling tasks in an effectively Shannon-optimum manner without having to employ coding in the information-theoretic sense of the term.

Is it necessary to learn chemistry, biochemistry, biophysics, neuroscience, and such before one can make any useful contributions? The answer, I feel, is "Yes, but not deeply." The object is not to get to the point where you can think like a biologist. The object is to get to the point where you can think like the biology. The biology has had hundreds of millions of years to evolve via natural selection to a point at which much of that which it does is done in a nearly optimum fashion. Hence, thinking about how the biology should do things is often effectively identical to thinking about how the biology does do things, and is perhaps even a more fruitful endeavor. [Footnote 1]

Information theorists are fond of figuring out how best to transmit information over a "given" channel. When trespassing on biological turf, however, an information theorist must abandon the tenet that the channel is given. Quite to the contrary, nature has evolved the channels that function within organisms in response to needs for specific information residing either in the environment or in the organism itself - channels for sight, channels for sound, for olfaction, for touch, for taste, for blood alcohol and osmolality regulation, and so on. Common sense strongly suggests that biological structures built to sense and transfer information from certain sites located either outside or inside the organism to other such sites will be efficiently "matched" to the data sources they service. Indeed, it would be ill-advised to expect otherwise, since natural selection rarely chooses foolishly, especially in the long run. The compelling hypothesis, at least from my perspective, is that all biological channels are well matched to the information sources that feed them.

Footnote 1: Elwyn Berlekamp related at IEEE ISIT 2001 in Washington, DC, a conversation he had with Claude Shannon in an MIT hallway in the 1960's, the gist of which was:
CES: Where are you going, Elwyn?
EB: To the library to study articles, including some of yours.
CES: Oh, don't do that. You'd be better off to just figure it out for yourself.

4 Double Matching of Sources and Channels

Matching a channel to a source has a precise mathematical meaning in information theory. Let us consider the simplest case of a discrete memoryless source (dms) with instantaneous letter probabilities $\{p(u), u \in \mathcal{U}\}$ and a discrete memoryless channel (dmc) with instantaneous transition probabilities $\{p(y|x), x \in \mathcal{X}, y \in \mathcal{Y}\}$. Furthermore, let us suppose that the channel's purpose is to deliver a signal $\{Y_k\}$ to its output terminal on the basis of which one could construct an approximation $\{V_k\}$ to the source data $\{U_k\}$ that is accurate enough for satisfactory performance in some application of interest. Following Shannon, we shall measure said accuracy by means of a distortion measure $d : \mathcal{U} \times \mathcal{V} \to [0,\infty]$. $\{V_k\}$ will be considered to be a sufficiently accurate approximation of $\{U_k\}$ if and only if the average distortion does not exceed a level deemed to be tolerable, which we shall denote by $D$. Stated mathematically, our requirement for an approximation to be sufficiently accurate is
$$\lim_{n\to\infty} E\, n^{-1} \sum_{k=1}^{n} d(U_k, V_k) \le D.$$

In order for the dmc $\{p(y|x)\}$ to be instantaneously matched to the combination of the dms $\{p(u)\}$ and the distortion measure $\{d(u,v)\}$ at fidelity $D$, the following requirements must be satisfied:

1. The number of source letters produced per second must equal the number of times per second that the channel is available for use.

2. There must exist two transition probability matrices $\{r(x|u), u \in \mathcal{U}, x \in \mathcal{X}\}$ and $\{w(v|y), y \in \mathcal{Y}, v \in \mathcal{V}\}$, such that the end-to-end transition probabilities
$$q(v|u) := \sum_{x\in\mathcal{X}} \sum_{y\in\mathcal{Y}} r(x|u)\, p(y|x)\, w(v|y), \quad (u,v) \in \mathcal{U}\times\mathcal{V}$$
solve the variational problem that defines the point $(D, R(D))$ on Shannon's rate-distortion function of the dms $\{p(u)\}$ with respect to the distortion measure $\{d(u,v)\}$.

Readers not conversant with rate-distortion theory should refer to Section 11 below. If that does not suffice, they should commune at their leisure with Shannon [4], Jelinek [10], Gallager [11] or Berger [9]. However, the two key examples that follow should be largely accessible to persons unfamiliar with the content of any of these references. Each example is constructed on a foundation comprised of two of Shannon's famous formulas.

5 Two Key Examples of Double Matching

Example 1. This example uses: (1) the formula for the capacity of a binary symmetric channel (BSC) with crossover probability $\epsilon$, namely
$$C = 1 - h(\epsilon) = 1 + \epsilon \log_2 \epsilon + (1-\epsilon) \log_2(1-\epsilon) \text{ bits/channel use},$$
where we assume without loss of essential generality that $\epsilon \le 1/2$, and (2) the formula for the rate-distortion function of a Bernoulli-1/2 source with respect to the error frequency distortion measure $d(x,y) = 1 - \delta(x,y)$, namely
$$R(D) = 1 - h(D) = 1 + D \log_2 D + (1-D) \log_2(1-D) \text{ bits/source letter}, \quad 0 \le D \le 1/2.$$

Shannon's converse channel coding theorem [1] establishes that it is not possible to send more than $nC$ bits of information across the channel via $n$ channel uses. Similarly, his converse source coding theorem [4] establishes that it is not possible to generate an approximation $V_1,\dots,V_n$ to source letters $U_1,\dots,U_n$ that has an average distortion $E\, n^{-1}\sum_{k=1}^n d(U_k,V_k)$ of $D$ or less unless that representation is based on $nR(D)$ or more bits of information about these source letters. Accordingly, assuming the source resides at the channel input, it is impossible to generate an approximation to it at the channel output that has an average distortion any smaller than the value of $D$ for which $R(D) = C$, even if the number $n$ of source letters and channel uses is allowed to become large. Comparing the above formulas for $C$ and $R(D)$, we see that no value of average distortion less than $\epsilon$ can be achieved. This is true regardless of how complicated an encoder we place between the source and the channel, how complicated a decoder we place between the channel and the recipient of the source approximation, and how large a finite delay we allow the system to employ.

It is easy to see, however, that $D = \epsilon$ can be achieved simply by connecting the source directly to the channel input and using the channel output as the approximate reconstruction of the source output. Hence, this trivial communication system, which is devoid of any source or channel coding and operates with zero delay, is optimum in this example. There are two reasons for this:

Reason One: The channel is instantaneously matched to the source as defined above, with the particularly simple structure that $\mathcal{X} = \mathcal{U}$, $\mathcal{V} = \mathcal{Y}$, $r(x|u) = \delta(u,x)$ and $w(v|y) = \delta(y,v)$. That is, the source is instantaneously and deterministically fed into the channel, and the channel output directly serves as the approximation to the source.

Reason Two: The source also is matched to the channel in the sense that the distribution of each $U_k$, and hence of each $X_k$, is $p(0) = p(1) = 1/2$, which distribution maximizes the mutual information between a channel input and the corresponding output. That is, the channel input letters are i.i.d. with their common distribution being the one that solves the variational problem that defines the channel's capacity.
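As a concrete check on Example 1, the short sketch below is an illustration added to this transcript (it is not part of the lecture, and the value of $\epsilon$ is arbitrary). It evaluates $C$ and $R(D)$, notes that $R(D) = C$ forces $D = \epsilon$, and simulates the coding-free scheme that wires the Bernoulli-1/2 source directly into the BSC.

```python
import numpy as np

def h2(p):
    """Binary entropy in bits; clipped away from 0 and 1 for numerical safety."""
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

eps = 0.11                       # illustrative BSC crossover probability
C = 1 - h2(eps)                  # channel capacity, bits per channel use
R = lambda D: 1 - h2(D)          # rate-distortion function of the Bernoulli-1/2 source

# R(D) = C  <=>  h2(D) = h2(eps)  <=>  D = eps  (both restricted to [0, 1/2])
print(f"C = {C:.4f} bits/use, R(eps) = {R(eps):.4f} bits/letter; equality forces D_min = eps")

# Coding-free scheme: feed the source straight into the channel, use the output as the estimate.
rng = np.random.default_rng(0)
n = 200_000
U = rng.integers(0, 2, n)                         # Bernoulli-1/2 source letters
noise = (rng.random(n) < eps).astype(int)         # BSC crossover events
V = U ^ noise                                     # channel output doubles as the reconstruction
print(f"simulated error frequency = {np.mean(U != V):.4f}  (theory: {eps})")
```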

Example 2. The channel in this example is a time-discrete, average-power-constrained additive white Gaussian noise (AWGN) channel. Specifically, its $k$th output $Y_k$ equals $X_k + N_k$, where $X_k$ is the $k$th input and the additive noises $N_k$ are i.i.d. $\mathcal{N}(0, N)$ for $k = 1, 2, \dots$. Also, the average signaling power cannot exceed $S$, which we express mathematically by the requirement
$$\lim_{n\to\infty} E\, n^{-1} \sum_{k=1}^{n} X_k^2 \le S.$$
Shannon's well-known formula for this channel's capacity is
$$C = \tfrac{1}{2} \log_2\!\left(1 + \tfrac{S}{N}\right) \text{ bits/channel use}.$$
The source in this example produces i.i.d. $\mathcal{N}(0, \sigma^2)$ symbols $\{U_k\}$. The squared error distortion measure, $d(u,v) = (v-u)^2$, is employed, so the end-to-end accuracy is the mean-squared error,
$$\mathrm{MSE} = \lim_{n\to\infty} E\, n^{-1} \sum_k (V_k - U_k)^2.$$
Shannon's celebrated formula for the rate-distortion function of this source and distortion measure combination is
$$R(D) = (1/2) \log_2(\sigma^2/D), \quad 0 \le D \le \sigma^2.$$
The minimum achievable value of the MSE is, as usual, the value of $D$ that satisfies $R(D) = C$, which in this example is
$$D = \frac{\sigma^2}{1 + S/N}.$$

As in Example 1, we find that this minimum value of $D$ is trivially attainable without any source or channel coding and with zero delay. However, in this instance the source symbols must be scaled by $\alpha := \sqrt{S}/\sigma$ before being put into the channel in order to ensure compliance with the power constraint. Similarly, $V_k$ is produced by multiplying $Y_k$ by the constant $\beta := \sqrt{S}\,\sigma/(S+N)$, since this produces the minimum MSE estimate of $U_k$ based on the channel output. Hence, the channel is instantaneously matched to the source via the deterministic transformations $r(x|u) = \delta(x - \alpha u)$ and $w(v|y) = \delta(v - \beta y)$. Moreover, the source is matched to the channel in that, once scaled by $\alpha$, it becomes the channel input which, among all those that comply with the power constraint, maximizes mutual information between itself and the channel output that it elicits. Thus, the scaled source drives the constrained channel at its capacity.
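A similar numerical check for Example 2 follows (again an added illustration, with arbitrary values of $S$, $N$ and $\sigma$): scale the Gaussian source by $\alpha$, pass it through the AWGN channel, scale the output by $\beta$, and compare the resulting mean-squared error with $\sigma^2/(1+S/N)$, the distortion at which $R(D) = C$.

```python
import numpy as np

rng = np.random.default_rng(1)
S, N, sigma = 4.0, 1.0, 2.0              # illustrative power budget, noise variance, source std dev
n = 500_000

U = rng.normal(0.0, sigma, n)            # i.i.d. N(0, sigma^2) source
alpha = np.sqrt(S) / sigma               # input scaling: makes E[X^2] = S
beta = np.sqrt(S) * sigma / (S + N)      # output scaling: MMSE estimator coefficient

X = alpha * U                            # channel input (capacity-achieving Gaussian)
Y = X + rng.normal(0.0, np.sqrt(N), n)   # AWGN channel
V = beta * Y                             # reconstruction -- no coding, zero delay

mse = np.mean((V - U) ** 2)
D_opt = sigma**2 / (1 + S / N)           # distortion at which R(D) = C
C = 0.5 * np.log2(1 + S / N)
R_at_D = 0.5 * np.log2(sigma**2 / D_opt)
print(f"C = {C:.4f} bits/use, R(D_opt) = {R_at_D:.4f} bits/letter")
print(f"simulated MSE = {mse:.4f}, theoretical minimum = {D_opt:.4f}")
```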

It can be argued validly that, notwithstanding the fact that Examples 1 and 2 deal with source models, channel models, and distortion measures all dear to information theorists, these examples are exceptional cases. Indeed, if one were to modify $\{p(u)\}$ or $\{p(y|x)\}$ or $\{d(u,v)\}$ even slightly, there no longer would be an absolutely optimum system that is both coding-free and delay-free. Achieving optimal performance would then require the use of coding schemes whose complexity and delay diverge as their end-to-end performance approaches the minimum possible average distortion attainable between the source and an approximation of it based on information delivered via the channel. However, if the perturbations to the source, to the channel and/or to the distortion measure were minor, then an instantaneous system would exist that is only mildly suboptimum. Because of its simplicity and relatively low operating costs, this mildly suboptimum scheme likely would be deemed preferable in practice to a highly complicated system that is truly optimum in the pure information theory sense.

I have argued above for why it is reasonable to expect biological channels to have evolved so as to be matched to the sources they monitor. I further believe that, as in Examples 1 and 2, the data selected from a biological source to be conveyed through a biological channel will drive that channel at a rate effectively equal to its resource-constrained capacity. That is, I postulate that double matching of channel to source and of source to channel in a manner analogous to that of Examples 1 and 2 is the rule rather than the exception in the information theory of living systems. Indeed, suppose selected stimuli were to be conditioned for transmission across one of an organism's internal channels in such a way that information failed to be conveyed at a rate nearly equal to the channel's capacity calculated for the level of resources being expended. This would make it possible to select additional data and then properly condition and transmit both it and the original data through the channel in a manner that does not increase the resources consumed. To fail to use such an alternative input would be wasteful either of information or of energy, since energy usually is the constrained resource in question. As explained previously, a fundamental characteristic of an efficient organism is that it always should be optimally trading information for energy, or vice versa, as circumstances dictate. The only way to assure that pertinent information will be garnered at low latency at the maximum rate per unit of power expended is not only to match the channel to the source but also to match the source to the channel.

6 Bit Rate and Thermodynamic Efficiency

We shall now discuss how increasing the number of bits handled per second unavoidably increases the number of joules expended per bit (i.e., decreases thermodynamic efficiency). To establish this in full generality requires penetrating deeply into thermodynamics and statistical mechanics. We shall instead content ourselves with studying the energy-information tradeoff implicit in Shannon's celebrated formula for the capacity of an average-power-constrained bandlimited AWGN channel, namely
$$C(S) = W \log\!\left(1 + \frac{S}{N_0 W}\right),$$
where $S$ is the constrained signaling power, $W$ is the bandwidth in positive frequencies, and $N_0$ is the one-sided power spectral density of the additive white Gaussian noise. Like all capacity-cost functions, $C(S)$ is concave in $S$. Hence, its slope decreases as $S$ increases; specifically, $C'(S) = W/(S + N_0 W)$. The slope of $C(S)$ has the dimensions of capacity per unit of power, which is to say (bits/second)/(joules/second) = bits/joule. Since the thermodynamic efficiency of the information-energy tradeoff is measured in bits/joule, it decreases steadily as the power level $S$ and the bit rate $C(S)$ increase. This militates in favor of gathering information slowly in any application not characterized by a stringent latency demand. To be sure, there are circumstances in which an organism needs to gather and process information rapidly and therefore does so. However, energy conservation dictates that information handling always should be conducted at as leisurely a pace as the application will tolerate. For example, recent experiments have shown that within the neocortex a neural region sometimes transfers information at a high rate and accordingly expends energy liberally, while at other times it conveys information at a relatively low rate and thereby expends less than proportionately much energy. In both of these modes, and others in between, our hypothesis is that these coalitions of neurons operate in an information-theoretically optimum manner. We shall attempt to describe below how this is accomplished.
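The declining bits-per-joule efficiency can be tabulated directly from the two formulas above. The sketch below is an added illustration (the values of $W$ and $N_0$ are placeholders chosen only for demonstration); it evaluates $C(S)$ and its slope at a few power levels, expressing both in bits.

```python
import numpy as np

W = 1.0e3       # bandwidth in Hz (illustrative value)
N0 = 1.0e-6     # one-sided noise spectral density in W/Hz (illustrative value)

for S in [1e-4, 1e-3, 1e-2, 1e-1]:               # signaling power in watts
    C = W * np.log2(1 + S / (N0 * W))             # capacity, bits/second
    slope = W / ((S + N0 * W) * np.log(2))        # dC/dS, bits/joule
    print(f"S = {S:6.4f} W:  C = {C:10.1f} bit/s,  C'(S) = {slope:12.1f} bit/joule")
```

As the loop's output shows, each tenfold increase in power buys progressively fewer additional bits per second while the bits gained per joule spent fall steadily, which is the quantitative content of the "gather information slowly when latency permits" argument.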

7 Feedforward and Feedback: Bottom-Up and Top-Down

Before turning in earnest to information handling by neural regions, we first need to generalize and further explicate the phenomenon of double matching of sources and channels. So far, we have discussed this only in the context of sources and channels that are memoryless. We could extend to sources and/or channels with memory via the usual procedure of blocking successive symbols into a "supersymbol" and treating long supersymbols as nearly i.i.d., but this would increase the latency by a factor equal to the number of symbols per supersymbol, thereby defeating one of the principal advantages of double matching. We suggest an alternative approach below which leads to limiting the memory of many crucial processes to at most first-order Markovianness.

It has long been appreciated that neuromuscular systems and metabolic regulatory mechanisms exhibit masterful use of feedback. Physiological measurements of the past fifteen or so years have incontrovertibly established that the same is true of neurosensory systems. Describing signaling paths in the primate visual cortex, for example, Woods and Krantz [8] tell us that "In addition to all the connections from V1 and V2 to V3, V4 and V5, each of these regions connects back to V1 and V2. These seemingly backward or reentrant connections are not well understood. ... Information, instead of flowing in one direction, flows in both directions. Thus, later levels do not simply receive information and send it forward, but are in an intimate two-way communication with other modules." Of course, it is not that information flowed unidirectionally in the visual system until some time in the 1980's and then began to flow bidirectionally. Rather, as is so often the case in science, measurements made possible by new instrumentation and methodologies have demanded that certain cherished paradigms be seriously revised. In this case, those mistaken paradigms espoused so-called "bottom-up" unidirectionality of signaling pathways in the human visual system (HVS) [6] [5].

Instead of speaking about feedforward and feedback signaling, neuroscientists refer to bottom-up and top-down signaling, respectively. Roughly speaking, neurons whose axons carry signals principally in a direction that moves from the sensory organs toward the "top brain" are called bottom-up neurons, while those whose axons propagate signals from the top brain back toward the sensory organs are called top-down neurons. Recent measurements have revealed that there are roughly as many top-down neurons in the HVS as there are bottom-up neurons. Indeed, nested feedback loops operate at the local, regional and global levels. [Footnote 2] We shall see that a theory of sensory perception which embraces rather than eschews feedback reaps rewards in the form of analytical results that are both simpler to obtain and more powerful in scope.

Footnote 2: Shannon's topic for the inaugural Shannon Lecture in June 1973 was Feedback. Biological considerations, one of his many interests [3], may have in part motivated this choice.

8 Neurons and Coalitions

The roughly $10^{10}$ neurons in the human visual system (HVS) constitute circa one-tenth of all the neurons in the brain. HVS neurons are directly interconnected with one another via an average of $10^4$ synapses per neuron. That is, a typical HVS neuron has on its dendritic tree about $10^4$ synapses, at each of which it taps off the spike signal propagating along the axon of one of the roughly $10^4$ other HVS neurons that are afferent (i.e., incoming) to it. Via processing of this multidimensional input in a manner to be discussed below, it generates an efferent (i.e., outgoing) spike train on its own axon which propagates to the circa $10^4$ other HVS neurons with which it is in direct connection. The $10^{10} \times 10^{10}$ matrix whose $(i,j)$ entry is 1 if neuron $i$ is afferent to neuron $j$ and 0 otherwise thus has a 1's density of $10^{-6}$. However, there exist special subsets of the HVS neurons that are connected much more densely than this. These special subsets, among which are the ones referred to as V1, V2, ..., V5 in the above quote, consist of a few million to as many as a few tens of millions of neurons and have connectivity submatrices whose 1's densities range from 0.1 to as much as 0.5.

Common sense suggests and experiments have verified that the neurons comprising such a subset work together to effect one or more crucial functions in the processing of visual signals. We shall henceforth refer to such subsets of neurons as "coalitions". Alternative names for them include neural "regions", "groupings", "contingents", and "assemblies".

Figure 1 shows a schematic representation of a neural coalition. Whereas real neural spike trains occur in continuous time and are asynchronous, Figure 1 is a time-discrete model. Its time step is circa 2.5 ms, which corresponds to the minimal duration between successive instants at which a neuron can generate a spike; spikes are also known as action potentials. The time-discrete model usually captures the essence of the coalition's operation as regards information transfer. Any spike traveling along an axon afferent to a coalition of neurons in the visual cortex will reach all the members of that coalition within the same time step. That is, although the leading edge of the spike arrives serially at the synapses to which it is afferent, the result is, as it were, multicasted to them all during a single time slot of the time-discrete model. [Footnote 3]

The input to the coalition in Figure 1 at time $k$ is a random binary vector $X(k)$ which possesses millions of components. Its $i$th component, $X_i(k)$, is 1 if a spike arrives on the $i$th axon afferent to the coalition during the $k$th time step and is 0 otherwise. [Footnote 4] The afferent neurons in Figure 1 have been divided into two groups indexed by BU and TD, standing respectively for bottom-up and top-down. The vertical lines represent the neurons of the coalition. The presence (absence) of a dark dot where the $i$th horizontal line and the $m$th vertical line cross indicates that the $i$th afferent axon forms (does not form) a synapse with the $m$th neuron of the coalition. The strength, or weight, of synapse $(i,m)$ will be denoted by $W_{im}$; if afferent axon $i$ does not form a synapse with coalition neuron $m$, then $W_{im} = 0$. If $W_{im} > 0$, the connection $(i,m)$ is said to be excitatory; if $W_{im} < 0$, the connection is inhibitory. In primate visual cortex about five-sixths of the connections are excitatory.

The so-called post-synaptic potential (PSP) of neuron $m$ is built up during time step $k$ as a weighted linear combination of all the signals that arrive at its synapses during this time step. If this sum exceeds the threshold $T_m(k)$ of neuron $m$ at time $k$, then neuron $m$ produces a spike at time $k$ and we write $Y_m(k) = 1$; if not, then $Y_m(k) = 0$. The thresholds do not vary much with $m$ and $k$, with one exception. If $Y_m(k) = 1$, a refractory period of duration about equal to a typical spike width follows, during which the threshold is extremely high, making it virtually impossible for the neuron to spike. In real neurons, the PSP is reset to its rest voltage after a neuron spikes. One shortcoming of our time-discrete model is that it assumes that a neuron's PSP is reset between the end of one time step and the beginning of the next even if the neuron did not fire a spike. In reality, if the peak PSP during the previous time step did not exceed threshold and hence no action potential was produced, contributions to this PSP will partially carry over into the next time step. Because of capacitative leakage they generally will have decayed to one-third or less of the peak value they had a time step earlier, but they will not have vanished entirely.

Footnote 3: The time-discrete model mirrors reality with insufficient accuracy in certain saturated or near-saturated conditions characterized by many of the neurons in a coalition spiking almost as fast as they can. Such instances, which occur rarely in the visual system but relatively frequently in the auditory system, are characterized by successive spikes on an axon being separated by short, nearly uniform durations whose sample standard deviation is less than 1 millisecond. Mathematical methods based on Poisson limit theorems (a.k.a. mean field approximations) and PDE's can be used to distinguish and quantify these exceptional conditions [25] [26].

Footnote 4: Most neuroscientists today agree that the detailed shape of the action potential spike as a function of time is inconsequential for information transmission purposes, all that matters being whether there is or is not a spike in the time slot in question.

Every time-discrete system diagram must include a unit delay element in order to allow time to advance. In Figure 1 unit delays occur in the boxes so marked. Note, therefore, that the random binary vector $Y(k-1)$ of spikes and non-spikes produced by the coalition's neurons during time step $k-1$ gets fed back to the coalition's input during time step $k$. This reflects the high interconnection density of the neurons in question that is responsible for why they constitute a coalition. Also note that, after the spike trains on the axons of each neuron in the coalition are delivered to synapses on the dendrites of selected members of the coalition itself, they then proceed in either a top-down or a bottom-up direction toward other HVS coalitions. In the case of a densely connected coalition, about half of its efferent axons' connections are local ones with other neurons in the coalition. On average, about one quarter of its connections provide feedback, mostly to the coalition immediately below in the HVS hierarchy although some are directed farther down. The remaining quarter are fed forward to coalitions higher in the hierarchy, again mainly to the coalition directly above. This elucidates why we distinguished the external input $X$ to the coalition as being comprised of both a bottom-up (BU) and a top-down (TD) subset; these subsets come, respectively, mainly from the coalitions directly below and directly above.

Neuroscientists refer not only to bottom-up and top-down connections but also to horizontal connections [15], [16], [17]. Translated into feedforward-feedback terminology, horizontal connections are local feedback such as $Y(k-1)$ in Figure 1, top-down connections are regional feedback, and bottom-up connections are regional feedforward. Bottom-up and top-down signals also can be considered to constitute instances of what information theorists call side information. [Footnote 5]

In information theory parlance, the neural coalition of Figure 1 is a time-discrete, finite-state channel whose state is the previous channel output vector. At the channel input appear both the regional feedforward signal $\{X_{BU}(k)\}$ and the regional feedback signal $\{X_{TD}(k)\}$. However, there is no channel encoder in the information-theoretic sense of that term that is able to operate on these two signals in whatever manner suits its fancy in order to generate the channel input. Rather, the composite binary vector process $\{X(k)\} := \{(X_{BU}(k), X_{TD}(k))\}$ simply enters the channel by virtue of the axons carrying it being afferent to the synapses of the channel's neurons. We shall subsequently see that there is no decoder in the information-theoretic sense either; as foretold, the system is coding-free.

My use of the adjective "coding-free" is likely to rile both information theorists and neuroscientists - information theorists because they are deeply enamored of coding and neuroscientists because they are accustomed to thinking about how an organism's neural spike trains serve as coded representations of aspects of its environment. In hopes of not losing both halves of my audience at once, allow me to elaborate. Certainly, sensory neurons' spike trains constitute an encoding of environmental data sources. However, unless they explicitly say otherwise, information theorists referring to coding usually mean channel coding rather than source coding. Channel coding consists of the intentional insertion of cleverly selected redundant parity check symbols into a data stream in order to provide error detection and error correction capabilities for applications involving noisy storage or transmission of data. I do not believe that the brain employs error control coding (ECC). [Footnote 6]

Footnote 5: In [2] Shannon wrote, "Channels with feedback from the receiving to the transmitting point are a special case of a situation in which there is additional information available at the transmitter which may be used as an aid in the forward transmission system."

Footnote 6: There is a possibility that the brain employs a form of space-time coding, with the emphasis heavily on space as opposed to time. Here, the space dimension means the neurons themselves, the cardinality of which dwarfs that of the paucity of antennas that comprise the space dimension of the space-time codes currently under development for wireless communications. Think of it this way. In order to evolve more capable sensory systems, organisms needed to expand the temporal and/or the spatial dimensionality of their processing. Time expansion was not viable because the need to respond to certain stimuli in only a few tens of milliseconds precluded employing temporal ECC techniques of any power, because these require long block lengths or long convolutional constraint lengths which impose unacceptably long latency. Pulse widths conceivably could have been narrowed (i.e., bandwidths increased), but the width of a neural pulse appears to have held steady at circa 2 ms over all species over hundreds of millions of years, no doubt for a variety of compelling reasons. (Certain owls' auditory systems have spikes only about 1 ms wide, but we are not looking for factor of 2 explanations here.) The obvious solution was to expand the spatial dimension. Organisms have done precisely that, relying on parallel processing by more and more neurons in order to progress up the phylogenic tree. If there is any ECC coding done by neurons, it likely is done spatially over the huge numbers of neurons involved. Indeed, strong correlations have been observed in the spiking behaviors of neighboring neurons, but these may simply be consequences of the need to obtain high resolution of certain environmental stimuli that are themselves inherently correlated and/or the need to direct certain spike trains to more locations, or more widely dispersed locations, than it is practical for a single neuron to visit. There is not yet any solid evidence that neurons implement ECC. Similarly, although outside the domain of this paper, we remark that there is not yet any concrete evidence that redundancies in the genetic code play an ECC role; if it turns out they do, said ECC capability clearly also will be predominately spatial as opposed to temporal in nature.

9 Mathematical Model of a Neural Coalition

It's time for some mathematical information theory. Let's see what has to happen in order for an organism to make optimum use of the channel in Figure 1, i.e., to transmit information through it at a rate equal to its capacity. (Of course, we are not interested just in sending any old information through the channel - we want to send the "right" information through the channel, but we shall temporarily ignore this requirement.) A channel's capacity depends on the level of resources expended. What resources, if any, are being depleted in the course of operating the channel of Figure 1? The answer lies in the biology. Obviously, energy is consumed every time one of the channel's neurons generates an action potential, or spike. [Footnote 7] It is also true that when a spike arrives at a synapse located on a dendrite of one of the channel's neurons, energy usually is expended in order to convert it into a contribution to the post-synaptic potential. [Footnote 8] This is effected via a sequence of electrochemical processes the end result of which is that vesicles containing neurotransmitter chemicals empty them for transportation across the synaptic cleft. This, in turn, either increases or decreases the post-synaptic potential (equivalently, the post-synaptic current), according as the synapse is an excitatory or an inhibitory one. The expected energy dissipated in the synapses in Figure 1 at time $k$ therefore depends on $X(k)$ and $Y(k-1)$, while that dissipated in the axons depends on $Y(k)$. The average energy dissipated in the coalition at time $k$ therefore is the expected value of one function of $(X(k), Y(k-1))$ and another function of $Y(k)$, with the effect of quantal synaptic failures (see the footnote concerning QSF) usually well approximated by multiplying the first of these two functions by $s$. For purposes of the theorem we are about to present, it suffices to make a less restrictive assumption: that the average resources expended at time $k$ are the expected value of some function solely of $(X(k), Y(k-1), Y(k))$. We may impose either a schedule of expected resource depletion constraints as a function of $k$ or simply constrain the sum of the $k$th expected resource depletion over some appropriate range of the discrete time index $k$. Average energy expenditure, which we believe to be the dominant operative constraint in practice, is an important special case of this general family of resource constraint functions.

Footnote 7: Actually, the energy gets consumed principally during the process of re-setting chemical concentrations in and around the neuron after each time it spikes, so as to prepare it to fire again should sufficient excitation arrive. The actual transmission of a spike is more a matter of energy conversion than of energy dissipation.

Footnote 8: Spikes arriving at synapses often are ignored, a phenomenon known as quantal synaptic failure (QSF). Its name notwithstanding, QSF actually is one of natural selection's finer triumphs, enhancing the performance of neural coalitions in several ingenious respects, the details of which can be found in the works of Levy and Baxter [13] [14]. Let $S_{im}(k)$ be a binary random variable that equals 1 if QSF does not occur at synapse $(i,m)$ at time $k$ and equals 0 if it does; that is, $S_{im}(k) = 1$ denotes a quantal synaptic success at synapse $(i,m)$ at time $k$. Often, the $S_{im}(k)$'s can be well modeled as Bernoulli-$s$ random variables, i.e., as being i.i.d. over $i$, $m$ and $k$ with common distribution $P(S = 1) = 1 - P(S = 0) = s$; in practice, $s \in [0.25, 0.9]$. The phenomenon of QSF then may be modeled by multiplying the spike, if any, afferent to synapse $(i,m)$ at time $k$ by $S_{im}(k)$. This is seen to be equivalent to installing what information theorists call a Z-channel [12] at every synapse. Were it not for QSF, the coalition channel would be effectively deterministic when viewed as an operator that transforms $\{X\}$ into $\{Y\}$, since the signal-to-noise ratio on neural channels usually is quite strong. However, if the channel is viewed as an operator only from $\{X_{BU}\}$ to $\{Y\}$, with $\{X_{TD}\}$ considered to be random side information, QSF may no longer be its dominant source of randomness.

Let $\mathrm{PSP}_m(k)$ denote the post-synaptic potential of the $m$th neuron in the coalition at time $k$. Then the output $Y_m(k)$ of this neuron, considered to be 1 if there is a spike at time $k$ and 0 if there isn't, is given by
$$Y_m(k) = U\big(\mathrm{PSP}_m(k) - T\big),$$
where $U(\cdot)$ is the unit step function. The above discussion of Figure 1 lets us write
$$\mathrm{PSP}_m(k) = \sum_l X_l(k)\, W_{lm}\, Q_{lm}(k)\, S_{lm}(k) + \sum_i Y_i(k-1)\, W_{im}\, Q_{im}(k)\, S_{im}(k),$$
where $W_{im}$ is the signed weight of synapse $(i,m)$, $Q_{im}(k)$ is the random size of the quantity of neurotransmitter that will traverse the synaptic cleft in response to an afferent spike at synapse $(i,m)$ at time $k$ if quantal synaptic failure does not occur there then, and $S_{im}(k)$ equals 0 or 1, respectively, in accordance with whether said quantal synaptic failure does or does not occur. The spiking threshold $T = T_m(k)$ of neuron $m$ at time $k$ varies with $m$ and $k$, though usually only mildly.

Note that this channel model is such that successive channel output vectors $Y(k)$ are generated independently, conditional on their corresponding input vectors $X(k)$ and local feedback vectors $Y(k-1)$; that is,
$$p(y_1^n | x_1^n, y_0) = \prod_{k=1}^{n} p(y_k | x_k, y_{k-1}). \qquad (1)$$
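The following is a minimal simulation sketch of the update rule just stated. It is not from the lecture, and the coalition size, synaptic weights, quantal sizes, success probability $s$ and threshold $T$ are all arbitrary illustrative choices (the refractory mechanism is omitted). Each time step forms every neuron's PSP from the afferent vector $X(k)$ and the fed-back previous output $Y(k-1)$, applies quantal synaptic failures, and thresholds.

```python
import numpy as np

rng = np.random.default_rng(2)

n_aff, n_coal = 200, 50            # afferent axons and coalition neurons (illustrative sizes)
s = 0.6                            # probability of quantal synaptic success
T = 3.0                            # spiking threshold, taken constant here

def random_weights(rows, cols, density=0.3):
    """Signed synaptic weights; zero entries mean 'no synapse'.  Roughly 5/6 excitatory."""
    W = rng.normal(1.0, 0.3, (rows, cols)) * np.where(rng.random((rows, cols)) < 5/6, 1, -1)
    W[rng.random((rows, cols)) > density] = 0.0
    return W

W_in = random_weights(n_aff, n_coal)    # afferent-to-coalition weights W_{lm}
W_fb = random_weights(n_coal, n_coal)   # coalition-to-coalition (local feedback) weights W_{im}

def step(x, y_prev):
    """One time step: PSP_m(k) = sum_l X_l W_lm Q_lm S_lm + sum_i Y_i(k-1) W_im Q_im S_im."""
    Q_in = rng.gamma(4.0, 0.25, W_in.shape)          # random quantal sizes Q_{lm}(k)
    Q_fb = rng.gamma(4.0, 0.25, W_fb.shape)
    S_in = rng.random(W_in.shape) < s                # quantal synaptic successes S_{lm}(k)
    S_fb = rng.random(W_fb.shape) < s
    psp = x @ (W_in * Q_in * S_in) + y_prev @ (W_fb * Q_fb * S_fb)
    return (psp > T).astype(int)                     # Y_m(k) = U(PSP_m(k) - T)

y = np.zeros(n_coal, dtype=int)                      # initial state Y(0)
for k in range(5):
    x = rng.integers(0, 2, n_aff)                    # afferent spikes X(k) (placeholder input)
    y = step(x, y)
    print(f"time step {k + 1}: {y.sum()} of {n_coal} coalition neurons spiked")
```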

As information theorists, one of our inclinations would be to investigate conditions sufficient to ensure that such a finite-state channel model with feedback has a Shannon capacity. That is, we might seek conditions under which the maximum over channel input processes $\{X(k)\}$ of the mutual information rate between said input and the output process $\{Y(k)\}$ it generates, subject to whatever constraints are imposed on the input and/or the output, equals the maximum number of bits per channel use at which information actually can be sent reliably over the constrained channel. Instead, we shall focus on the mutual-information-rate-maximizing constrained input process and the structure of the joint (input, output) stochastic process it produces.

Temporarily assume that there is a genuine encoder at the channel input which, for purposes of generating input $x_k$ at time $k$, remembers all the past inputs $x_1^{k-1}$ and all the past local feedback (i.e., past output) values $y_0^{k-1}$; here, $y_0$ represents the "initial" state. Obviously, the maximum mutual information rate achievable under these circumstances is an upper bound to that which could be achieved when only $y_{k-1}$ is available at the channel input at time $k$, with $x_1^{k-1}$ and $y_0^{k-2}$ by then no longer being available there. We will show that this upper bound can be met even when access is denied to said past input and local feedback vectors. This, in turn, helps demystify how a real neural network structured as in Figure 1 can be information-theoretically optimum despite its not possessing a classical encoder at the network's input.

The postulated full-memory encoder can generate any input process whose probabilistic structure is described by a member of the set $\mathcal{P}(\mathcal{X}_1^n)$ of probability distributions of the form
$$\prod_{k=1}^{n} p(x_k | x_1^{k-1}, y_0^{k-1}). \qquad (2)$$
Now define another set of input distributions on $\mathcal{X}_1^n$, $\mathcal{P}^*(\mathcal{X}_1^n)$, having all probability mass functions of the form
$$\prod_{k=1}^{n} p(x_k | y_{k-1}). \qquad (3)$$
Compared with (2), this set contains only those input distributions for which, given $Y_{k-1}$, $X_k$ becomes conditionally independent of all the previous inputs $X_1^{k-1}$ and all the previous outputs $Y_0^{k-2}$.

10 Statement and Proof of Main Theorem

Our main result is stated as the following theorem.

Theorem 1. The maximum mutual information rate between the channel's input and output processes is attained inside $\mathcal{P}^*(\mathcal{X}_1^n)$, uniformly in the initial conditions $Y_0$. Moreover, if we restrict the distribution of the inputs $X_1^n$ to $\mathcal{P}^*(\mathcal{X}_1^n)$, let $Y_1^n$ denote the corresponding output, and let $Y_0$ denote the initial channel state, then we have

1. $\{Y_k,\ k = 0, 1, \dots, n\}$ is a first-order Markov chain,

2. $\{(X_k, Y_k),\ k = 1, 2, \dots, n\}$ also is a first-order Markov chain.

Remarks: (i) $\{X_k\}$ is not necessarily a first-order Markov chain, though we have not yet endeavored to construct an example in which it fails to be. Since $\{X_k\}$ depends, in part, on bottom-up information derived from the environment, it is unrealistic to expect it to exhibit Markovianness, especially at precisely the time step duration of the model. (ii) The theorem's Markovian results help explain how many neural regions can be hierarchically stacked, as is the case in the human visual system, without necessarily engendering unacceptably large response times. (iii) The theorem reinforces a view of sensory brain function as a procedure for recursively estimating quantities of interest in a manner that becomes increasingly informed and accurate.

Proof of Theorem 1 [Footnote 9]

We suppress underlining of vectors and abuse notation by writing $X_1^n \in \mathcal{P}(\mathcal{X}_1^n)$ to indicate that $X_1^n$ is distributed according to some distribution in $\mathcal{P}(\mathcal{X}_1^n)$. Furthermore, the expression $(X_1^n, Y_1^n) \in \mathcal{P}(\mathcal{X}_1^n, Y_0)$ means that $X_1^n \in \mathcal{P}(\mathcal{X}_1^n)$ and $Y_1^n$ is the output that corresponds to input $X_1^n$ and channel initial state $Y_0$. First we establish the Markovianness of the output process.

Footnote 9: This proof is joint work with Yuzheng Ying [7].

Lemma 1. If $(X_1^n, Y_1^n) \in \mathcal{P}^*(\mathcal{X}_1^n, Y_0)$, then $Y_0^n$ is a first-order Markov chain.

Proof: For all $k$ we have
$$p(y_k | y_0^{k-1}) = \sum_{x_k} p(y_k | x_k, y_0^{k-1})\, p(x_k | y_0^{k-1}).$$
Since $X_1^n \in \mathcal{P}^*(\mathcal{X}_1^n)$,
$$p(x_k | y_0^{k-1}) = p(x_k | y_{k-1}).$$
Thus, with reference to the conditional memorylessness of the channel (cf. equation (1)), we have
$$p(y_k | y_0^{k-1}) = \sum_{x_k} p(y_k | x_k, y_{k-1})\, p(x_k | y_{k-1}) = p(y_k | y_{k-1}). \qquad (4)$$

Remark: Depending on the input pmf's $p(x_k | y_{k-1})$, $k = 1, 2, \dots$, $Y_0^n$ can be either a homogeneous or a nonhomogeneous Markov chain. If the pmf $p(x_k | y_{k-1})$ does not vary with $k$, then $Y_0^n$ is homogeneous; otherwise, it is nonhomogeneous.

We next derive an upper bound on the mutual information between any $(X_1^n, Y_1^n) \in \mathcal{P}(\mathcal{X}_1^n, Y_0)$, which is needed for the proof.
$$\begin{aligned}
I(X_1^n; Y_1^n | Y_0) &= H(Y_1^n | Y_0) - H(Y_1^n | X_1^n, Y_0)\\
&\overset{(a)}{=} \sum_{k=1}^{n} H(Y_k | Y_0^{k-1}) - \sum_{k=1}^{n} H(Y_k | X_1^n, Y_0^{k-1})\\
&\overset{(b)}{=} \sum_{k=1}^{n} H(Y_k | Y_0^{k-1}) - \sum_{k=1}^{n} H(Y_k | X_k, Y_{k-1})\\
&\overset{(c)}{\le} \sum_{k=1}^{n} H(Y_k | Y_{k-1}) - \sum_{k=1}^{n} H(Y_k | X_k, Y_{k-1})\\
&= \sum_{k=1}^{n} I(X_k; Y_k | Y_{k-1}), \qquad (5)
\end{aligned}$$
where (a) is the chain rule; (b) follows from the channel property $p(y_k | x_1^n, y_0^{k-1}) = p(y_k | x_k, y_{k-1})$ for all $k$ and all $n > k$; and (c) follows from the fact that increasing conditioning can only reduce entropy. Notice that the inequality in (c) becomes equality when $Y_0^n$ is a Markov chain. Therefore, it follows from Lemma 1 that, for any $(X_1^n, Y_1^n) \in \mathcal{P}^*(\mathcal{X}_1^n, Y_0)$,
$$I(X_1^n; Y_1^n | Y_0) = \sum_{k=1}^{n} I(X_k; Y_k | Y_{k-1}). \qquad (6)$$

We now show that, for any $(X_1^n, Y_1^n) \in \mathcal{P}(\mathcal{X}_1^n, Y_0)$, there is an $(\bar{X}_1^n, \bar{Y}_1^n) \in \mathcal{P}^*(\mathcal{X}_1^n, Y_0)$ such that
$$I(\bar{X}_1^n; \bar{Y}_1^n | Y_0) \ge I(X_1^n; Y_1^n | Y_0). \qquad (7)$$
This assertion says that $I_n(Y_0)$ is attained inside $\mathcal{P}^*(\mathcal{X}_1^n)$. Since $(X_1^n, Y_1^n)$ is a pair of channel (input, output) sequences under the initial state $Y_0$,
$$\begin{aligned}
p_{X_1^n, Y_1^n | Y_0}(x_1^n, y_1^n | y_0) &= \prod_{k=1}^{n} p_{X_k | X_1^{k-1}, Y_0^{k-1}}(x_k | x_1^{k-1}, y_0^{k-1})\, p_{Y_k | X_1^k, Y_0^{k-1}}(y_k | x_1^k, y_0^{k-1})\\
&= \prod_{k=1}^{n} p_{X_k | X_1^{k-1}, Y_0^{k-1}}(x_k | x_1^{k-1}, y_0^{k-1})\, p_{Y_k | X_k, Y_{k-1}}(y_k | x_k, y_{k-1}), \qquad (8)
\end{aligned}$$

where the last equality follows from (1). We now construct an $(\bar{X}_1^n, \bar{Y}_1^n) \in \mathcal{P}^*(\mathcal{X}_1^n, Y_0)$ distributed according to the pmf
$$p_{\bar{X}_1^n, \bar{Y}_1^n | \bar{Y}_0}(x_1^n, y_1^n | y_0) = \prod_{k=1}^{n} p_{\bar{X}_k | \bar{Y}_{k-1}}(x_k | y_{k-1})\, p_{\bar{Y}_k | \bar{X}_k, \bar{Y}_{k-1}}(y_k | x_k, y_{k-1}), \qquad (9)$$
where we set $p_{\bar{X}_k | \bar{Y}_{k-1}}(\cdot|\cdot)$ equal to $p_{X_k | Y_{k-1}}(\cdot|\cdot)$ so that
$$p_{\bar{X}_k | \bar{Y}_{k-1}}(x_k | y_{k-1}) = p_{X_k | Y_{k-1}}(x_k | y_{k-1}), \quad \forall k \ge 1, \qquad (10)$$
and
$$p_{\bar{Y}_k | \bar{X}_k, \bar{Y}_{k-1}}(y_k | x_k, y_{k-1}) = p_{Y_k | X_k, Y_{k-1}}(y_k | x_k, y_{k-1}), \quad \forall k \ge 1. \qquad (11)$$
$\bar{Y}_0$ therein is just an alias for the random variable $Y_0$. Unlike $X_k$ in (8), $\bar{X}_k$ is restricted to be conditionally independent of $(\bar{X}_1^{k-1}, \bar{Y}_0^{k-2})$, given $\bar{Y}_{k-1}$. Thus, $\bar{X}_1^n \in \mathcal{P}^*(\mathcal{X}_1^n)$. Equation (11), together with $\bar{Y}_0 = Y_0$, assures us that $\bar{Y}_1^n$ is indeed the output from our channel in response to input $\bar{X}_1^n$ and initial state $Y_0$; i.e., $(\bar{X}_1^n, \bar{Y}_1^n) \in \mathcal{P}^*(\mathcal{X}_1^n, Y_0)$. It is obvious that the joint pmf of $(\bar{X}_1^n, \bar{Y}_0^n)$ is different from that of $(X_1^n, Y_0^n)$. However, we have the following lemma.

Lemma 2. Assume $(X_1^n, Y_1^n) \in \mathcal{P}(\mathcal{X}_1^n, Y_0)$. Let $(\bar{X}_1^n, \bar{Y}_1^n)$ be defined as in (9) and let $\bar{Y}_0 = Y_0$. Then
$$p_{\bar{Y}_k, \bar{X}_k, \bar{Y}_{k-1}}(y_k, x_k, y_{k-1}) = p_{Y_k, X_k, Y_{k-1}}(y_k, x_k, y_{k-1}), \quad \forall k \ge 1. \qquad (12)$$

Proof: It follows from (10) and (11) that
$$p_{\bar{Y}_k, \bar{X}_k | \bar{Y}_{k-1}}(y_k, x_k | y_{k-1}) = p_{Y_k, X_k | Y_{k-1}}(y_k, x_k | y_{k-1}), \quad \forall k \ge 1. \qquad (13)$$
Since $\bar{Y}_0 = Y_0$, we may write
$$\begin{aligned}
p_{\bar{Y}_1, \bar{X}_1, \bar{Y}_0}(y_1, x_1, y_0) &= p_{\bar{Y}_1, \bar{X}_1 | \bar{Y}_0}(y_1, x_1 | y_0)\, p_{\bar{Y}_0}(y_0)\\
&= p_{Y_1, X_1 | Y_0}(y_1, x_1 | y_0)\, p_{Y_0}(y_0)\\
&= p_{Y_1, X_1, Y_0}(y_1, x_1, y_0). \qquad (14)
\end{aligned}$$
That is, the lemma statement (12) holds for $k = 1$, from which it follows by marginalization that
$$p_{\bar{Y}_1}(y_1) = p_{Y_1}(y_1).$$
The same arguments as in (14) now can be used to verify that (12) holds for $k = 2$. Repeating this argument for $k = 3$, and so on, establishes the desired result for all $k \ge 1$.

Since $(\bar{X}_1^n, \bar{Y}_1^n) \in \mathcal{P}^*(\mathcal{X}_1^n, Y_0)$, we know from (6) that
$$I(\bar{X}_1^n; \bar{Y}_1^n | Y_0) = \sum_{k=1}^{n} I(\bar{X}_k; \bar{Y}_k | \bar{Y}_{k-1}).$$
Next, recall from (5) that $I(X_1^n; Y_1^n | Y_0) \le \sum_{k=1}^{n} I(X_k; Y_k | Y_{k-1})$, and observe from Lemma 2 that
$$I(\bar{X}_k; \bar{Y}_k | \bar{Y}_{k-1}) = I(X_k; Y_k | Y_{k-1}), \quad \forall k \ge 1, \qquad (15)$$

so
$$\sum_{k=1}^{n} I(\bar{X}_k; \bar{Y}_k | \bar{Y}_{k-1}) = \sum_{k=1}^{n} I(X_k; Y_k | Y_{k-1}). \qquad (16)$$
Therefore, $I(\bar{X}_1^n; \bar{Y}_1^n | Y_0) \ge I(X_1^n; Y_1^n | Y_0)$, which is (7).

To show that the joint (input, output) process is Markovian when $X_1^n \in \mathcal{P}^*(\mathcal{X}_1^n)$, we write
$$\begin{aligned}
\Pr(X_k, Y_k | X_1^{k-1}, Y_1^{k-1}) &= \Pr(Y_k | X_k, X_1^{k-1}, Y_1^{k-1})\, \Pr(X_k | X_1^{k-1}, Y_1^{k-1})\\
&\overset{(a)}{=} \Pr(Y_k | X_k, Y_{k-1})\, \Pr(X_k | Y_{k-1})\\
&= \Pr(Y_k | X_k, X_{k-1}, Y_{k-1})\, \Pr(X_k | X_{k-1}, Y_{k-1})\\
&= \Pr(X_k, Y_k | X_{k-1}, Y_{k-1}),
\end{aligned}$$
where (a) follows from (1) and the condition $X_1^n \in \mathcal{P}^*(\mathcal{X}_1^n)$. Theorem 1 is proved.

For a broad class of constraints on the channel input and/or output, $I_n(Y_0)$ still is attained inside $\mathcal{P}^*(\mathcal{X}_1^n)$. Specifically, for all constraints on expected values of functions of triples of the form $(Y_k, X_k, Y_{k-1})$, imposed either as a schedule of such constraints versus $k$ or as sums or arithmetic averages over $k$ of functions of said triples, the constrained value of $I_n(Y_0)$ is attained by an input process whose distribution conforms to (3). To see this, for any $(X_1^n, Y_1^n) \in \mathcal{P}(\mathcal{X}_1^n, Y_0)$ satisfying one or more constraints of this type, we construct $(\bar{X}_1^n, \bar{Y}_1^n)$ as in (9). By Lemma 2, $(X_1^n, Y_1^n)$ and $(\bar{X}_1^n, \bar{Y}_1^n)$ are such that for each fixed $k$, $(\bar{X}_k, \bar{Y}_k, \bar{Y}_{k-1})$ and $(X_k, Y_k, Y_{k-1})$ are identically distributed. This assures us that the expected value of any function of $(\bar{X}_k, \bar{Y}_k, \bar{Y}_{k-1})$ is the same as that for $(X_k, Y_k, Y_{k-1})$, so $(\bar{X}_1^n, \bar{Y}_1^n)$ also is admissible for the same values of the constraints. Energy constraints on the inputs and outputs are a special case. Thus, processes which communicate information among the brain's neurons in an energy-efficient manner will exhibit the Markovian properties cited in Theorem 1, at least to the degree of accuracy to which the model of Figure 1 reflects reality.

11 Review of Cost-Capacity and Rate-Distortion

In preparation for addressing the topic of decoding, or the lack thereof, let us recall Shannon's formulas characterizing the probability distributions that solve the variational problems for calculating capacity-cost functions of channels and rate-distortion functions of sources.

The cost-capacity variational problem is defined as follows. We are given the transition probabilities $\{p(y|x), (x,y) \in \mathcal{X} \times \mathcal{Y}\}$ of a discrete memoryless channel (dmc) and a set of nonnegative numbers $\{c(x) \ge 0, x \in \mathcal{X}\}$, where $c(x)$ is the cost incurred each time the symbol $x$ is inserted into the channel. We seek the probability distribution $\{p(x), x \in \mathcal{X}\}$ that maximizes the mutual information subject to the constraint that the average input cost does not exceed $S$. We denote this maximum by
$$C(S) := \max_{\{p(x)\} \in \mathcal{S}} \sum_x \sum_y p(x)\, p(y|x) \log\!\big(p(y|x)/\tilde{p}(y)\big), \qquad (17)$$
where $\mathcal{S} = \{\{p(x)\} : \sum_x p(x)\, c(x) \le S\}$ and
$$\tilde{p}(y) = \sum_x p(x)\, p(y|x). \qquad (18)$$
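For readers who want to see (17)-(18) in action, the following sketch is an illustration added here (the three-letter channel and cost vector are arbitrary). It runs a Blahut-Arimoto style iteration in which a Lagrange multiplier prices the input cost; each multiplier value yields one point $(S, C(S))$ of the capacity-cost curve.

```python
import numpy as np

def capacity_cost_point(P, c, lam, n_iter=500):
    """Blahut-Arimoto iteration with a Lagrange multiplier `lam` on the input cost.

    P[x, y] : channel transition probabilities p(y|x)
    c[x]    : cost c(x) of using input letter x
    Returns (S, C): the average cost and mutual information (bits) of the optimizing input.
    """
    nx = P.shape[0]
    p = np.full(nx, 1.0 / nx)                     # start from the uniform input distribution
    for _ in range(n_iter):
        q_y = p @ P                               # output distribution ~p(y)
        # exponent: sum_y p(y|x) log q(x|y) - lam*c(x), where q(x|y) is the posterior p(x)p(y|x)/~p(y)
        log_post = np.log(p)[:, None] + np.log(P) - np.log(q_y)[None, :]
        w = np.exp(np.sum(P * log_post, axis=1) - lam * c)
        p = w / w.sum()
    q_y = p @ P
    C = np.sum(p[:, None] * P * np.log2(P / q_y[None, :]))
    S = float(p @ c)
    return S, C

# Illustrative 3-input, 2-output channel and per-letter costs.
P = np.array([[0.95, 0.05],
              [0.50, 0.50],
              [0.05, 0.95]])
c = np.array([1.0, 0.2, 1.0])

for lam in [0.0, 0.5, 2.0, 8.0]:
    S, C = capacity_cost_point(P, c, lam)
    print(f"lambda = {lam:4.1f}:  S = {S:.3f},  C(S) = {C:.4f} bits/use")
```

Sweeping the multiplier from 0 upward traces out the concave curve $C(S)$ whose properties are discussed next.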

$C(S)$ is a concave function that usually satisfies $C(0) = 0$ and $\lim_{S\to\infty} C(S) = C$. The constant $C$, called either the unconstrained capacity or simply the capacity of the channel, is finite if $\mathcal{X}$ and/or $\mathcal{Y}$ have finite cardinality but may be infinite otherwise.

The rate-distortion variational problem is defined as follows. We are given the letter probabilities $\{p(u), u \in \mathcal{U}\}$ of a discrete memoryless source (dms) and a set of nonnegative numbers $\{d(u,v) \ge 0, (u,v) \in \mathcal{U} \times \mathcal{V}\}$. Here, $d(u,v)$ measures the distortion that occurs whenever the dms produces the letter $u \in \mathcal{U}$ and the communication system delivers to a user located at its output the letter $v \in \mathcal{V}$ as its approximation of said $u$. The alphabets $\mathcal{U}$ and $\mathcal{V}$ may or may not be identical. In fact, the appropriate $\mathcal{V}$ and distortion measure vary from user to user. Alternatively, and more apropos of application to a living organism, they vary over the different uses a single organism has for the information. In what follows we therefore speak of (source, use)-pairs instead of the usual terminology of (source, user)-pairs. In rate-distortion theory we seek the transition probability assignment $\{q(v|u), (u,v) \in \mathcal{U} \times \mathcal{V}\}$ that minimizes the average mutual information subject to the constraint that the average distortion does not exceed $D$. We denote this minimum by
$$R(D) := \min_{\{q(v|u)\} \in \mathcal{D}} \sum_u \sum_v p(u)\, q(v|u) \log\!\big(q(v|u)/q(v)\big), \qquad (19)$$
where $\mathcal{D} = \{\{q(v|u)\} : \sum_u \sum_v p(u)\, q(v|u)\, d(u,v) \le D\}$ and
$$q(v) = \sum_u p(u)\, q(v|u). \qquad (20)$$

Viewed as a function of $D$, $R(D)$ is called the rate-distortion function. It is convex on the range $[D_{\min}, D_{\max}]$, where $D_{\min} = \sum_u p(u) \min_v d(u,v)$ and $D_{\max} = \min_v \sum_u p(u)\, d(u,v)$. $R(D) = 0$ for $D \ge D_{\max}$ and is undefined for $D < D_{\min}$. $R(D_{\min})$ equals the source entropy $H = -\sum_u p(u) \log p(u)$ if for each $u \in \mathcal{U}$ there is a unique $v \in \mathcal{V}$, call it $v(u)$, that minimizes $d(u,v)$ and $v(u) \ne v(u')$ if $u \ne u'$; otherwise, $R(D_{\min}) < H$.

In each of these variational problems, Lagrange optimization yields a necessary condition that relates the extremizing distribution to the constraint function. For the cost-capacity problem this condition, displayed as an expression for $c(x)$ in terms of the given channel transition probabilities and the information-maximizing $\{p(x)\}$, reads
$$c(x) = c_1 \sum_y p(y|x) \log\!\big(p(y|x)/\tilde{p}(y)\big) + c_2, \qquad (21)$$
where $\tilde{p}(y)$ is given linearly in terms of said $\{p(x)\}$ by (18). The constant $c_2$ represents a fixed cost per channel use that simply translates the $C(S)$ curve vertically, so no loss in generality results from setting $c_2 = 0$. The constant $c_1$ is interchangeable with the choice of the logarithm base, so we may set $c_1 = 1$, again without any essential loss of generality. It follows that in order for optimality to prevail the cost function must equal the Kullback-Leibler distance [Footnote 10] between the conditional distribution $\{p(y|x), y \in \mathcal{Y}\}$ of the channel's output when the channel input equals letter $x$ and the unconditional output distribution $\{\tilde{p}(y), y \in \mathcal{Y}\}$ obtained by averaging as in (18) over the information-maximizing $\{p(x), x \in \mathcal{X}\}$. Since the expectation over $X$ of the K-L distance between $\{p(y|X)\}$ and $\{\tilde{p}(y)\}$ is the mutual information between the channel's input and output, we see that applying the constraint $\sum_x p(x)\, c(x) \le S$ is equivalent, when the $C(S)$-achieving input distribution is in force, to maximizing the average mutual information subject to the average mutual information not exceeding a specified amount, call it $I$. Obviously, this results in $C(I) = I$, a capacity-cost function that is simply a straight line at $45^\circ$.

Footnote 10: The K-L distance, or relative entropy, of two probability distributions $\{p(w), w \in \mathcal{W}\}$ and $\{q(w), w \in \mathcal{W}\}$ is given by $D(p\|q) := \sum_w p(w) \log(p(w)/q(w))$. It is nonnegative and equals 0 if and only if $q(w) = p(w)$ for all $w \in \mathcal{W}$.

applying the onstraint Px p(x) (x) � S is equivalent, when the C(S)-a hieving inputdistribution is in for e, to maximizing the average mutual information subje t to the averagemutual information not ex eeding a spe i�ed amount, all it I. Obviously, this results inC(I) = I, a apa ity- ost fun tion that is simply a straight line at 45o.This perhaps onfusing state of a�airs requires further explanation, sin e informationtheorists are justi�ably not a ustomed to the apa ity- ost urve being a straight line. 11Sin e studying a well-known example often sheds light on the general ase, let us onsideragain Shannons famous formula C(S) = (1=2) log(1+S=N) for the apa ity- ost fun tion ofa time-dis rete, average-power-limited memoryless AWGN hannel; learly, it is a stri tly on ave fun tion of S. Of ourse, in this formula S is a onstraint on the average power ex-pended to transmit information a ross the hannel, not on the average mutual informationbetween the hannels input and the output. Next, re all that for this AWGN, whose tran-sition probabilities are given by p(yjx) = exp (y � x)2=2N=sqrt2�N , the optimum power- onstrained input distribution is known to be Gausswian, namely p(x) = expx2=2S=p2�S.When D(p(yjx)jj~p(y) is evaluated in the ase of this optimum input, it indeed turns out toindeed be proportional to x2 plus a onstant. Hen e, there is no di�eren e in this examplebetween onsidering the onstraint to be imposed on the expe ted value of X2 or onsid-ering it to be imposed on the expe ted value of D(p(yjX)jjp(y)). The physi al signi� an eof an average power onstraint is evident, but what, if anything, is the physi al meaningof an average K-L distan e onstraint? First observe that, if for some x it were to be the ase that fp(yjx); y 2 Y is the same as p(y); y 2 Y, then one would be utterly unable todistinguish on the basis of the hannel output between transmission and non-transmissionof the symbol x. Little if any resour es would need to be expended to build and operatea hannel in whi h fp(yjx); y 2 Yg does not hange with x, sin e the output of su h a hannel is independent of its input. In order to assure that the onditional distributions onthe hannel output spa e given various input values are well separated from one another,resour es must be expended. We have seen that in the ase of an AWGN the ability toperform su h dis riminations is onstrained by the average transmission power available;in a non-Gaussian world, the physi ally operative quantity to onstrain likely would besomething other than power. In this light I believe (21) is telling us, among others things,that if one is not sure a priori to what use(s) the information onveyed through the hanneloutput will be put, one should adopt the viewpoint that the task is to keep the variousinputs as easy to dis riminate from one another as possible subje t to whatever physi al onstraint(s) are in for e. We adopt this viewpoint below in our treatment of the de odingproblem.For the rate-distortion problem Shannon [4℄ observed that Lagrange minimization overfq(vju)g leads to the ne essary onditionqs(vju) = �s(u)qs(v) exp(sd(u; v)); (22)where s 2 [�1; 0℄ is a parameter that equals the slope R0(Ds) of the rate-distortion fun tionat the point (Ds; R(Ds)) that it generates, and fqs(v); v 2 Vg is an appropriately sele tedprobability distribution over the spa e V of sour e approximations. 
For the rate-distortion problem Shannon [4] observed that Lagrange minimization over \{q(v|u)\} leads to the necessary condition

    q_s(v|u) = \lambda_s(u) q_s(v) \exp( s d(u,v) ),    (22)

where s \in (-\infty, 0] is a parameter that equals the slope R'(D_s) of the rate-distortion function at the point (D_s, R(D_s)) that it generates, and \{q_s(v), v \in V\} is an appropriately selected probability distribution over the space V of source approximations. Since q_s(v|u) must sum to 1 over v for each fixed u \in U, we have

    \lambda_s(u) = [ \sum_v q_s(v) \exp( s d(u,v) ) ]^{-1}.    (23)
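
Footnote 12 below credits Blahut [19] and Rose [20] with recursive algorithms that compute rate-distortion functions numerically. As a rough sketch of how (22) and (23) drive such a recursion — using a toy three-letter source and distortion matrix invented for the illustration, not taken from the lecture — one might write:

```python
import numpy as np

# Toy discrete source and distortion matrix (assumed, for illustration only).
p_u = np.array([0.5, 0.3, 0.2])                  # source letter probabilities p(u)
d = np.array([[0.0, 1.0, 1.0],                   # d(u, v): 3 source letters,
              [1.0, 0.0, 1.0],                   #          3 reproduction letters
              [1.0, 1.0, 0.0]])

def rd_point(s, iters=500):
    """Blahut-style recursion for one slope parameter s < 0, built on (22)-(23)."""
    q_v = np.full(d.shape[1], 1.0 / d.shape[1])  # initial output distribution q_s(v)
    A = np.exp(s * d)                            # exp(s d(u,v))
    for _ in range(iters):
        lam = 1.0 / (A @ q_v)                    # lambda_s(u) from (23)
        q_vu = (lam[:, None] * q_v[None, :]) * A # q_s(v|u) from (22)
        q_v = p_u @ q_vu                         # re-estimate q_s(v)
    D = np.sum(p_u[:, None] * q_vu * d)          # average distortion D_s
    R = np.sum(p_u[:, None] * q_vu *
               np.log(q_vu / q_v[None, :]))      # R(D_s), in nats
    return D, R

for s in (-8.0, -2.0, -0.5):
    print(s, rd_point(s))
```

Sweeping s over negative values traces out (D_s, R(D_s)) pairs along the curve; this sketch makes no claim to the numerical refinements of the published algorithms.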


Except in some special but important examples, given a parameter value s it is difficult to find the optimum \{q_s(v)\}, and hence the optimum \{q_s(v|u)\}, from (22).^{12} We can recast (22) as an expression for d(u,v) in terms of q_s(v|u), namely^{13}

    d(u,v) = (-1/|s|) \log( q_s(v|u) / q_s(v) ) + (1/|s|) \log \lambda_s(u).    (24)

Since v does not appear in the second term on the right-hand side of (24), that term reflects only indirectly on the way in which a system that is end-to-end optimum, in the sense of achieving the point (D_s, R(D_s)) on the rate-distortion function, probabilistically reconstructs the source letter u as the various letters v \in V. Recalling that \log( q_s(v|u) / q_s(v) ) is the mutual information i_s(u,v) between symbols u and v for the end-to-end (i.e., source-to-use) optimum system, we see that the right way to build said system is to make the mutual information between the letters of pairs (u,v) decrease as the distortion between them increases. Averaging over the joint distribution p(u) q_s(v|u) that is optimum for parameter value s affirms the inverse relation between average distortion and average mutual information that pertains in rate-distortion theory. This relationship is analogous to the directly-varying relation between average cost and average mutual information in the channel variational problem.

12 Low-Latency Multiuse Decoding

For purposes of the present exposition, a key consequence of the preceding paragraph is that it helps explain how a channel p(y|x) can be matched to many (source,use)-pairs at once. Specifically, under our definition channel p(y|x) is matched to source p(u) and distortion measure d(u,v) at slope s on their rate-distortion function if, and only if, there exists a pair of conditional probability distributions \{r(x|u)\} and \{w(v|y)\} such that the optimum end-to-end system transition probabilities \{q_s(v|u)\} in the rate-distortion problem can be written in the form

    q_s(v|u) = \sum_x \sum_y r_s(x|u) p(y|x) w_s(v|y).    (25)

It should be clear that (25) often can hold for many (source,use) pairs that are of interest to an organism. In such instances it will be significantly more efficient computationally for the organism to share the p(y|x) part of the construction of the desired transition probability assignments for these (source,use) pairs rather than to have, in effect, to build and then operate in parallel a separate version of it for each of said applications. This will be all the more so the case if it is not known until after p(y|x) has been exercised just which potential uses appear to be intriguing enough to bother computing their w(v|y)-parts and which do not.

I conjecture that the previous paragraph has much to say about why neural coalitions are constructed and interconnected the way they are. Namely, the coalitionwise transition probabilities effected are common to numerous potential applications, only some of which actually get explored.

Footnote 12: Kuhn-Tucker theory tells us that the necessary and sufficient condition for \{q_s(v)\} to generate, via (22), a point on the rate-distortion curve at which the slope is s is c_s(v) := \sum_u \lambda_s(u) p(u) \exp( s d(u,v) ) \le 1 for all v, where \lambda_s(u) is given by equation (23) and equality prevails for every v for which q_s(v) > 0. Recursive algorithms developed by Blahut [19] and by Rose [20] allow rate-distortion functions to be calculated numerically with great accuracy at moderate computational intensity.

Footnote 13: Equations (21) and (24) perhaps first appeared together in the paper by Gastpar et al. [18]. Motivated by exposure to my Examples 1 and 2, they derived conditions for double matching of more general sources and channels, confining attention to the special case of deterministic p(x|u) and p(v|y).
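
As a concrete illustration of (25) — with all distributions invented for the example, not drawn from the lecture — the sketch below composes a source-to-channel map r(x|u), a shared channel p(y|x), and a channel-to-use map w(v|y) into end-to-end transition probabilities. Checking whether that product reproduces the rate-distortion-optimal \{q_s(v|u)\} for a given (source,use)-pair is exactly the matching question (25) poses.

```python
import numpy as np

# Hypothetical conditional distributions; each row is a probability vector.
r_xu = np.array([[0.9, 0.1],        # r(x|u): encoder, source letters u -> inputs x
                 [0.2, 0.8]])
p_yx = np.array([[0.8, 0.1, 0.1],   # p(y|x): the shared channel
                 [0.1, 0.1, 0.8]])
w_vy = np.array([[1.0, 0.0],        # w(v|y): decoder for one particular use
                 [0.5, 0.5],
                 [0.0, 1.0]])

# End-to-end transition probabilities q(v|u) = sum_x sum_y r(x|u) p(y|x) w(v|y),
# i.e., the matrix product appearing in (25).
q_vu = r_xu @ p_yx @ w_vy
print(q_vu)                          # each row sums to 1

# The channel is matched to a given (source,use)-pair at slope s iff q_vu
# coincides with the optimizing q_s(v|u) of that pair's rate-distortion problem;
# a different use needs only its own w(v|y) block, while the p(y|x) part is shared.
```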


The situation is sketched schematically in Figure 2, from which the reader can see how a given neural coalition, viewed as a channel, might both be matched by the source that drives it and at the same time could help match that source to many potential uses via interconnection with other coalitions and subcoalitions.

Whether or not use i, associated with some ith neural subcoalition described by transition probabilities \{w_{s,i}(v_i|y), v_i \in V_i\}, gets actively explored at a given instant depends on what side information, in addition to part of \{Y^k\}, gets presented to it. That side information depends, in turn, on Bayesian-style prior probabilities that are continually being recursively updated as the bottom-up and top-down processing of data from stimuli proceeds.^{14} When said side information is relatively inhibitory rather than excitatory, the subregion does not "ramp up." Then energy is saved but of course less information is conveyed.^{15}

Footnote 14: Recursive estimation is the name of the game in sensory signal processing by neurons. A Kalman formalism [21] is appropriate for this, subject to certain provisos. One is that it seems best to adopt the conditional error entropy minimization criterion [22] [23] [24] as opposed to, say, a minimum MSE criterion; this is in keeping with our view that one should retain an information criterion for as long as possible before specializing to a more physical criterion associated with a particular use. Another is that the full Kalman solution requires inverting matrices of the form I + MM^T in order to update the conditional covariance matrix. Matrix inversion is not believed to be in the repertoire of mathematical operations readily amenable to realization via neurons unless the effective rank of the matrix M is quite low.

Footnote 15: We remark that neurons are remarkably sensitive in this respect. They idle at a mean PSP level that is one or two standard deviations below their spiking threshold. In this mode they spike only occasionally, when the random fluctuations in their effectively Poisson synaptic bombardment happen to bunch together in time to build the PSP up above threshold. However, a small percentage change in the bombardment (e.g., a slightly increased overall intensity of bombardment and/or a shift toward a higher excitatory-to-inhibitory ratio of the synapses being bombarded) can significantly increase the spiking frequency.

Footnote 16: Given the predominantly positive feedback among the members of a coalition, many of its members can be made to ramp up their spiking intensities nearly simultaneously. This helps explain why coalitions of neural cortex exhibit dramatic variations, on a time scale of several tens of milliseconds, in their rates of spiking and hence in their rates of information transmission and of energy depletion.
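
Footnotes 15 and 16 make a quantitative-sounding claim: an idling neuron whose PSP sits a standard deviation or two below threshold converts a small percentage change in its Poisson synaptic bombardment into a much larger change in spiking frequency. The toy simulation below is a leaky integrate-and-fire caricature with parameters chosen arbitrarily by me (it is not a model from the lecture); with these numbers the idling PSP sits roughly two standard deviations below threshold, and one can watch the sensitivity directly.

```python
import numpy as np

rng = np.random.default_rng(0)

def spike_rate(rate_exc, rate_inh, T=20.0, dt=1e-4):
    """Toy leaky integrate-and-fire neuron driven by Poisson bombardment.

    rate_exc, rate_inh: mean arrival rates (events/s) of excitatory and
    inhibitory synaptic events; each event bumps the PSP by +w or -w.
    Returns the resulting spike frequency in spikes/s over T seconds.
    """
    tau, w, threshold = 0.02, 0.5, 8.0            # membrane time constant, PSP bump, threshold
    steps = int(T / dt)
    exc = rng.poisson(rate_exc * dt, size=steps)  # excitatory event counts per time step
    inh = rng.poisson(rate_inh * dt, size=steps)  # inhibitory event counts per time step
    v, spikes = 0.0, 0
    for k in range(steps):
        v += -v * dt / tau + w * (exc[k] - inh[k])
        if v >= threshold:
            spikes += 1
            v = 0.0                               # reset after a spike
    return spikes / T

base = spike_rate(rate_exc=1000.0, rate_inh=600.0)
boosted = spike_rate(rate_exc=1100.0, rate_inh=600.0)   # ~10% more excitation
print(f"baseline ~ {base:.2f} spikes/s, boosted ~ {boosted:.2f} spikes/s")
```
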
ACKNOWLEDGEMENT

The author is indebted to Professors William B. "Chip" Levy and James W. Mandell of the University of Virginia Medical School's neurosurgery and neuropathology departments, respectively, for frequent, lengthy, and far-ranging discussions of neuroscience over the past seven years. In his role as my neuroscience mentor, Chip has graciously enlightened me regarding his own and others' theoretical approaches to computational neuroscience. This presentation is in substantial measure a reflection of Chip Levy's unrelenting pursuit of neuroscientific theory rooted in biological reality. This work was supported in part by NIH Grant RR15205, for which Professor Levy is the Principal Investigator.

References

[1] C. E. Shannon, A Mathematical Theory of Communication. Bell Syst. Tech. J., vol. 27, 379-423 and 623-656, July and October, 1948. (Also in Claude Elwood Shannon: Collected Papers, N. J. A. Sloane and A. D. Wyner, eds., IEEE Press, Piscataway, NJ, 1993, 5-83.)

[2] C. E. Shannon, Channels with Side Information at the Transmitter. IBM J. Research and Development, 2, 289-293, 1958.


[3] C. E. Shannon, An Algebra for Theoretical Genetics, Ph.D. Dissertation, Department of Mathematics, MIT, Cambridge, Massachusetts, April 15, 1940.

[4] C. E. Shannon, Coding Theorems for a Discrete Source with a Fidelity Criterion. IRE Convention Record, Vol. 7, 142-163, 1959. (Also in Information and Decision Processes, R. E. Machol, ed., McGraw-Hill, Inc., New York, 1960, 93-126, and in Claude Elwood Shannon: Collected Papers, N. J. A. Sloane and A. D. Wyner, eds., IEEE Press, Piscataway, NJ, 1993, 325-350.)

[5] H. B. Barlow, Sensory mechanisms, the reduction of redundancy, and intelligence. In: The Mechanization of Thought Processes, National Physical Laboratory Symposium No. 10, Her Majesty's Stationery Office, London, pp. 537-559, 1958.

[6] D. H. Hubel and T. N. Wiesel, Receptive Fields, Binocular Interaction and Functional Architecture in the Cat's Visual Cortex. J. Physiol., 160, 106-154, 1962.

[7] T. Berger and Y. Ying, Characterizing Optimum (Input,Output) Processes for Finite-State Channels with Feedback, submitted to ISIT 2003, Yokohama, Japan, June-July, 2003.

[8] C. B. Woods and J. H. Krantz, Lecture notes at Lemoyne University based on Chapter 9 of Human Sensory Perception: A Window into the Brain, 2001.

[9] T. Berger, Rate Distortion Theory: A Mathematical Basis for Data Compression, Prentice-Hall, Englewood Cliffs, NJ, 1971.

[10] F. Jelinek, Probabilistic Information Theory, McGraw-Hill, New York, 1968.

[11] R. G. Gallager, Information Theory and Reliable Communication, Wiley, New York, 1968.

[12] T. M. Cover and J. A. Thomas, Elements of Information Theory, Wiley, New York, 1991.

[13] W. B. Levy and R. A. Baxter, Energy Efficient Neural Codes. Neural Comp., 8, 531-543, 1996. (Also in: Neural Codes and Distributed Representations, L. Abbott and T. J. Sejnowski, eds., MIT Press, Cambridge, MA, 105-117, 1999.)

[14] W. B. Levy and R. A. Baxter, Energy-Efficient Neuronal Computation Via Quantal Synaptic Failures. J. Neuroscience, 22, 4746-4755, 2002.

[15] H. K. Hartline, H. G. Wagner, and F. Ratliff, Inhibition in the eyes of Limulus. J. Gen. Physiol., 39, 651-673, 1956.

[16] A. H. Burkhalter, Cortical Circuits for Bottom-Up and Top-Down Processing. Presented in Symposium 3, Cortical Feedback in the Visual System, Paper 3.2, Neuroscience 2002 - Society for Neuroscience 32nd Annual Meeting, Orlando, FL, November 2-7, 2002.

[17] J. Bullier, The Role of Feedback Cortical Connections: Spatial and Temporal Aspects. Presented in Symposium 3, Cortical Feedback in the Visual System, Paper 3.3, Neuroscience 2002 - Society for Neuroscience 32nd Annual Meeting, Orlando, FL, November 2-7, 2002.


[18] M. Gastpar, B. Rimoldi and M. Vetterli, To Code or Not to Code. Preprint, EPFL, Lausanne, Switzerland, 2001.

[19] R. E. Blahut, Computation of Channel Capacity and Rate-Distortion Functions. IEEE Trans. Inform. Theory, 18, 460-473, 1972.

[20] K. Rose, A Mapping Approach to Rate-Distortion Computation and Analysis. IEEE Trans. Inform. Theory, 42, 1939-1952, 1996.

[21] T. Kailath, Lectures on Wiener and Kalman Filtering, CISM Courses and Lectures No. 140, Springer-Verlag, Vienna and New York, 1981.

[22] H. L. Weidemann and E. B. Stear, Entropy analysis of parameter estimation. Info. and Control, 14, 493-506, 1969.

[23] N. Minamide and P. N. Nikiforuk, Conditional entropy theorem for recursive parameter estimation and its application to state estimation problems. Int. J. Systems Sci., 24, 53-63, 1993.

[24] M. Janzura and T. Koski, Minimum entropy of error principle in estimation. Info. Sciences, 79, 123-144, 1994.

[25] A. Buonocore, V. Giorno, A. G. Nobile and L. M. Ricciardi, A Neural Modeling Paradigm in the Presence of Refractoriness. BioSystems, 67, 35-43, 2002.

[26] T. Berger, Interspike Interval Analysis via a PDE. Preprint, Electrical and Computer Engineering Department, Cornell University, Ithaca, New York, 2002.
