Self-control with spiking and non-spiking neural networks playing games

Chris Christodoulou a, Gaye Banfield b, Aristodemos Cleanthous a

a Department of Computer Science, University of Cyprus, 75 Kallipoleos Avenue, P.O. Box 20537, 1678 Nicosia, Cyprus
b School of Computer Science and Information Systems, Birkbeck, University of London, Malet Street, London WC1E 7HX, United Kingdom

Journal of Physiology - Paris 104 (2010) 108–117

Keywords: Self-control; Precommitment; Iterated Prisoner's Dilemma; Reinforcement learning; Spiking neural networks

Abstract

Self-control can be defined as choosing a large delayed reward over a small immediate reward, while precommitment is the making of a choice with the specific aim of denying oneself future choices. Humans recognise that they have self-control problems and attempt to overcome them by applying precommitment. Problems in exercising self-control suggest a conflict between cognition and motivation, which has been linked to competition between higher and lower brain functions (representing the frontal lobes and the limbic system respectively). This premise of an internal process conflict led to a behavioural model being proposed, based on which we implemented a computational model for studying and explaining self-control through precommitment behaviour. Our model consists of two neural networks, initially non-spiking and then spiking ones, representing the higher and lower brain systems viewed as cooperating for the benefit of the organism. The non-spiking neural networks are of simple feed forward multilayer type with reinforcement learning, one with the selective bootstrap weight update rule, which is seen as myopic, representing the lower brain, and the other with the temporal difference weight update rule, which is seen as far-sighted, representing the higher brain. The spiking neural networks are implemented with leaky integrate-and-fire neurons with learning based on stochastic synaptic transmission. The differentiating element between the two brain centres in this implementation is based on the memory of past actions determined by an eligibility trace time constant. As the structure of the self-control problem can be likened to the Iterated Prisoner's Dilemma (IPD) game, in that cooperation is to defection what self-control is to impulsiveness or what compromising is to insisting, we implemented the neural networks as two players, learning simultaneously but independently, competing in the IPD game. With a technique resembling the precommitment effect, whereby the payoffs for the dilemma cases in the IPD payoff matrix are differentially biased (increased or decreased), it is shown that increasing the precommitment effect (through increasing the differential bias) increases the probability of cooperating with oneself in the future, irrespective of whether the implementation is with spiking or non-spiking neural networks.

1. Introduction

Self-control arises out of a desire to control one's behaviour. In psychology, to exercise self-control is to inhibit an impulse to engage in behaviour that violates a moral standard (Morgan et al., 1979). Problems in exercising self-control occur when there is a lack of willingness or motivation to carry out this inhibition. This suggests a cognitive versus a motivational conflict. The motivational problem suggested by problems in exercising self-control can be interpreted as: we know what is good for us (cognition), but we do not do it (motivation). The distinction between cognition and motivation has been likened to the distinction between the higher and lower brain functions, representing the frontal lobes and the limbic system respectively (Bjork et al., 2004). This suggests that self-control involves a conflict between cognition and motivation (Rachlin, 1995), a far-sighted planner and a myopic doer (Thaler and Shefrin, 1981), reason and passion, and is not just a case of changing tastes. These two extremes attain different value systems through experience (Scheier and Carver, 1988) and give rise to interpersonal conflict. Self-control problems stem from such a conflict. They also arise from a conflict at any single point in time between the choices we have available now and our future choices, and occur because our preferences for available choices are inconsistent across time (Ainslie, 1975; Loewenstein, 1996). More specifically, Rachlin (1995) defines self-control as choosing a larger-later (LL) reward over a smaller-sooner (SS) reward. Studies in self-control have found that increasing the delay of the reward, referred to as the delay of gratification (i.e., waiting for a more appropriate time and place to gain a reward), decreases the discounted value of the reward (Mischel et al., 1989). As the reward SS is imminent though, the discounted value of SS is greater than the discounted value of LL, so the person prefers the reward SS over the reward LL and thus we have a reversal of preferences.



In self-control problems the conflict arises out of this reversal of preferences between those choices available immediately (SS) and those available at some time later (LL). Reversal of preferences is seen in experiments on human subjects (Solnick et al., 1980; Millar and Navarick, 1984). To give an example where self-control behaviour is exercised, consider a student and let the LL represent obtaining good grades and the SS going to the pub. At the start of the academic year, for most students the value of getting good grades exceeds that of going to the pub. When invited to the pub, however, the value of SS is higher than their long term goal of getting good grades (LL). If the student exercises self-control then he or she will choose study (LL) over the pub (SS). Self-control behaviour encompasses a resistance to temptation, in this case to go to the pub (SS). Even though this view of self-control has been criticised as too simplistic a representation of self-control in real life, as it models only the situation where the rewards are mutually exclusive and discrete (Mele, 1995; Plaud, 1995), we use it for our modelling in this paper, as it gives a clear preference for one alternative over another. It should also be noted that the brain's ability to recognise or predict rewards is built in, according to experiments by Richmond et al. (2003).

According to Ariely and Wertenbroch (2002) and Rachlin (2000), we recognise that we have self-control problems and try to solve them by precommitment behaviour. Precommitment behaviour can be seen as a desire by people to protect themselves against a future lack of willpower. Results by Ariely and Wertenbroch (2002) from a series of experiments on college students showed that we recognise that we have self-control problems and attempt to control them by setting costly deadlines. These deadlines help to control procrastination, but are not as effective as externally imposed deadlines. Precommitment is more formally defined as making a choice now with the specific aim of denying (or at least restricting) oneself future choices (Rachlin, 1995). A typical example of precommitment is putting an alarm clock away from your bed, to force you to get up to turn it off. There are different levels of precommitment, which determine how successful the precommitment will be. According to Nesse (2001) precommitment is either (i) conditional, e.g., a threat, or (ii) unconditional, e.g., a promise. As he states, whether the precommitment is carried out or not depends on how it is enforced. If the precommitment behaviour is secured, i.e., is enforced by the situation or a third party, then there is a greater degree of certainty that the behaviour will be carried out. If the precommitment behaviour is unsecured, i.e., it depends on the individual's emotion or reputation, then it is less certain that the precommitment behaviour will be carried out.

The internal process conflict suggested by self-control as described above led to a behavioural model being proposed (Rachlin, 2000), based on which we implemented a computational model for studying and explaining self-control through precommitment behaviour (Banfield and Christodoulou, 2005). Our original model consisted of two simple feed forward multilayer perceptron type neural networks with reinforcement learning, representing the higher and lower brain systems viewed as cooperating for the benefit of the organism. In the latest version of the model, which is also presented in this paper, the feed forward multilayer perceptron type neural networks are replaced with two networks of leaky integrate-and-fire (LIF) neurons using a learning scheme based on reinforcement of stochastic synaptic transmission (Seung, 2003). As the structure of the self-control problem can be likened to the Iterated Prisoner's Dilemma (IPD) game, firstly in that cooperation is to defection what self-control is to impulsiveness (Brown and Rachlin, 1999) and secondly in that an interpretation of the IPD is that it demonstrates interpersonal conflict (Kavka, 1991), we implemented the neural networks as two players, learning simultaneously but independently, competing in the IPD game. The IPD has also been used to model cooperation behaviour (Axelrod and Hamilton, 1981). Moreover, based on our developed technique resembling precommitment, whereby the payoffs for the dilemma cases in the IPD payoff matrix are differentially biased, the relationship between precommitment behaviour and the value systems is also investigated.

Fig. 1. A model of self-control behaviour. (a) Self-control as an internal process, from the viewpoint of modern cognitive neuroscience. Information of the temptation (SS) comes into the cognitive system (Arrow 1). This combines with the messages from the lower brain and memory of our larger-later reward (LL). A choice is made, either LL or SS, which results in behaviour (Arrow 2). We are then rewarded with SS or LL (Arrow 3) (based upon Rachlin, 2000). (b) Our proposed computational model of self-control with the higher and lower brain centres modelled as two neural network players competing in the Iterated Prisoner's Dilemma. The State (Arrow 1) summarises information both past and current about the environment; the Action (Arrow 2) is the emergent behaviour of the Agent (the combined networks), and the reinforcer (Arrow 3) is a global reward or penalty signal as a response to the Action (Arrow 2).

2. Methods

2.1. The general model

From the viewpoint of modern cognitive neuroscience, self-control as an internal process can be represented in a highly schematic way as in Fig. 1a (based upon Rachlin, 2000). Arrow 1 in Fig. 1a denotes information coming into the cognitive system located in the higher centre of the brain, which represents the frontal lobes associated with rational behaviour such as planning and control. This information combines with messages from the lower brain, representing the limbic system (including memory from the hippocampus) that is associated with emotion and action selection (O'Reilly and Munakata, 2000; Rachlin, 2000). This travels back down to the lower brain and finally results in behaviour (Arrow 2 in Fig. 1a), which is rewarded or punished by stimuli entering the lower brain (Arrow 3 in Fig. 1a). In this paper, we implement the simple model of Fig. 1a as an architecture of two interacting networks of neurons (Fig. 1b). We also make the theoretical premise that the higher and lower brain functions cooperate, i.e., work together, which is in contrast to the traditional view of the higher brain functioning as a controller overriding the lower brain.



From this viewpoint, a computational model of the neural cognitive system of self-control behaviour is developed. The schematic model of Fig. 1a is implemented in the first instance as two simple feed forward multilayer perceptron type networks (see Section 2.3) and afterwards as two networks of leaky integrate-and-fire neurons (see Section 2.4), simulating two players, representing the higher and lower centres of the brain, competing against each other in the IPD game using reinforcement learning (Fig. 1b). It is an architecture of two networks exhibiting different behaviours to represent the higher versus lower cognitive functions, as depicted in Fig. 1a. The State (corresponding to Arrow 1 in Fig. 1a) summarises information both past and current about the environment; the Action (corresponding to Arrow 2 in Fig. 1a) is the emergent behaviour of the combined networks, and the reinforcer (corresponding to Arrow 3 in Fig. 1a) is a global reward or penalty signal as appropriate to the action. From this model of self-control behaviour, precommitment behaviour can be viewed as resolving some internal conflict between the functions of the lower and the higher centres of the brain by restricting or denying future choices, and hence can be thought of as resolving an internal conflict by prevention. It does this by biasing future choices towards the larger but later reward. By applying a differential bias to the payoff matrix of the IPD, the precommitment effect is simulated in our computational model (see Section 2.2). Complex processes like self-control cannot be understood simply by the operations of individual neurons; they require an understanding of the interaction of multiple components, i.e., networks of neurons responsible for specific functions (Fodor, 1983; Jacobs, 1999). According to O'Reilly and Munakata (2000), the higher cognitive functions are not based on the action of individual neurons in a limited area, but on the outcome of the integrated action of the brain as a whole. For this reason, a holistic approach to modelling the brain as a functionally decomposed system from a top-down perspective is adopted, which is appropriate given the complexity and scope of the behaviour. In this paper, the model explores the neural competition between modules (Jacobs, 1999).


2.2. Mapping the Iterated Prisoner's Dilemma to self-control and modelling of precommitment

In its standard one-shot version, the scenario of the Prisoner's Dilemma (PD) game (Rapoport and Chammah, 1965) unfolds as follows. Two people are arrested by the police under suspicion of a crime. They are kept in separate rooms, where the investigator visits each of them to offer the same deal: if one testifies for the prosecution against the other and the other remains silent, the betrayer goes free and the silent accomplice receives a major conviction. If both remain silent, both prisoners are sentenced on a minor charge. If each betrays the other, each receives a medium sentence. Each prisoner must make the choice of whether to betray the other or to remain silent. Both care much more about their personal freedom than about the welfare of their accomplice. However, neither prisoner knows for sure what choice the other will make.

The PD is a game summarised by the payoff matrix of Fig. 2. There are two players, Row and Column. Each player has the choice of either "cooperate" (C) (remain silent in the prison example) or "defect" (D) (betray the other). For each pair of choices, the payoffs are displayed in the respective cell of the payoff matrix of Fig. 2. The payoff for the Row player is shown first. R is the "reward" payoff given when both cooperate. P is the "punishment" that each receives if both defect. T is the "temptation" payoff that a player receives if he or she alone defects and S is the "sucker's" payoff that a player receives if he or she alone cooperates. The only condition imposed on the payoffs is that they should be ordered such that T > R > P > S. Note that in general, game theory assumes rational players in the sense that each player wants to maximise his or her own payoff. In addition, each player knows the other is rational, knows that the other knows he or she is rational, etc. In game-theoretical terms, DD is the only Nash equilibrium outcome (Nash, 1950), whereas the cooperative CC outcome is the only outcome that satisfies Pareto optimality (Fudenberg and Tirole, 1991). The "dilemma" faced by the players in any valid payoff structure is that, whatever the other does, each one of them is better off defecting than cooperating. But the outcome obtained when both defect is worse for each one of them than the outcome they would have obtained if both had cooperated.

Fig. 2. Payoff matrix for the Prisoner's Dilemma game. Each cell shows (Row payoff, Column payoff):

                Cooperate   Defect
    Cooperate   R,R         S,T
    Defect      T,S         P,P

The Temptation to defect (T) must be better than the Reward for mutual cooperation (R), which must be better than the Punishment for mutual defection (P), which must be better than the Sucker's payoff (S) (Rule: T > R > P > S) (see text for further description).

The IPD is a game where the one-shot PD is played consecutively by two players. The design of the game requires an extra rule such that the cooperative outcome remains Pareto optimal. Namely, 2R > T + S guarantees that the players are not collectively better off by having each player alternate between cooperate and defect.
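For concreteness, the two payoff conditions can be checked mechanically; the following minimal sketch (an illustration, not code from the paper) uses the standard textbook payoff values T = 5, R = 3, P = 1, S = 0, which are assumed here purely for the example:

```python
def is_valid_ipd_payoffs(T, R, P, S):
    """Check the two conditions an IPD payoff matrix must satisfy."""
    one_shot_rule = T > R > P > S     # temptation > reward > punishment > sucker's payoff
    iterated_rule = 2 * R > T + S     # mutual cooperation beats alternating C/D
    return one_shot_rule and iterated_rule

print(is_valid_ipd_payoffs(T=5, R=3, P=1, S=0))   # True
```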

In order to map the PD to a real life situation and in the context of self-control, let us consider a version of the student example (Section 1). A student faces a dilemma whether he or she should stay at home and finish a project that is to be submitted the following morning or go to the pub and celebrate a friend's birthday. According to Kavka (1991), an interpretation of the game would be that the student experiences an interpersonal conflict between his academic-conscious self and his fun-conscious self. These two subagents can either insist on getting their way or compromise to a choice that benefits the organism as a whole. In the given example, the academic-conscious subagent can insist on staying home and studying throughout the night or compromise to a choice involving less studying. On the other hand, the fun-conscious subagent can insist on partying throughout the night or compromise to a less fun outcome. According to Kavka (1991), the two subagents correspond to the two players of the game whereas the actions 'compromise' and 'insist' correspond to the actions 'cooperate' and 'defect' respectively. The interaction between the two subagents can be represented in the matrix of Fig. 3. Kavka (1991) argues that an interpretation like this might provide a psychologically plausible picture of how internal conflict can lead to a suboptimal d outcome (see Fig. 3).

Fig. 3. Payoff matrix for the Prisoner's Dilemma game according to the interpretation by Kavka (1991), where the row player is the academic-conscious side and the column player is the fun-conscious side:

                 Compromise   Insist
    Compromise   c            a
    Insist       b            d

a–d correspond to the following four possible outcomes (mapped to the student example, Section 1): a – go to the pub and have fun; b – stay home and study; c – go to the pub for a quick drink and go back home and study; d – do something different since none of the subagents is willing to give up. Possible d outcomes could be staying at home but not being able to study, or going to the pub but not having a good time because of the guilty feelings.

Based on Rachlin (2000), an interpretation of the game would be that the lower and higher brain regions compete in a PD setting where each brain region can exhibit self-control in order to benefit the organism as a whole or impulsiveness in order to satisfy its own "wants". Given this interpretation, the two brain regions correspond to the two players of the game whereas the behaviours of 'self-control' and 'impulsiveness' correspond to the actions 'cooperate' and 'defect' respectively. In our example, the interaction between the two brain regions of the student can be represented in a matrix identical to the one of Fig. 3, where the subagents' actions 'Compromise' and 'Insist' are now the brain regions' behaviours of 'Self-Control' and 'Impulsiveness' respectively. The outcomes a–d remain the same as in Fig. 3.

Given both psychological interpretations of the game with respect to self-control, it all comes down to a PD competition between two agents; under the first interpretation (Kavka, 1991) these agents correspond to two brain states whereas in the second interpretation (Rachlin, 2000) these agents correspond to two brain regions. For the purposes of the current study, both versions are acceptable since they do not differentiate the implementation of the game.

Subagents value the different outcomes differently, which is the reason the interpersonal conflict arises in the first place. Possible values that each subagent could assign to each outcome are summarised in the payoff matrix of the game shown in Fig. 4. These values are not absolute, in the sense that a different set of subagents might apply different values to the same outcomes. Thus the payoff matrix of Fig. 4 is just one of infinitely many possible matrices. However, the structure of the payoffs of any given matrix should comply with the payoff rule governing the PD.

As explained in Section 1, precommitment is an exercised behaviour where an individual makes a choice at a point in time in order to deny himself or herself future actions. Precommitment requires that people know which of the alternatives is best for them in the long run, so that they precommit to the one with the highest payoff. As already mentioned, the brain's ability to recognise or predict future rewards is built in (Richmond et al., 2003) and past experience enhances this ability.

In order to model the effect of precommitment in our computational model, it is assumed that the agent knows the long term payoffs that result from the different outcomes. In our example, this would mean that the student knows the long term payoffs of all a–d outcomes (see Fig. 3) and additionally we make the assumption that the long term payoff of submitting his or her work is greater than that of going to the pub (the long term payoff from b is greater than the long term payoff from a). This latter knowledge could have been acquired by the student by experiencing the satisfaction of earning good grades through studying and not going to the pub, and the consequences of not submitting his or her work and going to the pub on similar occasions. Therefore, if the payoff matrix of Fig. 4 corresponds to the original payoff matrix (before experience), where both a and b outcomes (see Fig. 3) yield the same total payoff, then the payoff matrix of Fig. 5 corresponds to the payoff matrix that develops through experience, where the payoff from b should be greater than the payoff from a since ψ is a non-negative value. Therefore we make the hypothesis that the payoff matrix with the differential payoff ψ should induce an effect similar to that of precommitment, which is the choice of the action that corresponds to the self-controlled behaviour, i.e., the choice of a more cooperative behaviour by both agents in the game.

Fig. 4. Payoff matrix of the Prisoner's Dilemma game with possible values that each subagent assigns to each outcome (row payoff, column payoff):

                                Compromise/Cooperate (C)   Insist/Defect (D)
    Compromise/Cooperate (C)    4, 4                       -3, 5
    Insist/Defect (D)           5, -3                      -2, -2

Fig. 5. Payoff matrix of the Prisoner's Dilemma with a differential bias ψ modelling the effect of precommitment (see text for more details); ψ is added to both players' payoffs for the DC outcome and subtracted from both for the CD outcome. This payoff matrix also serves as the general payoff matrix of the IPD in the case of spiking networks:

                                Compromise/Cooperate (C)   Insist/Defect (D)
    Compromise/Cooperate (C)    4, 4                       -3-ψ, 5-ψ
    Insist/Defect (D)           5+ψ, -3+ψ                  -2, -2
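To make the construction of Fig. 5 concrete, a minimal sketch (not code from the paper) of how the differential bias ψ modifies the baseline payoffs of Fig. 4; the dictionary keys are (row action, column action) and the values are (row payoff, column payoff):

```python
def biased_payoffs(psi=0.0):
    """Return the payoff matrix of Fig. 5: the baseline payoffs of Fig. 4 with the
    differential bias psi added to both players' payoffs for the DC outcome and
    subtracted from both for the CD outcome."""
    return {
        ("C", "C"): (4, 4),
        ("C", "D"): (-3 - psi, 5 - psi),   # CD: both payoffs reduced by psi
        ("D", "C"): (5 + psi, -3 + psi),   # DC: both payoffs increased by psi
        ("D", "D"): (-2, -2),
    }

print(biased_payoffs(0.9)[("D", "C")])   # (5.9, -2.1)
```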

2.3. Players as multilayer perceptron type networks with reinforcement learning

The neural model of self-control as described in Section 2.1 is implemented as two players competing in a game-theoretical situation (Fig. 1). More specifically, the higher and lower centres of the brain are implemented initially as two simple feed forward multilayer artificial neural networks (ANNs) using reinforcement learning. The ANN representing the higher brain centre is implemented with the Temporal Difference weight update rule (Sutton, 1988) with a look-up table, which maintains a history of previous rewards and includes a discount rate used in determining the value of future rewards. For these reasons, the Temporal Difference rule is viewed as being far-sighted and thus associated with the higher brain processes. This is consistent with the planning and control of higher brain functions (Carver and Scheier, 1998). The ANN representing the lower brain centre is implemented with the Selective Bootstrap weight update rule (Widrow et al., 1973), which has no memory of past rewards and no mechanism for estimating future rewards and hence can be viewed as myopic and as concerned with immediate gratification.

In the Selective Bootstrap rule no explicit reinforcement signal is used; instead the actual output is used as the "desired output" and learning follows in a supervised format (Widrow et al., 1973). The equations used for rewarding and penalising the Selective Bootstrap Network are shown in Appendix A.

Temporal Difference (TD) learning is used as a model of classical conditioning in psychology (Sutton and Barto, 1998). TD is implemented in our study with a look-up table, which is in effect the value function V(St). The state is the opponent's last action, i.e., to cooperate or to defect, and the action is the player's response based on the opponent's last action. The Value function is the probability of receiving the highest payoff, given the current state of the environment and the agent's action. The highest long term reward is achieved if both ANNs learn to cooperate. This is reflected in the look-up table, which is in effect the Value function. Fig. 6 shows the initial values of the Value function for each state/action combination. We use Eq. (A.3) (see Appendix A) to update the look-up table based on the probability of winning from the previous state (Sutton and Barto, 1998). The table is updated at the end of each turn of the game. The Value function and the reinforcement signal are used to calculate the temporal difference error given by Eq. (A.4) (Sutton and Barto, 1998) (see Appendix A). The reinforcement signal is adapted from the payoff matrix of Smith (1982). The temporal difference error is used to update the weights as in Eq. (A.5) (Sutton and Barto, 1998) (see Appendix A).

Fig. 6. The look-up table for Temporal Difference learning in the IPD game with a global reward in the case of non-spiking neural networks. A table of initial values for the learned Value function V, where D stands for defect and C for cooperate. The initial values are based on the fact that to gain the higher long term reward both ANNs must cooperate:

    State D/C   Action D/C   Value Function V(St)
    D           D            0
    C           C            1
    D           C            0.5
    C           D            0.5
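A minimal sketch (an illustration, not the authors' code) of the look-up table update of Eq. (A.3), using the initial values of Fig. 6; the step size of 0.1 follows the paper's statement that all parameters were held at 0.1:

```python
# Value table V(S): state = (opponent's last action, own response), as in Fig. 6.
V = {("D", "D"): 0.0, ("C", "C"): 1.0, ("D", "C"): 0.5, ("C", "D"): 0.5}

def update_value(prev_state, curr_state, alpha=0.1):
    """Eq. (A.3): move V(S_p) towards V(S_c) with step-size parameter alpha."""
    V[prev_state] += alpha * (V[curr_state] - V[prev_state])

update_value(prev_state=("D", "C"), curr_state=("C", "C"))
print(V[("D", "C")])   # 0.55
```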

The ANN is configured with two input nodes to represent the opponent's previous action (a node to represent defection and a node to represent cooperation), and two output nodes representing a response (a node to represent defection and a node to represent cooperation). The nodes act like a binary switch, i.e., a value of 1 indicates that a node is active. For example, if the opponent's previous action was to defect, then the defection node would be set with an input value of 1 and the node representing cooperation with a value of zero. The output is normalised to either a value of 1 or 0. A value of 1 in an output node indicates that this node is active. For example, if the ANN's response is to cooperate, then the value of the defection node is zero and that of the node representing cooperation is 1. The goal is to maximise the net payoff. In the system configuration for the IPD game, the environment (see Fig. 1) contains a process that initialises the input/state to the opponent's previous action (to defect or to cooperate) at the start of each round. The output/action is the ANN's action (to defect or to cooperate). The environment also contains a critic (the reinforcer; Arrow 3 in Fig. 1b) that assigns a global reward or penalty based on the actions of both ANNs. Fig. 7 shows the payoff matrix used in this experiment, where the reward is global, i.e., both ANNs receive the same reward at each round of the game, which is the payoff shown in the matrix of Fig. 7. It has to be noted that this payoff matrix violates the first rule of the IPD game, i.e., T > R > P > S (refer to Fig. 2, Section 2.2); however, it is similar to the payoff matrix of the self-control game of Brown and Rachlin (1999), whose results we aim to emulate, which also used global rewards and likewise violated this rule. The reward for mutual cooperation (CC) is the highest at 2, as this is the desired behaviour; the punishment for mutual defection (DD) is the lowest at 0; the penalty for the Sucker's payoff (CD) and the reward for the temptation to defect (DC) both have a numerical value of 1. The reward for mutual defection is the lowest, representative of the cost of taking the smaller-sooner reward (SS) (see Section 1). Mutual cooperation (CC) yields the highest reward in the long term, representative of the larger-later reward (LL) and hence the highest payoff.

Fig. 7. Payoff matrix of the IPD game with global rewards in the case of non-spiking neural networks. The payoff matrix is explained in terms of higher and lower brain regions rather than row and column players. The reward is a global reward to both ANNs, i.e., both ANNs get the same payoff:

                        Lower cooperates   Lower defects
    Higher cooperates   2 (CC)             1 (CD)
    Higher defects      1 (DC)             0 (DD)

The ANNs are configured as 2-6-6-2 and all parameters are held at 0.1. The bias was implemented as a node whose weight is trainable in the same way as the other nodes in the ANN. The task of learning for this experiment is to find the best response based on the opponent's previous response; this is encoded in the hidden nodes. The best response in this case is to maximise the net payoff for the organism as a whole. The pattern of play, i.e., the sequence of the ANNs' actions (to defect or to cooperate), the payoff for the round and the accumulated payoff for the game were recorded. The number of rounds per game was held at 1000. A trial consists of three games of 1000 rounds. To avoid any first-player advantage or disadvantage, the starting ANN is selected at random. The ANNs are rewarded or penalised at the end of each round. Since the payoff is the same for both ANNs, the networks are not rewarded or penalised at the end of the game. In the case where the differential bias ψ is used, it is assigned a non-negative value at the beginning of a trial for both networks, and is fixed for the duration of that trial.
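As an illustration of the experimental loop just described (not the authors' code; RandomAgent and the act/learn interface are hypothetical stand-ins for the TD and Selective Bootstrap networks), one game with the global reward of Fig. 7 could be sketched as:

```python
import random

class RandomAgent:
    """Stand-in for the TD or Selective Bootstrap ANN (hypothetical interface)."""
    def act(self, opponent_last):
        return random.choice(["C", "D"])
    def learn(self, reward):
        pass   # the real networks update their weights here (Appendix A)

GLOBAL_PAYOFF = {("C", "C"): 2, ("C", "D"): 1, ("D", "C"): 1, ("D", "D"): 0}   # Fig. 7

def play_game(higher, lower, rounds=1000):
    """One IPD game with a global reward: both players receive the same payoff."""
    prev_higher, prev_lower, total = "C", "C", 0   # arbitrary initial 'previous' actions
    for _ in range(rounds):
        a_high = higher.act(opponent_last=prev_lower)
        a_low = lower.act(opponent_last=prev_higher)
        reward = GLOBAL_PAYOFF[(a_high, a_low)]    # same reward to both networks
        higher.learn(reward)
        lower.learn(reward)
        total += reward
        prev_higher, prev_lower = a_high, a_low
    return total

print(play_game(RandomAgent(), RandomAgent()))
```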

2.4. Players as networks of leaky integrate-and-fire neurons with reinforcement of stochastic synaptic transmission

The game simulation is repeated with the two players implemented by two spiking neural networks. The networks' architecture is depicted in Fig. 8. The networks receive a common input of 60 Poisson spike trains grouped in four neural populations. Each network has a hidden layer of 60 leaky integrate-and-fire (LIF) neurons and an output layer of 2 LIF neurons (for LIF neuron details, see Appendix B.1).

Fig. 8. Two spiking neural networks of hedonistic synapses compete in the IPD. Two individual networks with multilayer perceptron architecture receive a common input from 60 neurons, depicted in the middle of the figure. Each network (left and right) has two layers of hedonistic synapses that make feed forward connections between three layers of neurons: the 60 input neurons, 60 leaky integrate-and-fire (LIF) hidden neurons and 2 LIF output neurons. The networks have full connectivity, though only some connections are shown for clarity. Neurons are randomly chosen to be either excitatory or inhibitory. The two networks simulate the corresponding two players of the game.

The networks learn simultaneously but separately, where each network seeks to maximise its own accumulated reward. Learning is implemented through reinforcement of stochastic synaptic transmission (Seung, 2003). Seung makes the hypothesis that microscopic randomness is harnessed by the brain for the purposes of learning. The model of the hedonistic synapse is developed by Seung (2003) along this hypothesis (for implementation details, see Appendix B.2). Briefly, within the framework of the model, each synapse acts as an agent who pursues reward maximisation through the actions of releasing or not releasing a neurotransmitter upon arrival of a presynaptic spike. Each synapse keeps a record of its recent actions through a dynamical variable, the eligibility trace (Klopf, 1982). In order to capture differences in the way the two networks integrate time-related events, network I has an eligibility trace time constant (e) (see Appendix B.2) equal to 20 ms and network II equal to 2 ms.
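The hedonistic-synapse update itself is given in Appendix B.2 and not reproduced here; the following sketch only illustrates the role of the eligibility trace time constant, assuming the standard exponentially decaying trace of Seung (2003) that is incremented at each presynaptic spike by the difference between the release outcome and the release probability:

```python
import math

def decay_trace(e, dt_ms, tau_ms):
    """Exponential decay of the eligibility trace over a time step of dt_ms."""
    return e * math.exp(-dt_ms / tau_ms)

def on_presynaptic_spike(e, released, release_prob):
    """Increment the trace by (release outcome - release probability), the form
    used by Seung's (2003) hedonistic synapse (an assumption about the details)."""
    return e + (1.0 if released else 0.0) - release_prob

# Network I: tau = 20 ms (longer memory of past actions); network II: tau = 2 ms.
e = on_presynaptic_spike(0.0, released=True, release_prob=0.4)   # e = 0.6
print(decay_trace(e, dt_ms=20.0, tau_ms=20.0), decay_trace(e, dt_ms=20.0, tau_ms=2.0))
```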

During each learning round, the networks receive a common input of 60 Poisson spike trains grouped in four neural populations which encode the decisions the two networks made during the previous round. For example, if at a given round network I chooses to defect (D) and network II to cooperate (C), then during the next learning round the networks will receive input that encodes the defect–cooperate (DC) outcome. A learning round lasts as long as the input is presented, which is 500 ms. The decision of each network is encoded in the input by the firing rate of two groups of Poisson spike trains. The first group will fire at 40 Hz if the network cooperated and at 0 Hz otherwise. The second group will fire at 40 Hz if the network defected and at 0 Hz otherwise. Consequently, the total input to the networks during each round is represented by four groups of Poisson neurons, two groups for each network, where each group fires at 40 Hz or 0 Hz accordingly. For any given round there are always two groups of 40 Hz Poisson spike trains, thus preserving a balance in the firing rates of the output neurons at the beginning of learning. Therefore, any significant difference in the firing rate of the output neurons at any time should be induced only by learning and not by differences in the firing rates of the driving input. One can identify here a cyclic procedure which starts when the networks decide, continues by feeding this information to the networks while learning takes place, and ends with a new decision.
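A sketch of the input encoding described above (an illustration, not the authors' code; the equal split of the 60 inputs into four groups of 15 is an assumption):

```python
import numpy as np

def encode_previous_round(net1_action, net2_action, n_per_group=15,
                          duration_ms=500, rate_hz=40.0, dt_ms=1.0, rng=None):
    """Build the 60 input spike trains (four groups of Poisson neurons) encoding
    the previous round's outcome: a group fires at 40 Hz if its associated action
    was taken, and is silent otherwise."""
    rng = rng or np.random.default_rng()
    n_steps = int(duration_ms / dt_ms)
    rates = [rate_hz if net1_action == "C" else 0.0,   # group 1: network I cooperated
             rate_hz if net1_action == "D" else 0.0,   # group 2: network I defected
             rate_hz if net2_action == "C" else 0.0,   # group 3: network II cooperated
             rate_hz if net2_action == "D" else 0.0]   # group 4: network II defected
    spikes = np.zeros((4 * n_per_group, n_steps), dtype=bool)
    for g, r in enumerate(rates):
        p = r * dt_ms / 1000.0                         # spike probability per time bin
        spikes[g * n_per_group:(g + 1) * n_per_group] = rng.random((n_per_group, n_steps)) < p
    return spikes

inp = encode_previous_round("D", "C")       # the DC outcome from the example in the text
print(inp.shape, inp[:15].mean() * 1000)    # (60, 500), ~0 Hz for group 1
```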

At the end of each learning round the networks decide whether to cooperate or defect for the next round of the game. Decisions are carried out according to the value that each network assigns to the two actions, and these values are reflected by the firing rates of the output neurons at the end of each learning round. The value of cooperation for networks I and II is taken to be proportional to the firing rate of output neurons 1 and 3 respectively. Similarly, the value of defection for networks I and II is taken to be proportional to the firing rate of output neurons 2 and 4 respectively. At the end of each learning round the firing rates of the competing output neurons are compared, for each network separately, and the decisions are drawn.
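A minimal sketch of the decision rule just described (not from the paper; the tie-breaking behaviour is an assumption):

```python
def decide(rate_cooperate_hz, rate_defect_hz):
    """A network's decision for the next round: compare the firing rates of its two
    output neurons at the end of the learning round.  Defecting on equal rates is
    an assumed tie-break, not stated in the paper."""
    return "C" if rate_cooperate_hz > rate_defect_hz else "D"

# Network I compares output neurons 1 (cooperate) and 2 (defect);
# network II compares output neurons 3 (cooperate) and 4 (defect).
print(decide(18.0, 12.0))   # 'C'
```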

When the two networks decide their play for the next round of the IPD, they each receive a distinct payoff given their actions and according to the payoff matrix of the game (Fig. 5). The payoff each network receives as a result of their combined actions at the previous round of the game is also the global reinforcement signal that will train the networks during the next learning round and thus guide the networks to their next decisions. For example, if the outcome of the previous round was CD, then according to the payoff matrix (with differential bias ψ equal to 0) network I should receive a payoff of -3 for cooperating and network II a payoff of +5 for defecting. During the next learning round network I receives a penalty of -3 and network II a reward of +5. The reinforcement signals are administered to the networks throughout the learning round as prescribed by the learning algorithm of hedonistic synapses. Each network was reinforced for every spike of the output neuron that was "responsible" for the decision at the last round and therefore for the payoff received. Hence in the CD case, network I would receive a penalty of -3 for every spike of output neuron 1 (remember that the firing rate of output neuron 1 reflects the value that network I has for the action of cooperation) and network II would receive a reward of +5 for every spike of output neuron 4 (remember that the firing rate of output neuron 4 reflects the value that network II has for the action of defection). The networks therefore learn through global reinforcement signals which strengthen the value of an action that elicited a reward and weaken the value of an action that resulted in a penalty. When the differential bias (ψ) is different from 0 during a game, it is always added to both networks' payoffs for a DC outcome and subtracted from both payoffs for a CD outcome.

In order to introduce competition between output neurons during a learning round, additional global reinforcement signals are administered to the networks for every spike of the output neurons that were not "responsible" for the decision at the last round. In the CD case, an additional reward of +1.5 is provided to network I for every spike of output neuron 2 and an additional penalty of -1.5 is provided to network II for every spike of output neuron 3. The value of the action that was not chosen by each network is therefore also updated, by a reinforcement signal of opposite sign. The magnitude of 1.5 is chosen to be small enough such that, firstly, any changes to the values of the players' actions are primarily induced by the reinforcement signals provided according to the payoff matrix of the game and, secondly, this complementary reinforcement signal does not cause a violation of the payoff rules that should govern the IPD.

Overall, during a learning round each network receives global, opposite-in-sign reinforcements for spikes of both of its output neurons. One of the two signals is due to the payoff matrix of the game and its purpose is to "encourage" or "discourage" the action that elicited reward or penalty; the other signal is complementary and its purpose is to "encourage" or "discourage" the action that could have elicited reward or penalty had it been chosen in the previous round of the game (see Appendix B.3 for an overview of the reinforcement signals the networks receive during a learning round).
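A compact, runnable sketch (illustration only; the reinforce() stub is hypothetical) of how the per-spike reinforcement signals described above would be assigned after a CD outcome with ψ = 0:

```python
def reinforce(network, output_neuron, signal):
    """Stub: in the full model this reinforcement would scale the hedonistic-synapse
    updates for every spike of the given output neuron during the learning round."""
    print(f"network {network}, neuron {output_neuron}: reinforcement {signal:+}")

# Outcome of the previous round: network I cooperated (C), network II defected (D);
# payoffs taken from the matrix of Fig. 5 with psi = 0.
payoff_net1, payoff_net2 = -3, +5
COMPLEMENT = 1.5    # magnitude of the complementary signal

reinforce(1, output_neuron=1, signal=payoff_net1)    # action taken by network I (cooperate)
reinforce(1, output_neuron=2, signal=+COMPLEMENT)    # action not taken by network I
reinforce(2, output_neuron=4, signal=payoff_net2)    # action taken by network II (defect)
reinforce(2, output_neuron=3, signal=-COMPLEMENT)    # action not taken by network II
```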

3. Results and discussion

3.1. Non-spiking neural networks: players as multilayer perceptron type networks with reinforcement learning

For the system configuration described in Section 2.3 (non-spiking neural networks) a trial is repeated 3 times and the accumulated payoff and the average payoff for the three games are recorded (the payoff in this experiment is the same for both ANNs, as both networks receive the same reward). Fig. 9 shows the net payoff and the range, i.e., the minimum and maximum payoff for each round. Only the first 150 rounds are shown; even though each game was played for 1000 rounds, play continued in the same way, that is, the payoff continued to increase and play tended towards cooperation. This is reflected in the patterns of play in the three trials as shown in Fig. 10.

Fig. 9. Results for the TD Network and Selective Bootstrap Network (non-spiking) playing the IPD game where both networks receive the same reward, showing the net accumulated payoff (thick black line) for the IPD game with a global reward and the range, i.e., the minimum and maximum payoff values for each round, as the shaded area (accumulated payoff plotted over the first 150 rounds played).

For this case of the IPD game, which is played with a global reward, i.e., both ANNs receive the same reward, play was symmetric with cooperation the dominant behaviour from both ANNs. The results show that the accumulated payoff is higher for both ANNs than in an experiment with local rewards (Banfield, 2006), suggesting that the ANNs performed better in terms of desired behaviour, i.e., a global reward promotes mutual cooperation. In summary, with this arrangement (both ANNs receiving the same reward) there is less variability in the ANNs' behaviour and the accumulated payoff increases, as opposed to the net payoff in the IPD with a local reward (Banfield, 2006). This can be explained as follows: in the IPD game with a global reward there is a tendency for both ANNs to cooperate, hence the ANNs receive the higher reward for mutual cooperation. In games of asymmetric play, i.e., one ANN defects at random and the other cooperates, the reward is less and hence the net payoff is less for the organism.

The next experiment was to add the differential bias, which models the precommitment effect, to the payoff matrix of the IPD of Fig. 7 (i.e., for the system configuration described in Section 2.3). Fig. 11 compares the effect on the accumulated payoff of increasing the value of the differential bias ψ. The results show that increasing the level of the differential bias ψ, implemented as described in Section 2.2, would seem to promote cooperation and hence the accumulated payoff increases, as the ANNs receive the higher reward for mutual cooperation.

The differential bias would seem not only to promote cooperation, but also to address the internal conflict represented by either the DC or CD situation. In particular, the middling behaviour of staying at home and working (DC) would appear to increase as the differential bias ψ approaches the upper limit of the range tested (0.9). This can perhaps be explained as follows: as the reward is global, i.e., both ANNs receive the same reward, the differential bias affects both ANNs; hence increasing ψ increases the reward for the middling behaviour of DC, bringing it closer to the reward for mutual cooperation, whilst at the same time decreasing the reward for the more negative behaviour of CD, bringing it closer to the reward for mutual defection. The result is that instead of four classes of rewards the organism is faced with just two classes of rewards: one with a tendency for cooperation and one with a tendency for defection.

Fig. 10. Pattern of play for an IPD game with global reward in the case of non-spiking neural networks. Percentage breakdown by trial of a certain type of play where players receive the same reward; for example, for CC a value of 28 indicates that 28% of the time the networks played a game where both networks cooperate.

Fig. 11. The effect of increasing the differential bias (precommitment effect), which is added to the payoff matrix in the IPD game in the case of non-spiking neural networks (accumulated payoff over 1000 rounds for ψ = 0, 0.1 and 0.9). When a differential bias ψ is applied to the rewards for asymmetric play, i.e., CD or DC, the results suggest that increasing ψ promotes cooperative behaviour leading to the reward for mutual cooperation.

3.2. Spiking neural networks: players as networks of leaky integrate-and-fire neurons with reinforcement of stochastic synaptic transmission

For the system configuration described in Section 2.4 (spiking neural networks), a single game of the IPD consists of 100 rounds during which the two networks seek to maximise their individual accumulated payoff by cooperating or defecting at every round of the game. The purpose of the simulation is to investigate the requirements needed for the networks to exhibit a strong CC behaviour with respect to the differential bias which is applied to the payoff matrix. Since the IPD is linked to self-control in that cooperation is to defection what self-control is to impulsiveness (Brown and Rachlin, 1999), a stronger CC behaviour would imply an increased self-controlled behaviour by the system. Additionally, according to Kavka's interpretation of the game (Kavka, 1991), an individual exhibits self-control when both subagents compromise, corresponding again to the CC outcome. Therefore the simulations investigate the degree of self-control exhibited during a game with respect to the differential bias applied to the payoff matrix. Moreover, since precommitment has the effect of choosing the action that corresponds to the self-controlled behaviour, the simulations also investigate the degree of the precommitment effect with respect to the differential bias applied to the payoff matrix.

The results of three different simulations, corresponding to three different values of the differential bias, are presented in Fig. 12. It is clearly shown that as the value of the differential bias increases, the accumulated payoff received by both networks increases. Most importantly, as the value of the differential bias increases, the degree of self-control increases, since the percentage of the times the two networks cooperated also increased.

Fig. 12. The effect of the differential bias (ψ) on the outcome of the IPD game when simulated by two spiking neural networks. The networks seek to maximise their individual accumulated reward during 100 rounds of the game. The figure depicts the total accumulated reward, gained by both networks, throughout the game. The three different plots correspond to three different values of the differential bias. For a differential bias equal to zero, the networks accumulate a total payoff of 392, 368 of which is due to the CC outcome, since in 46 out of 100 rounds both networks choose to cooperate. When the differential bias is equal to 0.1, the results improve slightly, with a total accumulated payoff equal to 430, 392 of which is due to the CC outcome, since in 49 out of 100 rounds both networks choose to cooperate. Finally, for a large differential bias equal to 0.9, the networks accumulate a significantly bigger total payoff equal to 543, 456 of which is due to the CC outcome, since in 57 out of 100 rounds both networks choose to cooperate. (Annotated outcome percentages: ψ = 0: CC 46%, CD 12%, DC 28%, DD 14%; ψ = 0.1: CC 49%, CD 16%, DC 24%, DD 7%; ψ = 0.9: CC 57%, CD 6%, DC 30%, DD 7%.)

Effectively, the use of the differential bias alters the payoff matrix of the game and therefore the values that each player assigns to the altered outcomes of the game. The addition of the differential bias to the payoffs of both players for the DC outcome increases both the sucker's payoff for the Column player and the temptation payoff for the Row player. However, the incentive to cooperate changes in opposite directions for the two players, since the choice to cooperate is not as "painful" for the Column player in case the Row player chooses to defect, whereas the Row player becomes more tempted to defect since the temptation payoff increases. Moreover, given Rachlin's interpretation of the game (Rachlin, 2000), where the Row player corresponds to the higher brain and the Column player corresponds to the lower brain, the effect of the added differential payoff to the DC outcome would be that the higher brain becomes more impulsive whereas the lower brain becomes more self-controlled. On the other hand, the subtraction of the differential bias from both players' payoffs for the CD outcome decreases both the sucker's payoff for the Row player and the temptation payoff for the Column player. As a result, the Row player becomes less motivated to cooperate, as he or she faces an even lower sucker's payoff, and the Column player becomes less tempted to defect. With respect to the interpretation of the game, the higher brain becomes less motivated to show self-control whereas the lower brain becomes less impulsive. Overall, the modification of the payoff matrix changes the incentive to cooperate for the Column player in a positive way and for the Row player in a negative way. Again, this would be mapped to an overall effect where the lower brain is more likely to exhibit self-control in order to benefit the organism as a whole and the higher brain is more likely to be impulsive so that it satisfies its own "wants". This overall effect is reflected in the results by an increase in the DC outcome which takes place as the value of the differential bias increases. In addition, the increase of the CC outcome is a result of a greater contribution by the Column player than by the Row player. Therefore in our case, the individual becomes more self-controlled because of some alteration in the value system of its lower brain region and not of its higher brain region.

4. Conclusions

In this paper we implemented a computational model of self-control with the higher and lower brain functions (representing the frontal lobes and the limbic system respectively) implemented as players competing in the Iterated Prisoner's Dilemma game. In our model we have two implementation versions of the players. In the first version there are two non-spiking neural networks of simple feed forward multilayer type with reinforcement learning, one with the selective bootstrap weight update rule representing the lower brain and the other with the temporal difference weight update rule representing the higher brain. In the second version, there are two spiking neural networks implemented with leaky integrate-and-fire neurons with learning based on stochastic synaptic transmission. The differentiating element between the two brain centres in this implementation is based on the memory of past actions determined by an eligibility trace time constant. We have also implemented a technique resembling the precommitment effect, whereby the payoffs for the dilemma cases in the IPD payoff matrix are differentially biased (increased or decreased). The results showed that irrespective of whether the implementation was with spiking or non-spiking neural networks, learning promotes cooperation between the higher and lower brain centres and thus leads to greater self-control. In addition, it is shown that increasing the precommitment effect (through increasing the differential bias) increases the probability of cooperating with oneself in the future. Thus precommitment further enhances cooperation between the higher and lower brain and thus leads to even greater self-control. This supports the empirical results of Baker and Rachlin (2001), which showed that increasing the probability of reciprocation increases cooperation. Effectively, with our results one may suggest that the implementation of the effect of precommitment results in behaviour similar to the probability of reciprocation in the experiment of Baker and Rachlin (2001) and hence increases cooperation.

In addition, given that with both implementation versions of the model, i.e., with spiking and non-spiking neural networks, the main results are the same, it could be suggested that for studying complex high-level behaviours (like self-control through precommitment), it is not worth increasing the complexity of models by adding more biological realism.

We have also shown in this paper that neural networks (spiking or non-spiking) show a capacity for playing games along the lines of game theory in a way that resembles the behaviour of real players. During most simulations, the networks managed to adapt to the challenges of the game and make decisions according to the other player's decisions in order to maximise their accumulated payoff. Most importantly, they ‘‘displayed intelligence” because when the game flow allowed for the Pareto optimum solution to be reached they ‘‘took advantage of the possibility” and settled on that solution by choosing cooperation for the rest of the game.

Acknowledgements

We gratefully acknowledge the support of the University of Cyprus for a Small Size Internal Research Programme Grant and the Cyprus Research Promotion Foundation as well as the European Union Structural Funds for Grant PENEK/ENISX/0308/82.

Appendix A. Formulae for the implementation of the players as multilayer perceptron type networks with reinforcement learning

Eq. (A.1) is used for rewarding a Selective Bootstrap Network if the action was deemed a success:

\Delta w(t) = \eta \, [a(t) - s(t)] \, x(t) \qquad (A.1)

where time t is represented in terms of the number of completed rounds, a(t) is the actual output at time t used as the ‘‘desired output”, s(t) is the sum of the weighted inputs, w(t) is the weight at time t, η is the learning rate and x(t) is the input at time t.

Eq. (A.2) is used for penalisation of the Selective Bootstrap Network if the action was deemed a failure:

\Delta w(t) = \eta \, [1 - a(t) - s(t)] \, x(t) \qquad (A.2)

where a(t) is the actual output (its complement, 1 − a(t), serves as the desired output at time t), s(t) the sum of the weighted inputs, w(t) the weight at time t, η the learning rate and x(t) the input at time t.
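As an illustration only, the following sketch applies Eqs. (A.1) and (A.2) to a single linear unit. The thresholding used to obtain the binary output a(t) and the function name are assumptions made for the sketch, not details taken from the paper.

import numpy as np

def selective_bootstrap_update(w, x, eta, success):
    """One selective-bootstrap weight update (Eqs. A.1 / A.2, single linear unit).

    w       : weight vector
    x       : input vector for the completed round
    eta     : learning rate
    success : whether the round's action was deemed a success
    """
    s = np.dot(w, x)                      # sum of the weighted inputs, s(t)
    a = 1.0 if s > 0 else 0.0             # actual binary output a(t) -- assumed thresholding
    target = a if success else 1.0 - a    # on failure, the complement of a(t) acts as target
    return w + eta * (target - s) * x     # Δw(t) = η [target − s(t)] x(t)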

Eq. (A.3) is used to update the value table in the TD Network based on the probability of winning from the previous state (Sutton and Barto, 1998):

V(S_p) \leftarrow V(S_p) + \alpha \, [V(S_c) - V(S_p)] \qquad (A.3)

where V(S_p) is the Value function of the state for the previous round (p), defined by the opponent's previous action (either to defect or cooperate) and the network's matching action (either to defect or cooperate). Similarly, V(S_c) is the Value function of the state for the current round (c); α is the step-size parameter (or learning rate).

Eq. (A.4) uses the Value function and the reinforcement signal to calculate the temporal difference error (Sutton and Barto, 1998):

\delta_t = r_{t+1} + \gamma \, V(S_{t+1}) - V(S_t) \qquad (A.4)

where time t is represented by the number of rounds completed, r_{t+1} is the reinforcement signal at time t + 1, γ is the discounting factor of future rewards, V(S_{t+1}) is the Value function at time t + 1 and V(S_t) the Value function at time t.

Eq. (A.5) updates the weights in the TD Network using the temporal difference error (Sutton and Barto, 1998):

\Delta w(t) = \alpha \, \delta_t \, x(t) \qquad (A.5)

where α is the step-size parameter and x(t) is the input at time t to the neuron.
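A minimal sketch combining Eqs. (A.3)–(A.5) is given below. The class name, state encoding, initial values and parameter defaults are illustrative assumptions, not a reproduction of the paper's implementation.

import numpy as np

class TDPlayer:
    """Minimal sketch of the TD-learning player (Eqs. A.3-A.5)."""

    def __init__(self, n_states, n_inputs, alpha=0.1, gamma=0.9):
        self.V = np.zeros(n_states)       # value table over game states
        self.w = np.zeros(n_inputs)       # weights of the linear output unit
        self.alpha, self.gamma = alpha, gamma

    def update_value_table(self, s_prev, s_curr):
        # Eq. (A.3): V(S_p) <- V(S_p) + alpha [V(S_c) - V(S_p)]
        self.V[s_prev] += self.alpha * (self.V[s_curr] - self.V[s_prev])

    def td_error(self, r_next, s_t, s_next):
        # Eq. (A.4): delta_t = r_{t+1} + gamma V(S_{t+1}) - V(S_t)
        return r_next + self.gamma * self.V[s_next] - self.V[s_t]

    def update_weights(self, delta, x):
        # Eq. (A.5): Δw(t) = alpha * delta_t * x(t)
        self.w += self.alpha * delta * x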

Appendix B. Spiking neural network numerical simulations

B.1. Leaky integrate-and-fire neuron details

The leaky integrate-and-fire (LIF) neurons of the spiking neural networks are modelled with the LIF equation:

C \, \frac{dV_i}{dt} = -g_L (V_i - V_L) - \sum_j G_{ij} (V_i - E_{ij}) \qquad (B.1)

where V_L = −74 mV, g_L = 25 nS and C = 500 pF, giving a membrane time constant of τ = 20 ms. The differential equations are integrated using an exponential Euler update with a 0.5 ms time step. When the membrane potential V_i reaches the threshold value of −54 mV, it is reset to −60 mV (values as in the numerical simulations by Seung, 2003). The reversal potential E_ij of the synapse from neuron j to neuron i is set to either 0 or −70 mV, depending on whether the synapse is excitatory or inhibitory. The synaptic conductances are updated via ΔG_ij = W_ij r_ij, where r_ij is the neurotransmitter release variable that takes the value of 1 with probability equal to the probability that the synapse from neuron j to i releases neurotransmitter (when j spikes) and 0 otherwise (Seung, 2003). In the absence of presynaptic spikes, G_ij decays exponentially with time constant τ_s = 5 ms. W_ij are the ‘‘weights”, which do not change over time and are chosen randomly from an exponential distribution with mean 14 nS for excitatory synapses and 45 nS for inhibitory synapses.
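The sketch below shows one possible exponential-Euler step of Eq. (B.1) for a single neuron, using the parameter values quoted above; the network wiring, the random draw of the weights W_ij and all variable names are assumptions made for illustration rather than the paper's actual code.

import numpy as np

# Parameter values as stated in Appendix B.1
C, g_L, V_L = 500e-12, 25e-9, -74e-3          # farads, siemens, volts
V_thresh, V_reset = -54e-3, -60e-3
dt, tau_s = 0.5e-3, 5e-3

def lif_step(V, G, E):
    """One exponential-Euler step of Eq. (B.1) for a single neuron.
    G, E: arrays of synaptic conductances and reversal potentials."""
    g_tot = g_L + G.sum()
    V_inf = (g_L * V_L + np.dot(G, E)) / g_tot     # steady-state voltage for fixed conductances
    V = V_inf + (V - V_inf) * np.exp(-dt * g_tot / C)
    spiked = V >= V_thresh
    if spiked:
        V = V_reset
    return V, spiked

def decay_conductances(G):
    """Exponential decay of synaptic conductances between presynaptic spikes."""
    return G * np.exp(-dt / tau_s)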

B.2. Implementing hedonistic synapses

Upon arrival of a presynaptic spike, a synapse can take two possible actions with complementary probabilities: release a neurotransmitter with probability p or fail to release with probability 1 − p. The release parameter q is monotonically related to p by the sigmoidal function given by:

p = \frac{1}{1 + e^{-q}} \qquad (B.2)

Each synapse keeps a record of its recent actions through a dynamical variable, the eligibility trace (Klopf, 1982). It increases by 1 − p with every release and decreases by p with every failure; otherwise it decays exponentially with a given time constant. When a global reinforcement signal h is given to the network, it is subsequently communicated to each synapse, which modifies its release probability according to the nature of the signal (reward or penalty) and its recent releases and failures. Learning is driven by modifying q according to the rule given by:

\Delta q = \eta \, h \, e \qquad (B.3)

where η is the learning rate and e the eligibility trace. Synapses effectively learn by computing a stochastic approximation to the gradient of the average reward. Moreover, if each synapse behaves hedonistically then the network as a whole behaves hedonistically, pursuing reward maximisation.
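The sketch below illustrates how a single hedonistic synapse could implement Eqs. (B.2) and (B.3) together with the eligibility-trace bookkeeping described above. The trace time constant tau_e, the class name and the parameter defaults are illustrative assumptions; the paper's actual trace time constants are what differentiate the two brain centres and are not reproduced here.

import numpy as np

class HedonisticSynapse:
    """Sketch of a hedonistic synapse (Eqs. B.2-B.3)."""

    def __init__(self, q=0.0, tau_e=0.5, eta=0.1):
        self.q, self.e = q, 0.0
        self.tau_e, self.eta = tau_e, eta

    def release_probability(self):
        # Eq. (B.2): p = 1 / (1 + exp(-q))
        return 1.0 / (1.0 + np.exp(-self.q))

    def on_presynaptic_spike(self):
        """Stochastically release or fail, and update the eligibility trace."""
        p = self.release_probability()
        released = np.random.rand() < p
        self.e += (1.0 - p) if released else -p   # +(1-p) on release, -p on failure
        return released

    def decay(self, dt):
        self.e *= np.exp(-dt / self.tau_e)        # exponential decay between events

    def on_reinforcement(self, h):
        # Eq. (B.3): Δq = η h e
        self.q += self.eta * h * self.e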

B.3. Overview of the reinforcement signals the networks receive during a learning round

What follows is an overview of the reinforcement signals that the networks receive during a learning round according to all possible outcomes of the previous round of the game, assuming a differential bias equal to 0. Reinforcement is administered for every spike of output neurons 1–4. For the CC outcome, network I receives a reward of +4 for every spike of output neuron 1 and a penalty of −1.5 for every spike of output neuron 2, whereas network II receives a reward of +4 for every spike of output neuron 3 and a penalty of −1.5 for every spike of output neuron 4. For the CD outcome, network I receives a penalty of −3 for every spike of output neuron 1 and a reward of +1.5 for every spike of output neuron 2, whereas network II receives a penalty of −1.5 for every spike of output neuron 3 and a reward of +5 for every spike of output neuron 4. For the DC outcome, network I receives a reward of +5 for every spike of output neuron 1 and a penalty of −1.5 for every spike of output neuron 2, whereas network II receives a penalty of −3 for every spike of output neuron 3 and a reward of +1.5 for every spike of output neuron 4. Finally, for the DD outcome, network I receives a reward of +1.5 for every spike of output neuron 1 and a penalty of −2 for every spike of output neuron 2, whereas network II receives a reward of +1.5 for every spike of output neuron 3 and a penalty of −2 for every spike of output neuron 4.
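For reference, the reinforcement values listed above can be collected into a simple lookup structure; the sketch below does exactly that for differential bias 0, with the dictionary layout and function name being illustrative assumptions.

# Per-spike reinforcement signals for each outcome of the previous round
# (differential bias = 0), as listed in Appendix B.3.
# Keys: outcome of the previous round; values: reinforcement per spike
# of output neurons 1-4 (neurons 1-2 belong to network I, 3-4 to network II).
REINFORCEMENT = {
    'CC': {1: +4.0, 2: -1.5, 3: +4.0, 4: -1.5},
    'CD': {1: -3.0, 2: +1.5, 3: -1.5, 4: +5.0},
    'DC': {1: +5.0, 2: -1.5, 3: -3.0, 4: +1.5},
    'DD': {1: +1.5, 2: -2.0, 3: +1.5, 4: -2.0},
}

def reinforcement_for_spike(prev_outcome, output_neuron):
    """Return the reward (+) or penalty (-) applied for a spike of the given
    output neuron, given the previous round's outcome."""
    return REINFORCEMENT[prev_outcome][output_neuron]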

References

Ainslie, G., 1975. Specious reward: a behavioural theory of impulsiveness and impulse control. Psychological Bulletin 82, 463–496.

Ariely, D., Wertenbroch, K., 2002. Procrastination, deadlines, and performance: self-control by precommitment. Psychological Science 13, 219–224.

Axelrod, R., Hamilton, W.D., 1981. The evolution of cooperation. Science 211, 1390–1396.

Baker, F., Rachlin, H., 2001. Probability of reciprocation in repeated Prisoner's dilemma games. Journal of Behavioral Decision Making 14, 51–67.

Banfield, G., 2006. Simulation of Self-Control Through Precommitment Behaviour in an Evolutionary System. Ph.D. Thesis, Birkbeck, University of London (<www.dcs.bbk.ac.uk/research/recentphds/banfield.pdf>).

Banfield, G., Christodoulou, C., 2005. Can self-control be explained through games? In: Cangelosi, A., Bugmann, G., Borisyuk, R. (Eds.), Modelling Language, Cognition and Action, Progress in Neural Processing, vol. 16. World Scientific, Singapore, pp. 321–330.

Bjork, J.M., Knutson, B., Fong, G.W., Caggiano, D.M., Bennett, S.M., Hommer, D.W., 2004. Incentive-elicited brain activation in adolescents: similarities and differences from young adults. Journal of Neuroscience 24, 1793–1802.

Brown, J., Rachlin, H., 1999. Self-control and social cooperation. Behavioural Processes 47, 65–72.

Carver, C.S., Scheier, M.F., 1998. On the Self-Regulation of Behavior. Cambridge University Press, Cambridge.

Fodor, J.A., 1983. The Modularity of Mind. MIT Press, Cambridge, MA.

Fudenberg, D., Tirole, J., 1991. Game Theory. MIT Press, Cambridge, MA.

Jacobs, R.A., 1999. Computational studies of the development of functionally specialized neural modules. Trends in Cognitive Sciences 3, 31–38.

Kavka, G., 1991. Is individual choice less problematic than collective choice? Economics and Philosophy 7, 143–165.

Klopf, A.H., 1982. The Hedonistic Neuron: A Theory of Memory, Learning and Intelligence. Hemisphere Publishing Corporation, Washington, DC.

Loewenstein, G.F., 1996. Out of control: visceral influences on behaviour. Organizational Behaviour and Human Decision Processes 65, 272–292.

Mele, A., 1995. Conceptualizing self-control. Behavioral and Brain Sciences 18, 136–137.

Millar, A., Navarick, D.J., 1984. Self control and choice in humans: effects of video game playing as a positive reinforcer. Learning and Motivation 15, 203–218.

Mischel, W., Shoda, Y., Rodriguez, M., 1989. Delay of gratification in children. Science 244, 933–938.

Morgan, C.T., King, R.A., Robinson, N.M., 1979. Introduction to Psychology, sixth ed. McGraw-Hill Kogakusha Ltd., Tokyo.

Nash, J.F., 1950. Equilibrium points in N-person games. Proceedings of the National Academy of Sciences of the United States of America 36, 48–49.

Nesse, R.M., 2001. Natural selection and the capacity for subjective commitment. In: Nesse, R.M. (Ed.), Evolution and the Capacity for Commitment. Russell Sage Foundation, New York, NY, pp. 1–44.

O’Reilly, R.C., Munakata, Y., 2000. Computational Explorations in Cognitive Neuroscience: Understanding the Mind by Simulating the Brain. MIT Press, Cambridge, MA.

Plaud, J.J., 1995. The behavior of self-control. Behavioral and Brain Sciences 18, 139–140.

Rachlin, H., 1995. Self-control: beyond commitment. Behavioral and Brain Sciences 18, 109–159.

Rachlin, H., 2000. The Science of Self-Control. Harvard University Press, Cambridge, MA.

Rapoport, A., Chammah, A.M., 1965. Prisoner's Dilemma: A Study in Conflict and Cooperation. University of Michigan Press, Ann Arbor, MI.

Richmond, B.J., Liu, Z., Shidara, M., 2003. Predicting future rewards. Science 301, 179–180.

Scheier, M.F., Carver, C.S., 1988. A model of behavioral self-regulation: translating intention into action. In: Berkowitz, L. (Ed.), Advances in Experimental Social Psychology, vol. 21. Academic Press, San Diego, CA, pp. 303–339.

Seung, H.S., 2003. Learning in spiking neural networks by reinforcement of stochastic synaptic transmission. Neuron 40, 1063–1073.

Smith, J.M., 1982. Evolution and the Theory of Games. Cambridge University Press, Cambridge.

Solnick, J.W., Kannenberg, C., Eckerman, D.A., Waller, M.B., 1980. An experimental analysis of impulsivity and impulse control in humans. Learning and Motivation 11, 61–77.

Sutton, R.S., 1988. Learning to predict by the method of temporal differences. Machine Learning 3, 9–44.

Sutton, R.S., Barto, A.G., 1998. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA.

Thaler, R.H., Shefrin, H.M., 1981. An economic theory of self-control. Journal of Political Economy 89, 392–406.

Widrow, B., Gupta, N.K., Maitra, S., 1973. Punish/reward: learning with a critic in adaptive threshold systems. IEEE Transactions on Systems, Man and Cybernetics 3, 455–465.