Opponent Modelling in the Game of Tron using Reinforcement ...

Opponent Modelling in the Game of Tron using Reinforcement Learning

Stefan J.L. Knegt1, Madalina M. Drugan2 and Marco A. Wiering1

1Institute of Artificial Intelligence and Cognitive Engineering, University of Groningen, The Netherlands2ITLearns.Online, The Netherlands

[email protected], [email protected], [email protected]

Keywords: Reinforcement Learning, Opponent Modelling, Q-Learning, Computer Games

Abstract: In this paper we propose the use of vision grids as state representation to learn to play the game Tron usingneural networks and reinforcement learning. This approach speeds up learning by significantly reducing thenumber of unique states. Furthermore, we introduce a novel opponent modelling technique, which is used topredict the opponent’s next move. The learned model of the opponent is subsequently used in Monte-Carloroll-outs, in which the game is simulated n-steps ahead in order to determine the expected value of conducting acertain action. Finally, we compare the performance using two different activation functions in the multi-layerperceptron, namely the sigmoid and exponential linear unit (Elu). The results show that the Elu activationfunction outperforms the sigmoid activation function in most cases. Furthermore, vision grids significantlyincrease learning speed and in most cases this also increases the agent’s performance compared to when thefull grid is used as state representation. Finally, the opponent modelling technique allows the agent to learna predictive model of the opponent’s actions, which in combination with Monte-Carlo roll-outs significantlyincreases the agent’s performance.

1 INTRODUCTION

Reinforcement learning algorithms allow an agentto learn from its environment and thereby optimise itsbehaviour (Sutton and Barto, 1998). Such environ-ments can be modelled as a Markov Decision Pro-cess (MDP) (van Otterlo and Wiering, 2012; Bell-man, 1957), where an agent tries to learn an op-timal policy from trial and error. Reinforcementlearning algorithms have been widely applied in thearea of games. A well-known example is backgam-mon (Tesauro, 1995), where reinforcement learninghas led to great success. This paper examines the ef-fectiveness of reinforcement learning for the game ofTron. One of the main challenges of using reinforce-ment learning in games is the large size of the statespace. Another challenge is how an agent can learn tomodel its opponent effectively and use this opponent’smodel to significantly increase its performance.

To deal with large state spaces, in many casesthe agent is constructed using a multi-layer percep-tron (MLP) (Rumelhart et al., 1988). The MLP willreceive the current game state as its input and hasto determine the move that will result in the high-est reward in the long term. The combination of anMLP and reinforcement learning has showed promis-ing results, for instance in Backgammon (Tesauro,

1995), Ms. PacMan (Bom et al., 2013) and Star-craft (Shantia et al., 2011). Furthermore, deep rein-forcement learning using neural networks with manylayers have also obtained impressive results on a vari-ety of games (Mnih et al., 2013).

In most research on learning to play games withconnectionist reinforcement learning, the MLP usesonly the well-known sigmoid activation function.However, there are other choices such as the expo-nential linear unit (Elu). The exponential linear unithas three advantages compared to the sigmoid func-tion (Clevert et al., 2015). It alleviates the vanishinggradient problem by its identity for positive values,it can return negative values which might improvelearning, and it is better able to deal with a large num-ber of inputs. This activation function has shown tooutperform the ReLU in a convolutional neural net-work on the ImageNet dataset (Clevert et al., 2015).Another way to deal with large state spaces is to givethe agent a partial view of the environment. If welook at how humans play the game Tron we see thatthey mainly focus their attention around the currentposition of the agent. Therefore, vision grids (Shan-tia et al., 2011) can be useful. A vision grid can beseen as a snapshot of the environment from the agent’spoint of view. An example could be a three by threesquare around the ’head’ of the agent. By using a

vision grid of an appropriate size, the agent can ac-quire the most important information about the dy-namic state of the environment. Not only does thisdramatically decrease the number of unique states,it also reduces the amount of irrelevant information,which can speed up the learning process of the agent.

For most game research, the agent does not learnan explicit opponent model. In most cases, roll-outsor lookahead strategies are used that select opponent’sactions according to how the agent itself would se-lect actions or according to simple rules. Althoughroll-outs have shown to substantially increase per-formance in games such as Backgammon (Tesauroand Galperin, 1997), Go (Bouzy and Helmstetter,2004; Silver et al., 2016a), and Scrabble (Shep-pard, 2002), the disadvantage of this approach isthat particular weaknesses of the opponent cannotbe exploited, as no true model of how the oppo-nent selects actions is used. Opponent modelling hasbeen studied for imperfect-information games such aspoker (Ganzfried and Sandholm, 2011; Southey et al.,2005). Furthermore, in combination with Q-learning(Watkins and Dayan, 1992) it has proven to lead tobetter performances (He et al., 2016). However, asnoted by (Collins, 2007), the learned models are oftenenvironment specific and take considerable effort tolearn. As a solution to this problem, Mealing (Meal-ing, 2015) proposed a dynamic opponent modellingvariant, which uses sequence prediction to learn highrewarding strategies.

Contributions: In this paper, we developed dif-ferent state representations for the game of Tron. Weshow that with vision grids we can reduce the numberof unique states, which helps overcoming the chal-lenge of using reinforcement learning in problemswith large state spaces. We use the information fromthe vision grids as input for a multi-layer perceptronthat is trained using a reinforcement learning algo-rithm. Next to using the common sigmoid function inthe hidden layer of the MLP, we will also use the Eluactivation function and compare the results of bothactivation functions. The most important contributionof this paper is a novel opponent modelling technique.In our proposed algorithm, the agent learns the oppo-nent’s behaviour by predicting the next move of theopponent, observing the result, and adjusting the neu-ral network’s parameters based on this observation. Ifthe opponent is following a policy, the agent shouldbe able to learn this policy over time. This modelof the opponent is subsequently used in Monte-Carloroll-outs. In such a roll-out the game is simulated nsteps ahead in order to determine the expected valueof performing action a in state s and subsequently ex-ecuting the action that is associated with the highest

Q-value in each state. In these roll-outs, the learnedopponent model is used to select actions for the oppo-nent. The roll-outs are performed multiple times andthe results are averaged. We performed many differ-ent experiments to compare all methods (3 state rep-resentations, sigmoid / Elu, opponent model / no op-ponent model, different numbers of roll-outs). Fromthe results we can conclude that vision grids are effec-tive for faster training and better final performances.Furthermore, when we combine the vision grids withopponent modelling and roll-outs, the performancesare very good, reaching very high scores against 2 dif-ferent fixed opponents.

Outline: In the next section we explain the frame-work that was built to simulate the game and agent.Section 3 describes reinforcement learning combinedwith multi-layer perceptrons. In Section 4, we explainthe use of vision grids for Tron and the novel oppo-nent modelling technique. Then in section 5 we de-scribe the experiments and show their results. Finally,in section 6 we present our conclusions and possiblefuture work.

2 THE GAME OF TRON

Tron is an arcade video game released in 1982 andwas inspired by the Walt Disney motion picture Tron.In this game the player guides a light cycle in an arenaagainst an opponent. The player has to do this, whileavoiding the walls and the trails of light left behindby the opponent and player itself. See Figure 1 fora graphical depiction of the game. We developed aframework that implements the game of Tron as a se-quential decision problem where each agent selectsan action for each new game state. In this researchthe game is played with two players. The environ-ment is represented by a 10 by 10 grid in which theplayer starts at a random location in the top half ofthe grid and the opponent in the bottom half. Afterthat, both players decide on an action to carry out.The action space consists of the four directions theagents can move in. When the action selection phaseis completed, both actions get carried out and the newgame state is evaluated. In case both agents move tothe same location, the game ends in a draw. A playerloses if it moves to a location that is previously vis-ited by either itself or the opponent or when the agentwants to move to a location outside of the grid. Ifit happens that both agents lose at the same moment,the game counts as a draw. We estimate the numberof possible different states in the game to be of theorder 1020, which is similar to the game Othello thatconsists of a board of 7×7 cells.

Figure 1: Tron game environment with two agents, wheretheir heads or current location are in a darker colour.

For the opponent we used two different imple-mentations. Both fixed opponents always first checkwhether their intended move is possible and thereforewill never lose unless they are fully enclosed. Thefirst agent randomly chooses an action from the pos-sible actions, while the second agent always tries toexecute its previous action again. If this is not pos-sible, the opponent randomly chooses an action thatis possible and keeps repeating that action. This im-plies that this opponent only changes its action whenit encounters a wall, the opponent or its own tail. Thisstrategy is very effective in the game of Tron, becauseit is very efficient in the use of free space and it makesthe agent less likely to enclose itself. We tested theseopponents by letting them play against each other,and observed that the opponent employing the strat-egy of going straight as long as possible only loses25% of the games and 20% of the games end in adraw. From here on we will refer to the agent em-ploying the collision-avoiding random policy as therandom opponent and the other opponent will be re-ferred to as the semi-deterministic opponent.

3 REINFORCEMENT LEARNING

When the agent starts playing the game, it willrandomly choose actions from its action space. In or-der to improve its performance, the agent has to learnthe best action in a given game state and therefore wetrain the agent using reinforcement learning. Rein-forcement learning is a learning method in which theagent learns to select the optimal action based on in-game rewards. Whenever the agent loses a game it re-ceives a negative reward or punishment and if it wins

it will receive a positive reward. As it plays a largenumber of games, the agent should learn to select theaction that leads to the highest possible expected re-ward given the current game state. Reinforcementlearning techniques are often applied to environmentsthat can be modelled as a so-called Markov DecisionProcess (MDP) (Bellman, 1957). An MDP is definedby the following components:

• A finite set of states S, where st ∈ S is the state attime t.

• A finite set of actions A, where at ∈ A is the actionexecuted at time t.

• A transition function T (s,a,s′). This functionspecifies the probability of ending up in state s′

after executing action a in state s. Whenever theenvironment is fully deterministic, we can ignorethe transition probability. This is not the case inthe game of Tron, since it is played against an op-ponent for which we cannot perfectly anticipateits next move.

• A reward function R(s,a,s′), which specifies thereward for executing action a in state s and sub-sequently going to state s′. In our framework, thereward is equal to 1 for a win, 0 for a draw, and−1 in case the agent loses. Note that there are nointermediate rewards.

• A discount factor γ to discount future rewards,where 0≤ γ≤ 1.

To let the agent act in this MDP, we need a mappingfrom states to actions. This is given by the policy π(s),which returns for any state s the action to perform.The value of a policy is given by the sum of the dis-counted future rewards starting in a state s followingthe policy π:

V π(s) = E( ∞

∑t=0

γtrt |s0 = s,π

)(1)

Where rt is the reward received at time t. Thevalue function gives the expected outcome of thegame if both players select the actions given by theirpolicy. The value of a state is the long-term reward theagent will receive, while the reward of a state is onlyshort-term. Therefore, the agent has to choose thestate with the highest possible value. We can rewriteequation 1 in terms of the components of an MDP:

V π(s) = ∑s′

T (s,π(s),s′)(R(s,π(s),s′)+ γ V π(s′)) (2)

From equation 2 we see that the value of a par-ticular state s depends on the transition function, the

probability of going to state s′ times the reward ob-tained in this new state s′ and the value of the nextstate times the discount factor. In practice, the transi-tion function is often unknown and therefore we haveto use a reinforcement learning algorithm. Next, wewill look at the particular reinforcement learning al-gorithm employed in this research: Q-learning.

3.1 Q-learning

In this research we will be using Q-learning (Watkinsand Dayan, 1992), for which the value of a state be-comes a Q-value of a state-action pair, Q(s,a), whichgives the value of performing action a in state s. ThisQ-value for a given policy is given by equation 3.

Qπ(s,a) = E( ∞

∑t=0

γtrt |s0 = s,a0 = a,π

)(3)

The value of performing action a in state s is theexpected sum of the discounted future rewards fol-lowing policy π. The Q-value of an individual state-action pair is given by:

Q(st ,at) = E(rt)+ γ ∑st+1

T (st ,at ,st+1)maxa

Q(st+1,a) (4)

The Q-value of a state-action pair depends on theexpected reward and the highest Q-value in the nextstate. However, we do not know st+1 as it dependson the action of the opponent. Therefore, Q-learningkeeps a running average of the Q-value of a certainstate-action pair. The Q-learning algorithm is givenby:

Q̂(st ,at)← Q̂(st ,at)+α(rt +γmaxa

Q̂(st+1,a)−Q̂(st ,at))

Where 0 ≤ α ≤ 1 denotes the learning rate. As weencounter the same state-action pair multiple times,we update the Q-value to find the average Q-value ofthis state-action pair. This kind of learning is calledtemporal-difference learning (Sutton, 1988).

3.2 Function Approximator

Whenever the state space is relatively small, one caneasily store the Q-values for all state-action pairs ina lookup table. However, since the state space in thegame of Tron is far from small the use of a lookuptable is not feasible in this research. In addition, sincethere are many different states it could happen thateven after training, some states have not been encoun-tered before. When a state has not been encounteredbefore, action selection happens without information

from experience. Therefore, we use a neural networkas function approximator. To be more precise, wewill be using a multi-layer perceptron (MLP) to es-timate Q(s,a). This MLP will receive as input thecurrent game state s and its output will be the Q-valuefor each action given the input state. One could alsochoose to use four different MLPs, which output oneQ-value each (one for every action). We have testedboth set-ups and there was a small advantage of usinga single action neural network. The neural networkis trained using back-propagation (Rumelhart et al.,1988), where the target Q-value is calculated usingequation 5. As a simplification we set the learningrate α in this equation equal to 1, because the back-propagation algorithm of the neural network alreadycontains a learning rate, which controls for the speedof learning. The target Q-value for action at in statest is therefore:

Qtarget(st ,at)← rt + γmaxa

Q̂(st+1,a) (5)

This target is valid as long as the action taken in thestate-action pair does not result in the end of the game.Whenever that is the case, the target Q-value is equalto the first term of the right-hand side of equation 5,the reward received in the final game:

Qtarget(st ,at)← rt (6)

3.2.1 Activation function

In order to allow the neural network’s value functionapproximation to be non-linear we use an activationfunction in the hidden layer. One of the most oftenused activation functions is the sigmoid function:

O(a) =1

1+ e−a (7)

This function transforms the weighted sum of inputsfor a hidden unit to a value between 0 and 1. Re-cently, it has been proposed that the exponential linearunit performs better in some domains (Clevert et al.,2015). We will compare the performance of the agentusing the sigmoid function and the exponential linearunit (Elu) in the hidden layer. The exponential linearunit is given by the following equation:

O(a) =

{a if a≥ 0β(ea−1) if a < 0

(8)

We set β equal to 0.01 after some preliminary exper-iments. This function transforms negative activationsto a small negative value, while positive activationis unaffected. We will compare the performance ofthe agent with both activation functions to determinewhich performs better for learning to play Tron.

4 STATE REPRESENTATION ANDOPPONENT MODELS

In this section, we will first describe the differentstate representations that will be used by the agent.Then, we will describe how a model of the opponentcan be learned and used for selecting actions usingroll-outs.

4.1 Vision Grids

The first state representation used as input to the MLPis the entire game grid (10× 10). This translates to100 input nodes, which have a value of one wheneverit is visited by one of the agents and zero otherwise.Another 10 by 10 grid is fed into the network, but thistime only the current position of the agent has a valueof one. This input allows the agent to know its owncurrent position within the environment. The secondtype of state representation and input to the MLP thatwill be tested are vision grids. A vision grid can beseen as a snapshot of the environment taken from thepoint of view of the agent. This translates to a squaregrid with an uneven dimension centred around thehead of the agent. To receive the most relevant in-formation from the state of the game, three differenttypes of vision grids are combined (in all these gridsthe standard value is zero):

• The player grid contains information about the lo-cations visited by the agent itself: whenever theagent has visited the location it will have a valueof one instead of zero.

• The opponent grid contains information about thelocations visited by the opponent: if the opponentis in the ’visual field’ of the agent these locationsare encoded with a one.

• The wall grid represents the walls: whenever theagent is close to a wall the wall locations will geta value of one.

An example game state and the three associated visiongrids can be found in Figure 2. We will test visiongrids with a size of three by three (small vision grids)and five by five (large vision grids) and compare theperformance of the agents with these small and largevision grids to an agent that receives all informationfrom the game state.

4.2 Opponent Modelling

This paper introduces an opponent modelling tech-nique with which a model of the opponent is learnedfrom observations. This model can subsequently be

Figure 2: Vision grid example with the current location ofboth players in a darker color.

used in planning algorithms such as Monte-Carlo roll-outs. Planning is one of the key challenges of arti-ficial intelligence (Silver et al., 2016b). Many op-ponent modelling techniques focus on probabilisticmodels and imperfect-information games (Southeyet al., 2005; Ganzfried and Sandholm, 2011), whichmakes them very problem specific. Our novel oppo-nent modelling technique predicts the opponent’s ac-tion using the multi-layer perceptron and learns fromthe observed actions using the back-propagation al-gorithm (Rumelhart et al., 1988). Over time the agentlearns to model which action the opponent will likelyselect when it is in a specific state. The model is aprobability distribution of the opponent’s next movegiven the state representation. Because of its simplic-ity, this technique can be generalised to any setting inwhich the opponent’s actions are observable. Anotherbenefit of this technique is that the agent simultane-ously learns a policy and a model of the opponent,which means that no extra phase is needed for thelearning process. In addition, the opponent modellinghappens with the same neural network that calculatesthe Q-values for the agent. This might allow the agentto learn hidden features regarding the opponent’s be-haviour, which could further increase performance.

For modelling the opponent, four output nodes areappended to the network, which represent the prob-ability distribution over the opponent’s possible ac-tions. The output can be interpreted as a probabilitydistribution, because we use a softmax layer over thefour appended output nodes. The softmax functiontransforms the vector o containing the output mod-elling values for the next K = 4 possible actions ofthe opponent to values in the range [0,1] that add up

to one:

P(st ,oi) =eoi

∑Kk=1 eok

(9)

This transforms the output values to the probability ofthe opponent conducting action oi in state st . In addi-tion to these four extra output nodes, the state repre-sentation for the neural network changes when mod-elling the opponent. In the case of the standard inputrepresentation by the full grid, an extra grid is addedwhere the head of the opponent has a value of one.In the case of vision grids, an extra 4 vision grids areconstructed. The first three are the same as before, butthen from the opponent’s point of view. In addition,an opponent-head grid is constructed which containsinformation about the current location of the head ofthe opponent. If the opponent’s head is in the agent’svisual field, this location will be encoded with a one.

In order to learn the opponent’s policy, the net-work is trained using back-propagation where the tar-get vector is one for the action taken by the opponentand zero for all other actions. If the opponent is fol-lowing a deterministic policy, this allows the agent toperfectly forecast the opponent’s next move after suf-ficient training. Although in reality a policy is seldomentirely deterministic, players use certain rules to playa game. Therefore, our semi-deterministic agent is aperfect example to test opponent modelling against.

Once the agent has learned the opponent’s policy,its prediction about the opponent’s next move will beused in so-called Monte Carlo roll-outs (Tesauro andGalperin, 1997). Such a roll-out is used to estimatethe value Qsim(s,a), the expected Q-value of perform-ing action a in state s and subsequently performingthe action suggested by the current policy for n− 1steps. The opponent’s actions are selected on the ba-sis of the agent’s model of the agent. If one roll-outis used the opponent’s move with the highest prob-ability is carried out. When more than one roll-outis performed, the opponent’s action is selected basedon the probability distribution. At every action selec-tion moment in the game m roll-outs of length n areperformed and the results are averaged. The expectedQ-value is equal to the reward obtained in the simu-lated game (1 for winning, 0 for a draw, and -1 forlosing) times the discount factor to the power of thenumber of moves conducted in this roll-out i:

Q̂sim(st ,at) = γirt+i (10)

If the game is not finished before reaching the roll-out horizon the simulated Q-value is equal to the dis-counted Q-value of the last action performed:

Q̂sim(st ,at) = γnQ̂(st+n,at+n) (11)

See algorithm 1 for a detailed description.

This kind of roll-out is also called a truncated roll-out as the game is not necessarily played to its conclu-sion (Tesauro and Galperin, 1997). In order to deter-mine the importance of the number of roll-outs m, wewill compare the performance of the agent with oneroll-out and ten roll-outs.

Algorithm 1 Monte-Carlo Roll-out with OpponentModelInput: Current game state st , starting action at , hori-

zon N, number of roll-outs MOutput: Average reward of performing action at at

time t and subsequently following the policy overM roll-outsfor m = 1,2, ..M do

i = 0Perform starting action atif M = 1 then

ot ← argmaxoP(st ,o)else if M > 1 then

ot ← sample P(st ,o)end ifPerform opponent action otDetermine reward rt+irolloutRewardm = γrt+iwhile not game over do

i = i+1at+i← argmaxaQ(st+i,a)Perform action at+iif M = 1 then

ot+i← argmaxoP(st+i,o)else if M > 1 then

ot+i← sample P(st+i,o)end ifPerform opponent action ot+iDetermine reward rt+iif Game over then

rolloutRewardm = γ irt+iend ifif not Game over and i = N then

game over← TruerolloutRewardm = γ NQ(sN ,aN)

end ifend whilerewardSum = rewardSum+ rolloutRewardmm = m+1

end forreturn rewardSum/M

5 EXPERIMENTS AND RESULTS

To compare the different state representations, theuse of different activation functions in the MLP andthe usefulness of the opponent modelling techniqueand roll-outs, many different experiments have beenconducted. In all experiments the agent is trained for1.5 million games against two different opponents,which lasts for around one day for one simulation.After that, 10,000 test games are played. In these testgames, the agent makes no explorative actions. In or-der to obtain meaningful results, all experiments areconducted ten times and the results are averaged. Theperformance is measured as the number of games wonplus 0.5 times the number of games tied. This numberis divided by the number of games to get a score be-tween 0 and 1. This is a common performance scorefor games.

With the use of different game state representa-tions as input to the MLP, the number of input nodesvaries. The number of hidden nodes varies from 100to 300 and is chosen such that the number of hiddennodes is at least equal but preferably larger than thenumber of input nodes. This was found to be opti-mal in the trade-off between representation power andcomplexity. Also, the use of several hidden layers hasbeen tested, but this did not significantly improve per-formance and we therefore chose to use only one hid-den layer.

5.1 State Representations

For setting all hyper-parameters of the different al-gorithms, we ran many preliminary experiments. Inthe first part of this research, without opponent mod-elling, the number of input nodes for the full gridis equal to 200 and the number of hidden nodes is300. When vision grids are used, the number of inputnodes decreases to 27 and 75 for vision grids with adimension of three by three and five by five respec-tively. The number of hidden nodes when using smallvision grids is equal to 100, while for large visiongrids 200 hidden nodes are used. In all these casesthe number of output nodes is four.

During training, exploration decreases linearlyfrom 10% to 0% over the first 750,000 games af-ter which the agent always performs the action withthe highest Q-value. This exploration strategy hasbeen selected after performing preliminary experi-ments with several different exploration strategies.There is one exception to this exploration strategy.When large vision grids are used against the semi-deterministic opponent, exploration decreases from10% to 0% over the 1.5 million training games. In

this condition the exploration policy is different, be-cause the standard exploration settings led to unstableresults. The learning rate α and discount factor γ are0.005 and 0.95 respectively and are equal across allconditions except for one. These values have been se-lected after conducting preliminary experiments withdifferent learning rates and discount factors. Whenthe full grid is used as state representation and theagent plays against the random opponent, the learn-ing rate α is set to 0.001. The learning rate is loweredfor this condition, because a learning rate of 0.005 ledto unstable results. All weights and biases of the net-work are randomly initialised between −0.5 and 0.5.

In Figure 3, 4, and 5 the performance score duringtraining is displayed for the three different state repre-sentations. In every figure we see the performance ofthe agent against the random and semi-deterministicopponent with both the sigmoid and Elu activationfunction. For every 10,000 games played we plot theperformance score, which ranges from 0 to 1. We seethat for all three state representations performance in-creases strongly as long as some explorative actionsare made. When exploration stops at 750,000 games,performance stays approximately the same, except forthe full grid state representation with the Elu acti-vation function against the semi-deterministic oppo-nent. We have also experimented with a constantexploration of 10% and with exploration graduallyfalling to 0% over all training games, however thisdid not lead to better performances. After training theagent, we tested the agent’s performance on 10,000test games. The results are displayed in Table 1 and2. These results are gathered from ten independenttrials, for which also the standard error is reported.

0.25

0.50

0.75

1.00

0 50 100 150 200 250Games played

% c

orr

ect pre

dic

ted

Small_VGLarge_VGFull_Grid

Prediction against semi−deterministic opponent

0.25

0.50

0.75

1.00

0 50 100 150 200 250Games played

% c

orr

ect pre

dic

ted



0.25

0.50

0.75

1.00

0 50 100 150 200 250Games played

% c

orr

ect pre

dic

ted



0.25

0.50

0.75

1.00

0 50 100 150 200 250Games played

% c

orr

ect pre

dic

ted



x100

0.25

0.50

0.75

1.00

0 50 100 150 200 250Games played

% c

orr

ect pre

dic

ted



x100

0.25

0.50

0.75

1.00

0 50 100 150 200 250Games played

% c

orr

ect pre

dic

ted



x100

0.25

0.50

0.75

1.00

0 50 100 150 200 250Games played

% c

orr

ect pre

dic

ted



x100

0.25

0.50

0.75

1.00

0 50 100 150 200 250Games played

% c

orr

ect pre

dic

ted



x100

0.25

0.50

0.75

1.00

0 50 100 150 200 250Games played

% c

orr

ect pre

dic

ted



x100

0.25

0.50

0.75

1.00

0 50 100 150 200 250Games played

% c

orr

ect pre

dic

ted



x100

0.25

0.50

0.75

1.00

0 50 100 150 200 250Games played

% c

orr

ect pre

dic

ted



x104

0.25

0.50

0.75

1.00

0 50 100 150 200 250Games played

Perf

orm

ance s

core


Training performance large vision grids opponent modelling

x104

0.25

0.50

0.75

1.00

0 50 100 150 200 250Games played

Perf

orm

ance s

core

Random_SigmoidRandom_EluDeterministic_SigmoidDeterministic_Elu


0.25

0.50

0.75

1.00

0 50 100 150Games played

Perf

orm

ance s

core



0.25

0.50

0.75

1.00


Perf

orm

ance s

core



0.25

0.50

0.75

1.00


Perf

orm

ance s

core



x104

0.25

0.50

0.75

1.00


Perf

orm

ance s

core



x104

0.25

0.50

0.75

1.00


Perf

orm

ance s

core



x104

0.25

0.50

0.75

1.00


Perf

orm

ance s

core



x104

0.25

0.50

0.75

1.00


Perf

orm

ance s

core



x104

0.25

0.50

0.75

1.00


Perf

orm

ance s

core


Training performance large vision grids

x104

0.25

0.50

0.75

1.00


Perf

orm

ance s

core



x104

0.25

0.50

0.75

1.00


Perf

orm

ance s

core


Training performance small vision grids opponent modelling

x104

0.25

0.50

0.75

1.00


Perf

orm

ance s

core


Training performance small vision grids

Figure 3: Performance score for small vision grids as staterepresentation over 1.5 million training games. Note thatafter 750,000 games the agent stops performing explorationmoves.

From Table 1 and 2 we can conclude that with the

0.25

0.50

0.75

1.00

0 50 100 150 200 250Games played

% c

orr

ect pre

dic

ted



0.25

0.50

0.75

1.00

0 50 100 150 200 250Games played

% c

orr

ect pre

dic

ted



0.25

0.50

0.75

1.00

0 50 100 150 200 250Games played

% c

orr

ect pre

dic

ted



0.25

0.50

0.75

1.00

0 50 100 150 200 250Games played

% c

orr

ect pre

dic

ted



x100

0.25

0.50

0.75

1.00

0 50 100 150 200 250Games played

% c

orr

ect pre

dic

ted



x100

0.25

0.50

0.75

1.00

0 50 100 150 200 250Games played

% c

orr

ect pre

dic

ted



x100

0.25

0.50

0.75

1.00

0 50 100 150 200 250Games played

% c

orr

ect pre

dic

ted



x100

0.25

0.50

0.75

1.00

0 50 100 150 200 250Games played

% c

orr

ect pre

dic

ted



x100

0.25

0.50

0.75

1.00

0 50 100 150 200 250Games played

% c

orr

ect pre

dic

ted



x100

0.25

0.50

0.75

1.00

0 50 100 150 200 250Games played

% c

orr

ect pre

dic

ted



x100

0.25

0.50

0.75

1.00

0 50 100 150 200 250Games played

% c

orr

ect pre

dic

ted



x104

0.25

0.50

0.75

1.00

0 50 100 150 200 250Games played

Perf

orm

ance s

core



x104

0.25

0.50

0.75

1.00

0 50 100 150 200 250Games played

Perf

orm

ance s

core



0.25

0.50

0.75

1.00


Perf

orm

ance s

core



0.25

0.50

0.75

1.00


Perf

orm

ance s

core



0.25

0.50

0.75

1.00


Perf

orm

ance s

core



x104

0.25

0.50

0.75

1.00


Perf

orm

ance s

core



x104

0.25

0.50

0.75

1.00


Perf

orm

ance s

core



x104

0.25

0.50

0.75

1.00


Perf

orm

ance s

core



x104

0.25

0.50

0.75

1.00


Perf

orm

ance s

core



x104

0.25

0.50

0.75

1.00


Perf

orm

ance s

core



Figure 4: Performance score for large vision grids as staterepresentation over 1.5 million training games.

x104

0.25

0.50

0.75

1.00


Perf

orm

ance s

core


Training performance full grid

Figure 5: Performance score for the full grid as state repre-sentation over 1.5 million training games.

sigmoid activation function, the use of vision gridsincreases the performance of the agent when com-pared to using the full grid. Against the randomopponent, the small vision grid with the Elu activa-tion function performs best. Striking is the perfor-mance of the agent using the full grid against thesemi-deterministic opponent using the Elu activationfunction, which can be found in Table 2. The agentreaches a performance score of 0.72 in this case,which is the highest performance score obtained. Thisfinding might be caused by the fact that the agent canactually profit from the semi-deterministic policy theopponent is following, which it detects when the fullgrid is used as state representation because it providesmore information about the past moves of the oppo-nent. Against both opponents, the use of the Elu ac-tivation function with the full-grid representation per-forms significantly better than the sigmoid function.

Table 1: Performance score and standard errors against therandom opponent.

State representation Sigmoid EluSmall vision grids 0.56 (0.037) 0.62 (0.019)Large vision grids 0.54 (0.036) 0.53 (0.022)Full grid 0.49 (0.017) 0.58 (0.025)

Table 2: Performance score and standard errors against thedeterministic opponent.


5.2 Opponent Modelling withoutMonte-Carlo Roll-outs

Opponent modelling requires information not onlyabout the agent’s current position, but also about theopponent’s position. As explained in section 4, thisincreases the number of vision grids used and there-fore affects the number of inputs and best found num-ber of hidden nodes of the MLP. In the basic casewhere the full grid is used, the number of input nodesincreases to 300 and the number of hidden nodes stays300. For the large vision grids the number of in-put nodes increases to 175 and the number of hiddennodes increases to 300. Finally, when using the smallvision grids the number of input nodes becomes 63and the number of hidden nodes increases to 200. Inall networks with opponent modelling the number ofoutput nodes is eight (the 4 Q-values for the differ-ent actions and the 4 outputs to model the opponent’sprobability of selecting that action).

For these experiments preliminary experimentsshowed that decreasing the exploration from 10% to0% over the first 750,000 games led to the best resultsin most cases. However, with large vision grids andthe sigmoid activation function against the randomopponent, exploration decreases from 10% to 0% over1 million training games. The learning rate α and dis-count factor γ are for the opponent modelling experi-ments also 0.005 and 0.95 respectively. These valueshave been found to lead to the best results, howeverthere are some exceptions. When the full grid is usedas state representation in combination with the sig-moid activation function, the learning rate is loweredto 0.001. This lower learning rate is also used withsmall vision grids and the sigmoid activation functionagainst the random opponent. Finally, when large vi-sion grids are used in combination with the sigmoidactivation function against the random opponent, alearning rate of 0.0025 is used. Similar to the previ-ous experiments, all weights and biases of the neural

networks are randomly initialised between −0.5 and0.5.

For the opponent modelling experiments wetrained the agent against both opponents and withboth activation functions. We note that in this ex-periment, no roll-outs are performed. Therefore anypossible performance improvement is caused by theadditional state information or the use of the addi-tional outputs that learn to model the opponent. Thelatter could be helpful to learn better features in thehidden layer. Figures 6, 7, and 8 show the trainingperformance for the three different state representa-tions. Table 3 and 4 show the performance during the10,000 test games after training the agent with oppo-nent modelling.

0.25

0.50

0.75

1.00

0 50 100 150 200 250Games played

% c

orr

ect pre

dic

ted



0.25

0.50

0.75

1.00

0 50 100 150 200 250Games played

% c

orr

ect pre

dic

ted



0.25

0.50

0.75

1.00

0 50 100 150 200 250Games played

% c

orr

ect pre

dic

ted



0.25

0.50

0.75

1.00

0 50 100 150 200 250Games played

% c

orr

ect pre

dic

ted



x100

0.25

0.50

0.75

1.00

0 50 100 150 200 250Games played

% c

orr

ect pre

dic

ted



x100

0.25

0.50

0.75

1.00

0 50 100 150 200 250Games played

% c

orr

ect pre

dic

ted



x100

0.25

0.50

0.75

1.00

0 50 100 150 200 250Games played

% c

orr

ect pre

dic

ted



x100

0.25

0.50

0.75

1.00

0 50 100 150 200 250Games played

% c

orr

ect pre

dic

ted



x100

0.25

0.50

0.75

1.00

0 50 100 150 200 250Games played

% c

orr

ect pre

dic

ted



x100

0.25

0.50

0.75

1.00

0 50 100 150 200 250Games played

% c

orr

ect pre

dic

ted



x100

0.25

0.50

0.75

1.00

0 50 100 150 200 250Games played

% c

orr

ect pre

dic

ted



x104

0.25

0.50

0.75

1.00

0 50 100 150 200 250Games played

Perf

orm

ance s

core



x104

0.25

0.50

0.75

1.00

0 50 100 150 200 250Games played

Perf

orm

ance s

core



0.25

0.50

0.75

1.00


Perf

orm

ance s

core



0.25

0.50

0.75

1.00


Perf

orm

ance s

core



0.25

0.50

0.75

1.00


Perf

orm

ance s

core



x104

0.25

0.50

0.75

1.00


Perf

orm

ance s

core



x104

0.25

0.50

0.75

1.00


Perf

orm

ance s

core



x104

0.25

0.50

0.75

1.00


Perf

orm

ance s

core



x104

0.25

0.50

0.75

1.00


Perf

orm

ance s

core



x104

0.25

0.50

0.75

1.00


Perf

orm

ance s

core



x104

0.25

0.50

0.75

1.00


Perf

orm

ance s

core



x104

0.25

0.50

0.75

1.00


Perf

orm

ance s

core



Figure 6: Performance score for small vision grids as staterepresentation over 1.5 million training games with oppo-nent modelling but without rollouts.

0.25

0.50

0.75

1.00

0 50 100 150 200 250Games played

% c

orr

ect pre

dic

ted



0.25

0.50

0.75

1.00

0 50 100 150 200 250Games played

% c

orr

ect pre

dic

ted



0.25

0.50

0.75

1.00

0 50 100 150 200 250Games played

% c

orr

ect pre

dic

ted



0.25

0.50

0.75

1.00

0 50 100 150 200 250Games played

% c

orr

ect pre

dic

ted



x100

0.25

0.50

0.75

1.00

0 50 100 150 200 250Games played

% c

orr

ect pre

dic

ted



x100

0.25

0.50

0.75

1.00

0 50 100 150 200 250Games played

% c

orr

ect pre

dic

ted



x100

0.25

0.50

0.75

1.00

0 50 100 150 200 250Games played

% c

orr

ect pre

dic

ted



x100

0.25

0.50

0.75

1.00

0 50 100 150 200 250Games played

% c

orr

ect pre

dic

ted



x100

0.25

0.50

0.75

1.00

0 50 100 150 200 250Games played

% c

orr

ect pre

dic

ted



x100

0.25

0.50

0.75

1.00

0 50 100 150 200 250Games played

% c

orr

ect pre

dic

ted



x100

0.25

0.50

0.75

1.00

0 50 100 150 200 250Games played

% c

orr

ect pre

dic

ted



x104

0.25

0.50

0.75

1.00

0 50 100 150 200 250Games played

Perf

orm

ance s

core



x104

0.25

0.50

0.75

1.00

0 50 100 150 200 250Games played

Perf

orm

ance s

core



0.25

0.50

0.75

1.00


Perf

orm

ance s

core



0.25

0.50

0.75

1.00


Perf

orm

ance s

core



0.25

0.50

0.75

1.00


Perf

orm

ance s

core



x104

0.25

0.50

0.75

1.00


Perf

orm

ance s

core



x104

0.25

0.50

0.75

1.00


Perf

orm

ance s

core



x104

0.25

0.50

0.75

1.00


Perf

orm

ance s

core



x104

0.25

0.50

0.75

1.00


Perf

orm

ance s

core



x104

0.25

0.50

0.75

1.00


Perf

orm

ance s

core



x104

0.25

0.50

0.75

1.00


Perf

orm

ance s

core



Figure 7: Performance score for large vision grids as staterepresentation over 1.5 million training games with oppo-nent modelling but without rollouts.

When we compare these results with the resultsobtained without opponent modelling, we observeseveral differences. First of all, when the full grid

0.25

0.50

0.75

1.00

0 50 100 150 200 250Games played

% c

orr

ect pre

dic

ted



0.25

0.50

0.75

1.00

0 50 100 150 200 250Games played

% c

orr

ect pre

dic

ted



0.25

0.50

0.75

1.00

0 50 100 150 200 250Games played

% c

orr

ect pre

dic

ted



0.25

0.50

0.75

1.00

0 50 100 150 200 250Games played

% c

orr

ect pre

dic

ted



x100

0.25

0.50

0.75

1.00

0 50 100 150 200 250Games played

% c

orr

ect pre

dic

ted



x100

0.25

0.50

0.75

1.00

0 50 100 150 200 250Games played

% c

orr

ect pre

dic

ted



x100

0.25

0.50

0.75

1.00

0 50 100 150 200 250Games played

% c

orr

ect pre

dic

ted



x100

0.25

0.50

0.75

1.00

0 50 100 150 200 250Games played

% c

orr

ect pre

dic

ted



x100

0.25

0.50

0.75

1.00

0 50 100 150 200 250Games played

% c

orr

ect pre

dic

ted



x100

0.25

0.50

0.75

1.00

0 50 100 150 200 250Games played

% c

orr

ect pre

dic

ted



x100

0.25

0.50

0.75

1.00

0 50 100 150 200 250Games played

% c

orr

ect pre

dic

ted



x104

0.25

0.50

0.75

1.00

0 50 100 150 200 250Games played

Perf

orm

ance s

core



x104

0.25

0.50

0.75

1.00

0 50 100 150 200 250Games played

Perf

orm

ance s

core



0.25

0.50

0.75

1.00


Perf

orm

ance s

core



0.25

0.50

0.75

1.00


Perf

orm

ance s

core



0.25

0.50

0.75

1.00


Perf

orm

ance s

core



x104

0.25

0.50

0.75

1.00


Perf

orm

ance s

core



x104

0.25

0.50

0.75

1.00


Perf

orm

ance s

core



x104

0.25

0.50

0.75

1.00


Perf

orm

ance s

core



x104

0.25

0.50

0.75

1.00


Perf

orm

ance s

core



x104

0.25

0.50

0.75

1.00


Perf

orm

ance s

core



x104

0.25

0.50

0.75

1.00


Perf

orm

ance s

core



x104

0.25

0.50

0.75

1.00


Perf

orm

ance s

core



x104

0.25

0.50

0.75

1.00


Perf

orm

ance s

core


Training performance small vision grids

x104

0.25

0.50

0.75

1.00


Perf

orm

ance s

core


Training performance full grid opponent modelling

Figure 8: Performance score for the full grid as state rep-resentation over 1.5 million training games with opponentmodelling but without rollouts.

Table 3: Performance score and standard errors with oppo-nent modelling without rollouts against the random oppo-nent.


Table 4: Performance score and standard errors with op-ponent modelling without rollouts against the deterministicopponent.


is used as state representation the performance dropswith opponent modelling. The opposite holds for bothsmall and large vision grids, where performance in-creases with opponent modelling. The most signif-icant increase in performance appears with large vi-sion grids against the semi-deterministic opponent,where a performance score of 0.90 is obtained.

In order to test whether this increase in perfor-mance with vision grids arises due to the opponentmodelling technique, we conducted another experi-ment. In this experiment the set-up is exactly the sameas in the opponent modelling experiment, but now theagent does not learn to model the opponent. The av-erage results of ten test games with the Elu activationfunction can be found in Table 5.

From Table 5 we can conclude that the agent’sincrease in performance with opponent modelling isdue to the extra vision grids generated. This is thecase since there is not much difference in performancewith and without opponent modelling when the extravision grids for opponent modelling are also fed into

Table 5: Performance score and standard errors with the Eluactivation function and opponent vision grids, but withoutopponent modelling.

State representation Random DeterministicSmall vision grids 0.69 (0.008) 0.69 (0.003)Large vision grids 0.82 (0.009) 0.89 (0.003)

x100

0.25

0.50

0.75

1.00

0 50 100 150 200 250Games played

% c

orr

ect pre

dic

ted


Prediction against random opponent

Figure 9: Percentage of moves correctly predicted againstthe random opponent.

the MLP.

5.3 Opponent Modelling withMonte-Carlo Roll-outs

After the agent is trained using opponent modelling,we applied roll-outs in order to try to increase the per-formance of the agent even further. The number ofactions in a roll-out is set to ten, as this gives theagent the opportunity to look far enough in the fu-ture to choose the optimal action. Further increas-ing the number of actions of a roll-out will often notbenefit the agent, as the average amount of actions ina game is twenty. We compare the performance ofthe agent with one and ten roll-outs. Since the oppo-nent’s actions within the roll-outs are determined bythe learned probability distribution, we plot the pre-diction accuracy of the agent against both agents inFigure 9 and 10. These results are for the Elu acti-vation function, which learns slightly faster than thesigmoid activation function. We observe that within25,000 games the agent correctly predicts 50% ofthe random opponent’s moves and 90% of the semi-deterministic opponent’s moves when we use visiongrids. When the full grid is used, this accuracy isequal to 40% and 80% respectively.

The performance score and standard error usingone roll-out with a horizon of ten steps during 10,000test games can be found in Table 6 and 7. TheMonte-Carlo roll-outs further increase the agent’s per-

0.25

0.50

0.75

1.00

0 50 100 150 200 250Games played

% c

orr

ect pre

dic

ted



0.25

0.50

0.75

1.00

0 50 100 150 200 250Games played

% c

orr

ect pre

dic

ted



0.25

0.50

0.75

1.00

0 50 100 150 200 250Games played

% c

orr

ect pre

dic

ted



0.25

0.50

0.75

1.00

0 50 100 150 200 250Games played

% c

orr

ect pre

dic

ted



x100

0.25

0.50

0.75

1.00

0 50 100 150 200 250Games played

% c

orr

ect pre

dic

ted



x100

0.25

0.50

0.75

1.00

0 50 100 150 200 250Games played

% c

orr

ect pre

dic

ted



x100

0.25

0.50

0.75

1.00

0 50 100 150 200 250Games played

% c

orr

ect pre

dic

ted



x100

0.25

0.50

0.75

1.00

0 50 100 150 200 250Games played

% c

orr

ect pre

dic

ted



x100

0.25

0.50

0.75

1.00

0 50 100 150 200 250Games played

% c

orr

ect pre

dic

ted



x100

0.25

0.50

0.75

1.00

0 50 100 150 200 250Games played

% c

orr

ect pre

dic

ted



Figure 10: Percentage of moves correctly predicted againstthe semi-deterministic opponent.

Table 6: Performance score and standard errors with oneroll-out and a depth of ten actions against the random oppo-nent.


formance in most cases. However, performance de-creases when large vision grids are used against therandom opponent. In all other cases, performanceconsiderably increases with the use of roll-outs. Thehighest performance score obtained is 0.98, whichis obtained with large vision grids and the Elu acti-vation function against the semi-deterministic oppo-nent. This shows that by applying opponent mod-elling and Monte-Carlo roll-outs, performance can beincreased to very high levels. From Table 7 we ob-serve that also with small vision grids, performancescores of over 0.90 are obtained against the semi-deterministic opponent. If we compare the resultswith vision grids and the full grid as state repre-sentation, we observe that vision grids significantlyincrease performance with opponent modelling andMonte-Carlo roll-outs. This increase is most evi-dent against the semi-deterministic opponent. Whenthe opponent employs the collision-avoiding randompolicy, small vision grids lead to the highest perfor-mance. When comparing Table 3 and 6, we see that

Table 7: Performance score and standard errors with oneroll-out and a depth of ten actions against the deterministicopponent.


Table 8: Performance score and standard errors with tenroll-outs and a depth of ten actions against the random op-ponent.


Table 9: Performance score and standard errors with tenroll-outs and a depth of ten actions against the deterministicopponent.


roll-outs also increase performance against this ran-dom opponent. This shows that although the policy ofthe opponent is far from deterministic, opponent mod-elling still significantly increases performance from0.67 to 0.83 with the sigmoid activation function andfrom 0.67 to 0.84 with the Elu activation functionwhen small vision grids are used as state represen-tation.

After applying one roll-out for each action at anystate, we also tested whether increasing the numberof roll-outs to ten would affect the agent’s perfor-mance. The results are displayed in Table 8 and 9.When comparing the agent’s performance with oneand ten roll-outs, we detect one noteworthy differ-ence. The agent’s performance against the randomopponent considerably increases when we use ten in-stead of one roll-out. This increase is especially largewhen we use large vision grids. Against the semi-deterministic opponent, increasing the number of roll-outs has no noticeable effect. This is because theagent predicts the semi-deterministic opponent cor-rectly in over 90% of the cases, causing the advantageof action sampling and multiple roll-outs to be absent.

In order to determine whether it is the model ofthe opponent that allows the agent to attain very highperformance levels using roll-outs, we also investi-gated the performance of the agent when the movesof the opponent in the roll-outs are determined ran-domly rather than from the learned model of the op-ponent. The results can be found in Table 10 and 11.From the results, we can conclude that it is indeedthe model of the opponent that increases the agent’sperformance when roll-outs are used, because a badopponent model results in much worse performancesin combination with roll-outs.

Table 10: Performance score and standard errors with oneroll-outs and a depth of ten actions against the random op-ponent and without using the learned model of the oppo-nent.


Table 11: Performance score and standard errors with oneroll-outs and a depth of ten actions against the deterministicopponent and without using the learned model of the oppo-nent.


6 CONCLUSION

This paper has shown that vision grids can be usedto overcome the problems associated with applyingreinforcement learning in problems with large statespaces. Using vision grids as state representation notonly increased the learning speed, it also increased theagent’s performance in most cases. From all state rep-resentations, the large vision grids obtain the best per-formances. They reduce the number of different pos-sible inputs compared to full grids, but contain moreinformation that the small vision grids.

This paper also confirms the benefits of the Elu ac-tivation function over the sigmoid activation function.Against the semi-deterministic opponent, the Elu ac-tivation function increased the agent’s performancein eleven of the twelve conducted experiments andagainst the random opponent performance increasedin eight of the twelve experiments. From this it seemsthat the Elu activation function performs especiallymuch better than the sigmoid function in case of lessnoisy updates due to the more deterministic opponent.

Finally, the introduced opponent modelling tech-nique allows the agent to concurrently learn andmodel the opponent and in combination with plan-ning algorithms, such as Monte-Carlo roll-outs, it canbe used to significantly increase performance againsttwo widely different opponents.

An interesting possibility for future research is totest whether the use of vision grids causes the agentto form a better generalised policy. We believe thatthis is the case, since vision grids are less depen-dent on the dimensions of the environment and pos-sible obstacles the agent might encounter. Therefore,the learned policy will better generalise to other envi-ronments. Finally, the proposed opponent modelling

technique is widely applicable and we are interestedto see whether it also proves useful in other problems.

REFERENCES

Bellman, R. (1957). A markovian decision process. IndianaUniv. Math. J., 6 No. 4:679–684.

Bom, L., Henken, R., and Wiering, M. (2013). Reinforce-ment learning to train Ms. Pac-Man using higher-orderaction-relative inputs. In 2013 IEEE Symposium onAdaptive Dynamic Programming and ReinforcementLearning (ADPRL), pages 156–163.

Bouzy, B. and Helmstetter, B. (2004). Monte-Carlo GoDevelopments, pages 159–174. Springer US, Boston,MA.

Clevert, D., Unterthiner, T., and Hochreiter, S. (2015). Fastand accurate deep network learning by exponentiallinear units (elus). CoRR, abs/1511.07289.

Collins, B. (2007). Combining opponent modeling andmodel-based reinforcement learning in a two-playercompetitive game. Master’s thesis, School of Infor-matics, University of Edinburgh.

Ganzfried, S. and Sandholm, T. (2011). Game theory-basedopponent modeling in large imperfect-informationgames. In the 10th International Conference onAutonomous Agents and Multiagent Systems-Volume2, pages 533–540. International Foundation for Au-tonomous Agents and Multiagent Systems.

He, H., Boyd-Graber, J. L., Kwok, K., and III, H. D. (2016).Opponent modeling in deep reinforcement learning.CoRR, abs/1609.05559.

Mealing, R. A. (2015). Dynamic opponent modelling intwo-player games. PhD thesis, University of Manch-ester, UK.

Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A.,Antonoglou, I., Wierstra, D., and Riedmiller, M.(2013). Playing atari with deep reinforcement learn-ing. arXiv preprint arXiv:1312.5602.

Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1988).Neurocomputing: Foundations of Research, chapterLearning Internal Representations by Error Propaga-tion, pages 673–695. MIT Press, Cambridge, MA,USA.

Shantia, A., Begue, E., and Wiering, M. (2011). Con-nectionist reinforcement learning for intelligent unitmicro management in Starcraft. In Neural Networks(IJCNN), The 2011 International Joint Conference on,pages 1794–1801. IEEE.

Sheppard, B. (2002). World-championship-caliber scrab-ble. Artificial Intelligence, 134(12):241 – 275.

Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L.,van den Driessche, G., Schrittwieser, J., Antonoglou,I., Panneershelvam, V., Lanctot, M., Dieleman, S.,Grewe, D., Nham, J., Kalchbrenner, N., Sutskever, I.,Lillicrap, T., Leach, M., Kavukcuoglu, K., Graepel,T., and Hassabis, D. (2016a). Mastering the game ofGo with deep neural networks and tree search. Nature,529(7587):484–489.

Silver, D., van Hasselt, H., Hessel, M., Schaul, T., Guez,A., Harley, T., Dulac-Arnold, G., Reichert, D., Rabi-nowitz, N., Barreto, A., and Degris, T. (2016b). Thepredictron: End-to-end learning and planning. CoRR,abs/1612.08810.

Southey, F., Bowling, M., Larson, B., Piccione, C., Burch,N., Billings, D., and Rayner, C. (2005). Bayes bluff:Opponent modelling in poker. In Proceedings of the21st Annual Conference on Uncertainty in ArtificialIntelligence (UAI, pages 550–558.

Sutton, R. S. (1988). Learning to predict by the methods oftemporal differences. Machine Learning, 3(1):9–44.

Sutton, R. S. and Barto, A. G. (1998). Introduction to Re-inforcement Learning. MIT Press, Cambridge, MA,USA, 1st edition.

Tesauro, G. (1995). Temporal difference learning and TD-gammon. Commun. ACM, 38(3):58–68.

Tesauro, G. and Galperin, G. R. (1997). On-line PolicyImprovement using Monte-Carlo Search, pages 1068–1074. MIT Press.

van Otterlo, M. and Wiering, M. (2012). ReinforcementLearning and Markov Decision Processes, pages 3–42. Springer Berlin Heidelberg, Berlin, Heidelberg.

Watkins, C. J. and Dayan, P. (1992). Q-learning. Machinelearning, 8(3):279–292.

Opponent Modelling in the Game of Tron using Reinforcement ...

Documents

Transcript of Opponent Modelling in the Game of Tron using Reinforcement ...