
Model-based Learning of Interaction Strategies in

Multi-agent Systems

Research Thesis

Submitted in partial fulfillment of the requirements for the degree of Doctor of Science

David Carmel

Submitted to the Senate of the Technion - Israel Institute of Technology

Heshvan 5758 Haifa November 1997


This research was carried out in the Faculty of Computer Science under the supervision of Dr. Shaul Markovitch and Dr. Jeffrey S. Rosenschein.

The generous financial help of the graduate school is gratefully acknowledged.

Acknowledgments

I would like to acknowledge some of the people who have encouraged me and helped me to complete this research. I extend my deepest gratitude to:

• My first supervisor, Shaul Markovitch, for being just as eager as I was about my own research. We have spent countless hours discussing interaction strategies and algorithms. His experience in learning and search was of great value, and many of the ideas appearing in this research are borrowed from his own ideas. I thank Shaul for giving me the opportunity to do research with him and to learn from his experience; but most of all, I thank him for the good times I had at the Technion during the last four years, working under his supervision.

• My second supervisor, Jeffrey Rosenschein, for some helpful discussions that inspired my interest in interactions among intelligent agents. His work on negotiation protocols is one of the main reasons for my interest in interaction strategies.

• Robert Axelrod, who sent me the computer code of the strategies that participated in his famous Iterated Prisoner's Dilemma tournament.

• Hiroyuki Iida and Aske Platt, for sending me their thesis papers.

• Keren Arnon, my colleague, for her careful proofreading.

• My wife Jeny. Her inspiration was invaluable throughout this whole process. Without her love and support I could never have succeeded.

• My children, Adi, Omer, and Matan, for their love and interest in their father's work, even when it seemed that this research was taking me a little bit too long.


Contents

1 Introduction
  1.1 Multi-agent Systems
    1.1.1 The environment
    1.1.2 The agents
    1.1.3 The interaction process
  1.2 Learning in Multi-agent Systems
    1.2.1 Learning interaction strategies
    1.2.2 A model-based interaction strategy
  1.3 Related Work
  1.4 Thesis Overview

2 Model-based Learning in Repeated Games
  2.1 Interaction as a Repeated Game
  2.2 Interaction with Regular Opponents
  2.3 Inferring a Best-response Strategy against Regular Opponents
  2.4 Learning Models of Regular Opponents
    2.4.1 Introduction
    2.4.2 Unsupervised learning of DFA
    2.4.3 Looking for a better hole-assignment
    2.4.4 Iterative US-L*
  2.5 Summary

3 Exploring the Opponent's Strategy
  3.1 Exploration versus Exploitation
  3.2 Exploration Strategies for Model-based Learning
    3.2.1 Undirected strategies
    3.2.2 Directed strategies


  3.3 Lookahead Based Exploration
    3.3.1 Mixed strategies
    3.3.2 An almost-best response strategy against a regular mixed model
    3.3.3 Learning a regular mixed model
  3.4 Related work
  3.5 Summary

4 Experimentation: On-line Learning in Repeated-games
  4.1 Experimentation Methodology
  4.2 On-line Learning of Random Opponents
  4.3 Model-based Learning vs. Reinforcement Learning
  4.4 The Contribution of Exploration to the Agent's Performance
  4.5 Experiments against Non-random Opponents

5 Model-based Interaction Strategies for Alternating Markov Games
  5.1 Alternating Markov Games
  5.2 Multi-model Search
  5.3 Incorporating the Opponent Model into the Search
    5.3.1 The M* algorithm
    5.3.2 A one-pass version of M*
    5.3.3 Properties of M*
  5.4 Using Uncertain Models
  5.5 Summary

6 Adding Pruning to Multi-Model Search
  6.1 Simple Pruning
  6.2 A Sufficient Condition for Pruning
  6.3 αβ*: A Pruning Version of M*
    6.3.1 The αβ* algorithm
    6.3.2 Correctness of αβ*
  6.4 αβ*_1p: A Pruning Version of M*_1p
    6.4.1 The αβ*_1p algorithm
    6.4.2 Correctness of αβ*_1p
    6.4.3 Optimality of αβ*_1p
  6.5 Average Case Performance - Experimental Study
  6.6 Summary


7 Practical Issues Regarding Multi-model Search
  7.1 Multi-model Search in the Checkers Domain
  7.2 Learning an Opponent Model in Zero-sum Board Games
    7.2.1 The learning algorithm
    7.2.2 OLSY: An Opponent Learning System

8 Conclusions
  8.1 Summary
  8.2 Discussion
  8.3 Future Work

Bibliography


List of Figures

2.1 The payoff matrix for the Prisoner's Dilemma game.
2.2 An architecture for a model-based learning agent in repeated games.
2.3 A DFA that implements the strategy TFT for the IPD game.
2.4 An example of a closed and consistent observation table, and a minimized DFA consistent with the table. Σ_in = {a, b}, Σ_out = {0, 1}.
2.5 US-L*: Unsupervised learning algorithm for DFA.
2.6 A learning session of US-L*. Holes are marked by squares.
2.7 The modified consistency loop that tries to solve the inconsistency of the table by changing the hole assignment prior to adding tests.
2.8 A learning session of the modified US-L*.
2.9 An iterative version of US-L*. The extra parameter i determines a limit on the maximal number of changes for each hole in the table.

3.1 An example of an opponent model in a local minimum. (Left) An opponent's strategy. (Right) An opponent model. The model dictates playing "all-d" while the actual best response is "play c then all-d".
3.2 Left: The Grim strategy for IPD. Right: The current model held by the agent. An exploratory action against Grim (d) will be followed by falling into the "defection sink" without any opportunity to get out.
3.3 The ε-BR algorithm that returns an ε-best response automaton against a regular mixed model.
3.4 Top: The two models belonging to the mixed model after t mutual cooperations in the IPD game. Bottom: The search tree spanned by ε-BR. The player's actions are marked by solid lines. The opponent's actions are marked by dashed lines. To save space, duplicated subtrees are drawn only once. Action c is preferred over d for any value greater than 0.5.
3.5 A mixed strategy acquired after t mutual defections in the IPD game. M0 predicts future defection by the opponent. M1 predicts cooperation. The beliefs are computed in relation to the model's size and the model's cover according to the game history.

4.1 On-line learning: (Left) The average cumulative reward of the MB-agent attained during the game. (Right) The average relative utility of the inferred models during the repeated game.
4.2 The average size of the inferred models during the game.
4.3 Left: The best exploration parameter for various discount parameters. Right: On-line learning of random automata of various sizes.


4.4 The average cumulative reward of the MB-agent and the Q-agent while playing against random automata during the repeated PD game.
4.5 The average cumulative reward of different exploration strategies while playing the IPD against 100 random automata of size 60.
4.6 The models learned by an MB-agent after 200 iterations of the PD game. The best-response cycles are highlighted.

5.1 The computation of the M value of state a. Squares represent states where it is the player's turn to play. Circles represent states where it is the opponent's turn to play. The opponent model is used to obtain the move selected by the opponent, while the static evaluation function is used to evaluate the resulting state.
5.2 The computation of the M1 value of position a with depth limit 3. Squares represent nodes where it is the player's turn to play. Circles represent nodes where it is the opponent's turn to play. Part (a) shows the two calls to minimax, using f0, for determining the opponent's choices. Note that the opponent is a maximizer. Part (b) shows the recursive calls of M1, using f1, for evaluating the opponent's choices.
5.3 The set of recursive calls generated by calling M*(a, 3, (f2, (f1, f0))). Each call is written next to the node it is called from. (a) The player (f2, (f1, f0)) calls its model (f1, f0), which calls its model of the player (f0). The moves selected by (f0) are highlighted. (b) The model (f1, f0) evaluates the states selected by (f0) using evaluation function f1. (c) The player evaluates the states selected by its model using f2.
5.4 The M* algorithm.
5.5 The value vectors propagated by M*_1p. This is the same tree as the one shown in Figure 5.3.
5.6 M*_1p: A version of the M* algorithm that performs only one pass over the search tree.
5.7 An example of the sequence of calls produced by M*_ε. The left figure shows the calls for the case of ε = 1, while the right one is for the case of ε = 0.5.
5.8 The M*_ε algorithm.

6.1 Left: An example of a search tree where αβ, using f1, would have pruned node g. However, such pruning would change the M1 value of the tree for the player P1 = (f1, (f0, ?)). V0 and V1 represent the M0 and M1 values of the inner nodes of the tree, respectively. Right: Pruning is possible given a bound on the sum of the functions, |f1 + f0| ≤ 2.
6.2 An example of the propagation of the sum-bounds up the search tree. The active models are P2 and P0. P0 (and therefore P1) propagates its value from node e. P2 propagates its value from node d. The theoretical sum-bound computed by the bound lemma is B12 = B02 + 2B01 = 3. Indeed, at node b, V2 + V1 = 2.9.
6.3 The αβ* algorithm.
6.4 Three types of pruning applied by αβ*. The arrows indicate the nodes which affect the bounds for pruning. (Left) After computing the decision of the opponent for successor s0. (Middle) Before making the second recursive call. (Right) After computing the player's value for the opponent's decision.
6.5 The αβ*_1p algorithm.
6.6 An example of shallow pruning (left) and deep pruning (right) performed by αβ*_1p.


6.7 The two possibilities when model j agrees to prune. (Left) At the beginning of the search at node s, α_j < β_j, and after evaluating one of the successors, α_j is modified to become bigger than β_j. (Right) At the beginning of the search at node s, α_j ≥ β_j.
6.8 An illustration for the proof of the optimality of αβ*_1p. Every directional algorithm must examine node k. Each of the two alternative assignments affects the value of g differently.
6.9 The average EBF of αβ* (left) and αβ*_1p (right) as a function of the sum-bound.
6.10 The average EBF of αβ* (left) and αβ*_1p (right) as a function of the search depth.
6.11 The average EBF of αβ* (left) and αβ*_1p (right) as a function of the branching factor.

7.1 An example of a practical application of αβ*. The player's function prefers bishops over knights and weights them by 3.1 and 2.9 respectively. The model's function prefers knights over bishops and weights them by 3.1 and 2.9 respectively. The sum of the two functions is bounded by 0.8. In the left tree, αβ* prunes node g just as αβ would, since node b is worth at most -0.2 for the player. In the right tree, αβ with f1 would prune node g. αβ* would not prune this node since node b has an upper bound of 0.8. For example, if the move leading to node g is a knight capture, the model will prefer node g (with a value of 0.2). In such a case node g, and therefore node c, is worth 0.2 for the player as well and determines the Mn value of the tree.
7.2 The average EBF of αβ* and αβ*_1p as a function of the search depth in the checkers domain.
7.3 LearnModel: An algorithm for learning an opponent model from examples.
7.4 Learning opponent models from examples.
7.5 The performance of the learning system as a function of the number of examples, measured by average points per game.


List of Tables

4.1 The cumulative reward, relative utility, and model size attained by the agents after 400 stages of the PD game. The results are averaged over 100 trials. The standard deviation is given in parentheses.
4.2 The cumulative rewards attained by the MB-agent after 200 stages of the PD game against five different opponents.

5.1 The difference in behavior between Maxⁿ and M*_1p at node e in Figure 5.5. The left column specifies the steps taken by Maxⁿ, using 3 players: p2 with evaluation function f2, p1 with evaluation function f1, and p0 with evaluation function f0. The right column shows the steps taken by M*_1p using the player P2 = (f2, (f1, f0)).

7.1 The results obtained by an iterative deepening version of αβ* when played against an iterative deepening version of αβ. Both algorithms were allocated the same search resources (leaf evaluations per move). Each row represents a tournament of 1,000 games. The last two columns show the average search depth of the two algorithms.


Abstract

Consider an electronic market where agents can interact and trade. The agents involved in the market are completely autonomous and act on behalf of their masters. In such a multi-agent system, where other agents may be potential partners, or competing opponents, an agent should have the ability to identify the other agents' intentions and goals and to be able to predict the others' future behavior for planning its own behavior.

Designing an "effective" strategy for handling interactions in multi-agent systems is extremely difficult even in environments that appear to be simple, because its effectiveness depends mostly on the strategies of the other agents involved. However, the agents are autonomous, hence their strategies are not known at the time of design. One way to deal with this problem is to endow the agents with the ability to adapt their strategies based on their interaction experience.

This work deals with the problem of designing interaction strategies from the point of view of a designer who looks for a strategy that will enable his agents to interact autonomously with other agents, without the need for external control. We present a model-based learning strategy which splits the learning process into two separate stages. In the first stage, the agent infers models of the other agents based on past interaction experiences. In the second stage, the agent utilizes the learned models to design an effective strategy for future interactions.

We investigate this approach in two domains. In the first, interactions among the agents are modeled by the game-theoretic concept of repeated games. In the second, interactions are represented by Alternating Markov games. In the repeated-game domain, we present a general architecture for a model-based on-line learner. The learner accumulates the history of the repeated game. The history is given to a learning procedure which generates a consistent model. The model is then used by the agent to design a best-response strategy to direct its future behavior. This process is repeated throughout the game. We adopt a restriction, common in game theory, that the opponent models be regular strategies, i.e., strategies that can be represented by deterministic finite automata. We show that finding the best-response strategy against a regular strategy can be done efficiently given some common utility functions for repeated games. We then show how an adaptive agent can infer a regular opponent model based on its interaction experiences.

Acting according to the acquired best-response strategy may leave unknown aspects of the opponent's behavior unexplored. We overview some exploration methods, originally developed for reinforcement learning, and show how to incorporate them into model-based learning.

There are two issues that should be considered when dealing with exploration. The first is the balance between exploration and exploitation, i.e., how much to explore poorer alternatives according to the given model in order to acquire better knowledge, versus how much to exploit the current model. The second is the risk involved in exploration: an exploratory action taken by the agent can yield a better model of the other agent, but it also carries the risk of putting the agent into a much worse position. We describe a lookahead-based exploration strategy that deals with these two issues. It provides a rational tool for the agent to combine the two opposing goals when evaluating its alternatives. Actions are evaluated according to their expected utility and also according to their expected contribution to the acquired knowledge about the opponent's strategy. The strategy also deals with the risk involved in exploration by searching some stages deeper in order to predict the opponent's reactions in future stages of the game.

We report some experimental results in the Iterated Prisoner's Dilemma game, demonstrating the superiority of the model-based learning agent over non-adaptive agents and over reinforcement-learning agents. The experiments also demonstrate the superiority of the lookahead-based exploration strategy over the uninformed exploration methods.

In the Alternating Markov games framework, the model-based agent possesses an opponent model which is a recursive structure consisting of an evaluation function of the opponent and a player model (held by the opponent). We present multi-model adversary search, a generalized search algorithm that incorporates recursive opponent models into adversary search. The M* algorithm is a generalization of minimax that uses an arbitrary opponent model to simulate the opponent's search.

Pruning in multi-model search is impossible in the general case. We prove a sufficient condition for pruning based on a bound on the sum of the player's and the model's evaluation functions. We then present the αβ* algorithm that returns the value of a tree while searching only necessary branches. We prove the correctness and optimality of the algorithm and provide an experimental study of its pruning power. We also describe the M*_ε algorithm that allows the use of uncertain models when a bound on the model error is available. We present a learning algorithm that infers an opponent model based on examples of its past decisions. The algorithm generates a large set of inequalities, expressing the opponent's preference for its selected moves over their alternatives, and uses linear programming to solve this set of inequalities.
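
As an illustration of this inequality-based formulation (a hedged sketch only, not the thesis's learning algorithm; the linear evaluation function, the toy feature vectors, and all names below are invented for the example), assume the opponent's evaluation function is a weighted sum of position features. Every observed decision in which the opponent preferred a move over an alternative contributes one linear inequality on the weight vector, and the resulting system can be handed to an off-the-shelf LP solver:

```python
import numpy as np
from scipy.optimize import linprog

# Each row is features(chosen move) - features(rejected alternative).
# Toy numbers for illustration; in practice the features come from game positions.
diffs = np.array([
    [ 1.0, -1.0,  0.5],
    [ 0.5,  0.2, -1.0],
    [-0.2,  1.0,  0.3],
])

n = diffs.shape[1]
# Look for weights w with diffs @ w >= 1 for every observed decision,
# i.e. -diffs @ w <= -1, keeping the weights in a bounded box.
res = linprog(c=np.zeros(n),
              A_ub=-diffs, b_ub=-np.ones(len(diffs)),
              bounds=[(-10, 10)] * n, method="highs")

if res.success:
    print("a weight vector consistent with the opponent's choices:", res.x)
else:
    print("no linear evaluation function is consistent with these examples")
```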

Experiments performed in the domain of checkers demonstrate the advantage of M* over minimax. The pruning version of M*, αβ*, is more restricted than αβ in its pruning decisions. However, even when the two algorithms are allocated equivalent search resources, αβ* is able to overcome its reduced depth of search by exploiting the opponent model.

An agent that interacts with other agents in MAS can benefit significantly from adapting to the others. The model-based framework described in this work can contribute greatly to the agents' performance, and can serve as a general basis for future research on adaptive agents in multi-agent systems.


Chapter 1

Introduction

Consider an open market where buyers and sellers meet and interact for trade. The agents involved in the market are completely autonomous and consider only themselves. They may also have different preferences, different negotiation skills, and different levels of knowledge. In such a system, where other agents may be potential partners or, alternatively, competing opponents, an agent should have the ability to identify the other agents' intentions and goals and be able to predict the others' behavior. For example, a seller would benefit greatly if she could predict the buyer's reservation price while bargaining, and vice versa.

Similar skills can help agents who operate in a multi-agent system, a computational system of artificial agents who operate in a common environment. Today, advanced computer networks such as the Internet provide a solid platform for the realization of such systems. For example, an information-gathering agent may have to interact with information-supplying agents in order to obtain highly relevant information at a low cost. Other examples are situations where conflict resolution, task allocation, resource sharing, and cooperation among the agents are needed.

In general, designing an "effective" strategy for handling interaction is extremely difficult even in environments that appear to be simple, because its effectiveness depends mostly on the strategies of the other agents involved. However, the agents are autonomous, hence their strategies are private. One way to deal with this problem is to endow the agents with the ability to adapt their strategies based on their interaction experience [106]. In unpredictable environments, the designer's aim is to increase the agent's autonomy to enable independent decision making. Adaptation is essential for autonomous behavior.

In this work we present a model-based learning approach for learning interaction strategies. In this approach, agents learn models of the other agents in order to predict their future behavior from their past behavior. This approach lets the agent adapt to the others during interaction, and tries to reduce the number of interaction examples needed for adaptation by investing more computational resources in deeper analysis of past experiences. We describe a general architecture for a model-based learning strategy and a study of the advantages and limitations of this method.

The rest of this chapter is organized as follows. Section 1.1 overviews multi-agent systems and the aspects of them most relevant to this work. Learning in multi-agent systems is a relatively new but significant topic of research in artificial intelligence. Section 1.2 analyzes some learning approaches that have been implemented recently for such systems and compares their suitability to the questions considered in this work. We focus mainly on learning strategies for handling interaction in such a system. Section 1.3 reviews some related work, and finally, Section 1.4 concludes with a statement of the goals of the thesis and a thesis overview.


1.1 Multi-agent Systems

Artificial intelligence (AI) is centrally concerned with designing intelligent agents that are able to achieve their goals autonomously. One of the common assumptions in AI planning is that the agent operates in an isolated, closed environment. In general, the existence of other agents in the system, artificial or natural, is ignored. Distributed artificial intelligence (DAI) [13] focuses on designing distributed systems of artificial agents. There are two main areas of research for such systems [30]. One is distributed problem solving (DPS), in which the agents are designed to achieve their common goals by a centralized mechanism. In the second, multi-agent systems (MAS), the agents are fully autonomous. Each agent represents a different entity and operates to achieve its goals in a selfish manner.

In DPS, the main issue for research is how to coordinate the activities of the different agents and how to efficiently exploit their private knowledge and capabilities for achieving their common goals. In MAS, on the other hand, there is no central control to direct the agents. Each agent must operate independently on behalf of its owner, selfish and autonomous as it is, while considering the existence and activity of the other agents in the environment.

There are several reasons for the recent broad interest in multi-agent systems. As distributed systems, they offer parallelism, robustness, and scalability. In particular, they are well suited to domains which cannot be handled by a centralized authority. MAS are also applicable to domains which require the integration of multiple interests and the resolution of goal conflicts. In general, collective intelligence and emergent behavior are shared fields of interest in many research domains such as economics, sociology, political science, and many others.

Multi-agent systems can be characterized by three major aspects: the environment occupied by the agents, the agents themselves, and the interaction processes (agent-agent interaction and agent-environment interaction). In the following, these three concepts are described in more detail.

1.1.1 The environment

Generally speaking, the environment of the system is anything external to the agents. It establishes the domain in which the agents operate. It characterizes the rules of behavior for the agents, the outcomes of their actions, and the interaction processes. The environment is open in the sense that there is no central controlling authority and agents can join and leave at any time. As a general convention, it is usually assumed that as long as an agent operates in the environment, it follows the environment's basic rules.

In the AI literature, environments are commonly represented by a state graph. Nodes of the graph represent the states of the world, physical or abstract. At any time, each agent is found in one state of the environment. Actions performed by the agents cause changes in the agents' states and, in some environments, are followed by a reward, positive or negative, paid to the performing agents.

In general, agents' goals are represented in two alternative ways. One way is by marking some of the states as goal states. In these environments, the agent's task is to achieve its goals by reaching one of the goal states. The other alternative is to associate a payoff with each state-action pair, a reward that expresses the desirability of the action in the given state. In such a system, the agent's goal is to accumulate as much reward as possible.
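
For concreteness, the toy fragment below (invented names and numbers, not taken from the thesis) encodes the second representation: every state-action pair is mapped to a successor state and a payoff, and an agent following a policy simply accumulates the payoffs along its path.

```python
# A toy state-graph environment (illustrative only, not from the thesis):
# each (state, action) pair maps to a successor state and a payoff.

GRAPH = {
    ("s0", "a"): ("s1",  1.0),
    ("s0", "b"): ("s2", -1.0),
    ("s1", "a"): ("s0",  0.0),
    ("s1", "b"): ("s2",  2.0),
    ("s2", "a"): ("s0",  0.5),
    ("s2", "b"): ("s1",  0.5),
}

def run(policy, start="s0", steps=10):
    """Accumulate the payoffs collected by following `policy` for a few steps."""
    state, total = start, 0.0
    for _ in range(steps):
        action = policy(state)
        state, reward = GRAPH[(state, action)]
        total += reward
    return total

if __name__ == "__main__":
    print(run(lambda s: "b" if s == "s1" else "a"))
```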

We can distinguish between certain and predictable environments, in which agents have perfect knowledge about the expected results of their actions, and uncertain and unpredictable ones. The environment can also be characterized by its dynamics, i.e., whether its rules change over time, and by its accessibility, i.e., whether the agents' perceptions reveal the entire state of the environment.

It is important to note that in multi-agent systems each agent faces an environment that contains the other agents found in the system. As a consequence, the environment of the whole system is a composition of the personal environments of all the agents. This makes the problems of multi-agent systems much more complicated than traditional AI problems and much more difficult to solve.

1.1.2 The agents

There has been considerable discussion on the question of the properties that distinguish a program, or a software object, from an agent. In the broad sense, an agent is "...anything that can be viewed as perceiving its environment through sensors and acting upon that environment through effectors." [84, p. 33] The agents' behavior is controlled by a strategy, a function that maps situations to actions, and designing an effective behavioral strategy for the agents is the main concern of their designers.

One of the conventions in the agency literature is that an agent is an object which in some sense is intelligent and autonomous [108]. The broad meaning of the term intelligence has led many authors to consider many properties that an agent should have:

• Autonomy: The ability to act autonomously, without direct intervention, in order to achieve its owner's goals.

• Rationality: The ability to behave in a way which is suitable, or even optimal, for goal attainment.

• Adaptiveness: The ability to adapt its behavior to changing circumstances.

• Benevolence: The property of always doing what it is asked to do.

• Pro-activeness: Agents do not simply act in response to their environment; they are able to exhibit goal-directed behavior by taking the initiative.

As computational entities, agents are bounded rational [96], a term that characterizes the agents' computational limitations, which follow from their limited computational resources. The limited rationality is determined by the set of strategies available to the agents and provides a classification of the agents according to their computational power. In this work we deal with regular agents, i.e., agents whose behavioral strategies can be represented by finite state machines. The agents' limitations will also be characterized by their bounded depth of search and by their limited knowledge about their opponents' strategies.
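
As a concrete illustration of such a regular strategy (an invented sketch, not the thesis code), the fragment below represents a strategy as a small deterministic finite automaton whose states emit moves and whose transitions are driven by the moves observed from the other agent; Tit-for-Tat for the Iterated Prisoner's Dilemma, which Chapter 2 depicts as a DFA (Figure 2.3), serves as the example.

```python
# A minimal sketch of a regular strategy as a DFA (illustrative only).

class DFAStrategy:
    def __init__(self, start, output, delta):
        self.start = start          # initial state
        self.output = output        # state -> action emitted in that state
        self.delta = delta          # (state, observed action) -> next state
        self.state = start

    def reset(self):
        self.state = self.start

    def act(self):
        return self.output[self.state]

    def observe(self, other_action):
        self.state = self.delta[(self.state, other_action)]

# Tit-for-Tat for the Iterated Prisoner's Dilemma: cooperate first,
# then repeat whatever the other player did in the previous round.
tft = DFAStrategy(
    start="C",
    output={"C": "c", "D": "d"},
    delta={("C", "c"): "C", ("C", "d"): "D",
           ("D", "c"): "C", ("D", "d"): "D"},
)

if __name__ == "__main__":
    for my_move in ["c", "d", "d", "c"]:
        print("opponent plays", tft.act())
        tft.observe(my_move)
```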

1.1.3 The interaction process

Agent-environment interaction is one of the main fields of research in AI, and many traditional AI problems can be considered from an agent-environment perspective. In this work we are especially interested in agent-agent interaction.

Interaction among agents is essential for coordination, conflict resolution, and many other problems that arise in multi-agent systems. A rational autonomous agent should accept a temporary delay in completing its tasks if the expected reward of coordination with others is greater than or equal to the cost of the delay. For example, a software agent that looks for information on the Web may benefit from cooperation with other software agents during the search. Assume that two autonomous agents search for information in a shared knowledge base at the same time. Each agent can complete its own task independently. Alternatively, they can cooperate by sharing their tasks: each agent looks for the information required by the other agent concurrently while pursuing its own goals. The cost of search may be higher because of the extra work required, but the agents may be rewarded by the reduced time of search.

The interaction process requires communication skills of the interacting agents. Recently, agent-communication languages such as KQML [32] have been developed to allow communication and coordination in multi-agent systems. In this work we consider a lower level of communication, in which agents communicate by signaling their chosen actions to each other. We deal with two types of agent-agent interaction. The first is a simultaneous interaction represented by a game. Games are common tools in game theory for representing interactions among rational agents [65]. They encapsulate the major aspects of the agents' autonomy and rationality and provide a decision-theoretic approach for analyzing the decision process performed by the agents. In such a game, each agent has a finite set of alternatives (moves) that can be chosen, and the game outcome is determined by the joint moves performed by the agents. At each time tick, each agent performs its chosen move and observes the game outcome, which follows from the joint actions performed simultaneously by all the active agents. We only consider non-cooperative games, that is, games in which no communication is allowed among the agents before the game. As a consequence, agents can only communicate and signal each other by performing their chosen actions.
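
The sketch below instantiates one stage of such a game for two players (an illustration only; the standard Prisoner's Dilemma payoffs 3/0/5/1 are assumed here, while Figure 2.1 gives the matrix actually used in the thesis): at each tick both agents choose simultaneously, and each sees only the resulting joint move.

```python
# One simultaneous stage of a two-player game (illustrative sketch).
# Standard Prisoner's Dilemma payoffs are assumed: R=3, S=0, T=5, P=1.
PD = {
    ("c", "c"): (3, 3),
    ("c", "d"): (0, 5),
    ("d", "c"): (5, 0),
    ("d", "d"): (1, 1),
}

def play_round(strategy_a, strategy_b, history):
    """Both agents choose simultaneously, seeing only the past joint moves."""
    move_a = strategy_a(history)
    move_b = strategy_b(history)
    payoff_a, payoff_b = PD[(move_a, move_b)]
    history.append((move_a, move_b))
    return payoff_a, payoff_b

if __name__ == "__main__":
    history = []
    always_defect = lambda h: "d"
    tit_for_tat = lambda h: "c" if not h else h[-1][0]  # copy the other's last move
    for _ in range(5):
        print(play_round(always_defect, tit_for_tat, history))
```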

We also deal with a more complicated interaction process represented by an alternating Markov game [63]. In these games, the results of interaction among the agents depend on the current state of the world in which they operate. The game rules change repeatedly as a function of the current state of the world.

Alternating Markov games are a convenient paradigm for representing interaction among agents. They can serve as a model for complicated interactions among rational agents, in which the interaction process is described by a sequence of alternating moves performed by the agents. They generalize two-player board games such as chess and checkers, examples that have provided a well-known testbed for research since the early days of AI [91]. An interaction strategy for such games is usually described by a decision procedure, typically based on lookahead search.

The multi-agent framework is presented as a tournament. The agents interact with each other in pairs, i.e., we only deal with two-player games, but an agent's performance is measured in relation to the performance of the whole population. Furthermore, the tournament environment simulates the main properties of a multi-agent system: the agents operate autonomously, each considering only its own interest. Moreover, the interaction strategies of the agents are private and are not known prior to or during the time of interaction.

1.2 Learning in Multi-agent Systems

Multi-agent systems are typically complex with respect to their structure and functionality. Even in environments that appear to be simple, it is extremely difficult, or even impossible, to correctly determine the behavior and activities of the agents involved at the time of design. A complete design would require prior knowledge about any agent that might be involved: how it will interact with the environment and with the other agents. For autonomous agents, with private knowledge and goals, such a requirement is not reasonable. These kinds of problems can be reduced by endowing the agent with the ability to adapt and to learn, that is, to improve its performance according to its accumulated experience.

Learning in multi-agent systems is a relatively new topic in AI. It extends learning in single-agent systems, since the agents can learn by exchanging information and experiences. It also addresses the issue of adaptation to other agents that operate in the system. For a collection of recent works in this area and a comprehensive bibliography, see [106, 74].

We distinguish between two kinds of learning according to the feedback available to the learning agents. The feedback is assumed to be provided by the environment, or by the other agents.

• Supervised learning. The feedback specifies the desired activity for the learner, and the objective of the learner is to match this desire as closely as possible. In this case, the environment (or the other agents) acts as a teacher.

• Unsupervised learning. The feedback specifies the utility of the actual activity for the learner, and the learner's objective is to maximize this utility on the basis of trial and error.

We also distinguish between distributed and collective learning. In the first, each agent adapts its behavior separately. In this case, the difference between single-agent learning and multi-agent learning shows up mainly in the consequences of mutual adaptation. A typical research question in such systems is the global improvement in system performance resulting from the individual adaptations. Sen et al. [90] and Littman and Boyan [64] describe multi-agent systems of learning agents in which the agents adapt their strategies independently, but the performance of the whole system, according to some given external criterion, is improved.

In collective learning, agents learn together as a group. Tan [100] addresses a system of cooperative learning agents that share their experiences with each other; he shows better results due to cooperation. In co-learning [94, 95], numerous simple agents interact and share the same adaptive rule for adapting their interaction strategies. Co-learning is mainly focused on the emergence of conventions. If the adaptive rule is "satisfying", the agents' behavior will converge to equilibria.

Many coordination mechanisms, typically used in economic markets, have been suggested by several authors for coordination among agents. For some examples of electronic markets, see [68, 107]. Shaw & Whinston [93] and Weiss [105] describe a MAS as a classifier system. The agents, represented as classifier rules, learn to coordinate their actions in order to accumulate maximum reward from the environment. The agents compete, via a bidding mechanism, for the right to operate in the system and to perform the system's tasks. Agents offer bids that are determined by the fitness of the agents' skills for the given task and by the strength of the agent, which depends on the amount of reward it has already accumulated. Agents offering the highest bids receive the right to operate. When a reward is received from the environment, it is divided among the performing agents and the active agents that operated in the past, according to the bucket-brigade algorithm. Such a learning process guarantees that the agents will learn to coordinate their actions and to operate in situations in which their activity is really profitable.
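
The bidding-and-credit mechanism just described can be compressed into the following sketch (a generic illustration, not the cited systems; all class and parameter names are invented): the highest bidder acts, pays its bid to the previous actor, and adds any external reward to its own strength.

```python
# A compressed sketch of bid-based selection with bucket-brigade credit
# passing (illustrative only; not the cited systems).
import random

class BiddingAgent:
    def __init__(self, name, fitness, strength=10.0, bid_ratio=0.1):
        self.name, self.fitness = name, fitness
        self.strength, self.bid_ratio = strength, bid_ratio

    def bid(self, task):
        # A bid grows with both the agent's fitness for the task and its strength.
        return self.bid_ratio * self.fitness(task) * self.strength

def step(agents, task, previous_winner, reward):
    winner = max(agents, key=lambda a: a.bid(task))
    payment = winner.bid(task)
    winner.strength -= payment               # pay for the right to act
    if previous_winner is not None:
        previous_winner.strength += payment  # bucket brigade: credit flows back
    winner.strength += reward                # external reward from the environment
    return winner

if __name__ == "__main__":
    agents = [BiddingAgent("a1", lambda t: 0.8), BiddingAgent("a2", lambda t: 0.5)]
    prev = None
    for t in range(3):
        prev = step(agents, task=t, previous_winner=prev, reward=random.random())
        print(prev.name, [round(a.strength, 2) for a in agents])
```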

1.2.1 Learning interaction strategies

As we have already mentioned, we are especially concerned with the question of designing an agent that must interact effectively with other agents in a multi-agent system. Agents are usually designed to achieve their masters' goals by trying to maximize some given utility measure.


The designer's task is to devise an interaction strategy that will maximize the payoff received by the agent. A common technique used by adaptive agents in MAS for interaction with the environment and with other agents is reinforcement learning (RL) [90, 95, 87]. RL is based on the idea that the tendency to produce an action should be reinforced if it produces favorable results, and weakened if it produces unfavorable results. Littman [62] analyzes interaction among agents as a Markovian process in which the environment changes its state due to the joint action performed by the agents. In the Markov game framework, he deals with zero-sum games representing two-player interaction, and describes minimax-Q, an RL algorithm that enables the learning agent to adapt its interaction strategy under the pessimistic assumption that the opponent will always choose the alternative that is worst for the learner.

Sandholm & Crites [87] describe the interaction process in MAS as a repeated game between the agents, a framework similar to the one used in this work. They show that an RL agent is able to find an optimal interaction strategy against a stationary simple opponent, but it needs about 100,000 interaction examples for convergence. Their results point out the major weakness of this learning approach: its slow convergence. An effective interaction strategy is acquired only after processing a large number of examples, and during the long period of adaptation the agent pays the cost of interacting sub-optimally. Bui et al. [14] describe a different learning process, based on case-based reasoning, which tries to reduce the amount of communication needed for adaptation.

Another common learning technique used by adaptive agents is genetic algorithms [5, 33, 71, 7]. In this framework, an interaction strategy is represented by a function from the histories of the repeated game to the set of the agent's actions. Strategies are represented by chromosomes, and learning is simulated by an evolutionary process: alternative strategies compete in a simulated tournament, and the society of strategies is evolved according to evolutionary operators such as crossover and mutation.

The evolutionary simulation can be treated as a search in the space of strategies. An effective strategy might be found if the fitness criterion for evaluating the competing strategies represents the agent's goals effectively. The main difficulty in this approach is that the evaluation of interaction strategies is hard: the strength of a strategy depends mainly on the other strategies involved in the common environment. A strategy that is successful in one society of strategies may completely fail in a different one.

Another framework for learning interaction strategies is based on supervised learning. Goldman & Rosenschein [43] describe a cooperative scenario in which agents learn to coordinate their actions by communicating and sending each other classified examples of their activity. Each agent acts as a teacher for its partner. The agents then have to apply the concept they have learned and to coordinate their own behavior according to the teacher's behavior.

The "learning to act" framework addressed by Khardon [55] presents a scenario in which the agents learn to interact by inferring a strategy consistent with a set of examples supplied by a teacher. The teacher is assumed to be an experienced agent in the environment which the adaptive agents would do well to follow in order to achieve better performance.

Supervised learning methods are commonly used by adaptive game-playing agents. In alternating Markov games, the task of inducing an interaction strategy is mainly focused on inferring an evaluation function over game positions. The task of inferring a function from examples is equivalent to the book-learning technique: inferring an evaluation function from a list of moves performed by human experts [25, 103, 86]. The evaluation function to be learned should be consistent with the experts' preferences, i.e., it should prefer the moves selected by the experts.


1.2.2 A model-based interaction strategy

The supervised learning described in the above subsection is not suitable for agents that can trust only their own experience and do not have a teacher to learn from. In the absence of an assisting teacher, unsupervised learning methods should be considered. Reinforcement learning is one of the most popular tools developed in recent decades for unsupervised learning. A major problem with this approach is its slow convergence: an effective interaction strategy is acquired only after processing a large number of interaction examples, and during the long period of adaptation the agent pays the cost of interacting sub-optimally.

In this work we present a model-based learning approach that tries to reduce the number of interaction examples needed for adaptation, by investing more computational resources in deeper analysis of past interaction experience. The learning agent makes use of an explicit model of the other agent's strategy to generate expectations about its behavior, which can then be checked against its actual performance. Learning, in this approach, is a matter of noticing a failure in prediction, identifying the parameters that are relevant to the cause of the failure, and adjusting these parameters.

Model-based learning splits the learning process into two separate stages. In the first stage, the learning agent infers a model of the other agent based on previous interaction examples. In the second stage, the agent utilizes the learned model to design an effective interaction strategy for the future.

A model-based agent begins the game with an arbitrary opponent model. At any stage of the game, the agent updates the opponent model by applying its learning algorithm to the current sample of the opponent's behavior. It then finds the best response against the current model to direct its future behavior. The efficiency of this adaptive strategy depends mainly on the two processes involved: (1) finding the best response against a given model, and (2) learning the opponent model.
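
Schematically, this adaptive loop might look as follows (all names are placeholders invented for illustration; the actual algorithms are developed in Chapters 2 and 3):

```python
# A schematic of the adaptive loop described above (placeholder names,
# not the thesis code): relearn the model only when it fails to predict.

def model_based_play(rounds, opponent, learn_model, best_response):
    history = []                          # joint (my_move, opp_move) pairs
    model = learn_model(history)          # arbitrary initial model
    strategy = best_response(model)
    for _ in range(rounds):
        my_move = strategy(history)
        opp_move = opponent(history)
        if model(history) != opp_move:    # failure in prediction:
            model = learn_model(history + [(my_move, opp_move)])
            strategy = best_response(model)
        history.append((my_move, opp_move))
    return history

if __name__ == "__main__":
    # Trivial stand-ins: the opponent always defects; the "learned model"
    # predicts the opponent's most recent move; the best response to a
    # predicted defection is to defect as well.
    opponent = lambda h: "d"
    learn_model = lambda h: (lambda hist: hist[-1][1] if hist else "c")
    best_response = lambda model: (lambda hist: "d" if model(hist) == "d" else "c")
    print(model_based_play(5, opponent, learn_model, best_response))
```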

Note that the model-based agent (MB-agent) adjusts a model for every other agent it interacts with. Since different agents can have different strategies, the MB-agent should tailor a special interaction strategy for each of the other agents. This requires that the agent be able to recognize the opponent it faces. Identifying the other agents might be too complicated for a bounded-rational agent, and this difficulty grows with the number of agents involved. The model-based approach is suited especially to multi-agent systems where only a few agents are involved. In such a system, the MB-agent can utilize the long sequences of interactions with the others to adjust a proper response for each one of them.

We study the model-based approach using the two frameworks of interaction mentioned above. In the repeated two-player game framework, the objective of each agent is to look for an interaction strategy that maximizes its expected sum of rewards in the game. We restrict the agents' strategies to strategies that can be represented by deterministic finite automata (regular strategies) [83]. First, based on previous work [73], we show that finding the best response strategy can be done efficiently given some common utility functions for repeated games. Second, we show how an adaptive agent can infer an opponent model based on its interaction experience.

In the alternating Markov game framework, an agent's strategy is represented by a search procedure based on its evaluation function and an opponent model that it possesses. We describe M*, a generalization of minimax [91] that utilizes the opponent model to predict the opponent's reactions. We also describe a supervised learning method that enables the agent to adapt its opponent model during the sequence of games.
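
The recursive idea can be sketched roughly as follows (a deliberately simplified illustration for intuition only; the actual M* algorithm, its one-pass version, and its pruning variants are defined in Chapters 5 and 6, and the toy tree and names here are invented): the player maximizes over its own moves with its own evaluation function, but predicts the opponent's reply by recursively searching with the opponent model, falling back to the minimax assumption when no model is available.

```python
# A simplified sketch of opponent-model search (intuition only; not the
# M* algorithm as defined in the thesis). A "player" is a pair
# (evaluation function, model of the other side).

def model_search(node, depth, player, children, my_turn=True):
    """player = (f, opponent_model); opponent_model=None means plain minimax."""
    f, opp = player
    succ = children(node)
    if depth == 0 or not succ:
        return f(node)
    if my_turn:
        # The player picks the move that maximizes its own evaluation.
        return max(model_search(c, depth - 1, player, children, False) for c in succ)
    if opp is None:
        # No model: fall back to the pessimistic minimax assumption.
        return min(model_search(c, depth - 1, player, children, True) for c in succ)
    # Predict the opponent's move by searching with its own model...
    choice = max(succ, key=lambda c: model_search(c, depth - 1, opp, children, True))
    # ...then evaluate the predicted continuation with the player's function.
    return model_search(choice, depth - 1, player, children, True)

if __name__ == "__main__":
    tree = {"a": ["b", "c"], "b": ["d", "e"], "c": ["f", "g"]}
    leaves = {"d": 1, "e": 4, "f": 2, "g": 3}
    f1 = lambda n: leaves.get(n, 0)
    f0 = lambda n: -leaves.get(n, 0)   # the model values positions oppositely
    print(model_search("a", 2, (f1, (f0, None)), lambda n: tree.get(n, [])))
```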


1.3 Related Work

Learning in repeated games has received a lot of attention in the game-theory literature, especially in the context of the expected outcome when the game is played by learning players. A simple form of model-based learning in game theory is fictitious play [37]. In such play, the agents are myopic, i.e., they only consider the current stage of the game. Each player updates its own beliefs about the opponents' choices for the next interaction, and then chooses the optimal response to those beliefs. The players model each other with a distribution over the set of alternatives, and compute the best response to the current models of the other agents. It can be shown that for two-player zero-sum games, fictitious play converges to a Nash equilibrium.¹
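
For intuition, a minimal fictitious-play update for a two-action matrix game might look like the sketch below (a generic illustration with invented names; the standard Prisoner's Dilemma payoffs are used only to exercise the update rule, not because the convergence result above applies to this game):

```python
# Minimal fictitious play for a symmetric two-action matrix game (generic sketch).
from collections import Counter

def fictitious_play(payoff, rounds, actions=("c", "d")):
    """payoff[(mine, theirs)] -> my reward; each player best-responds to the
    empirical frequency of the other's past actions."""
    counts = [Counter({a: 1 for a in actions}), Counter({a: 1 for a in actions})]

    def best_response(opponent_counts):
        total = sum(opponent_counts.values())
        expected = {a: sum(payoff[(a, b)] * opponent_counts[b] / total
                           for b in actions) for a in actions}
        return max(expected, key=expected.get)

    for _ in range(rounds):
        move0 = best_response(counts[1])   # player 0 responds to player 1's history
        move1 = best_response(counts[0])
        counts[0][move0] += 1
        counts[1][move1] += 1
    return counts

if __name__ == "__main__":
    pd = {("c", "c"): 3, ("c", "d"): 0, ("d", "c"): 5, ("d", "d"): 1}
    print(fictitious_play(pd, 50))
```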

Kalai & Lehrer [53] describe a repeated game among Bayesian players who consider the future of the game. Each one starts with subjective beliefs about the individual strategies used by each of its opponents. It then uses these beliefs to compute its own optimal strategy. In addition, the players update their subjective beliefs by a Bayesian updating rule throughout the game. They show that, under some constraints, the game between Bayesian players will converge to a Nash equilibrium.

The above model implicitly assumes that players are at least capable of solving the problem of finding the optimal response, given the strategies of the other players. However, the best-response problem is intractable in the general case. Gilboa & Samet [39] deal with bounded-rational players with restricted computational abilities. They describe a model-based learning strategy for repeated games that learns the best response against any regular strategy. Their method is based on exhaustive search in the space of automata and is therefore impractical for computationally bounded agents. Their procedure enumerates the set of all automata and chooses the current opponent model to be the first automaton in the sequence that is consistent with the current history. Fortnow & Whang [34] show that for any learning algorithm there is an adversarial regular strategy for which the learning process will converge to the best response only after an at least exponential number of stages. One way of dealing with this complexity problem is to limit the space of strategies available to the opponent [35, 80]. Mor et al. [70] follow this paradigm and show that for a limited class of regular strategies, the best response can be learned efficiently.

The agent's assumptions about its opponents also play a central role in multi-agent learning. Rosenschein & Breese [82] show that an agent does better to adapt its playing strategy to its assumptions about the rationality of its opponent. Gmytrasiewicz & Durfee [40, 41] suggest the recursive modeling method (RMM) as a method that agents can use for intelligent coordination with other agents. In the recursive method, the agent models the other agents' strategies and, recursively, the other agents' beliefs about its own strategy. They show that the difficulty posed by the infinite structure of beliefs in RMM is bypassed, since the solution, given the agent's state of knowledge and its preferences, converges to the game-theoretically optimal solution.

Tambe [98, 99] describes a multi-agent environment of air-combat simulations. An intelligent automated pilot in this environment faces the challenge of tracking the behaviors of other pilots while interacting with them. Air-combat domains are also used by Rao & Murray for studying interaction in MAS [78]. Combat pilots ascribe goals and intentions to the behaviors of their opponents; they then use these models to predict the future behavior of their opponents.

Brafman & Tennenholtz [81] formalize the modeling approach by describing agents as qualitative decision makers. The agents' behavior is determined according to their beliefs, preferences, and their decision strategy. Model construction is viewed as a constraint-satisfaction problem in which the agent model should be consistent with the agent's past behavior and with general background knowledge.

¹We say that two strategies are in a Nash equilibrium if each one of them is the best response against the other.


In the Alternating Markov game framework, the minimax approach, which uses a symmetrical information model, has been studied extensively and has reached a state where it is hard to make further significant progress. Korf [59] outlined a general direction for incorporating opponent models into adversary search. Jansen [49] proposes speculative play, a playing strategy that speculates on non-optimal play by the opponent. In a later work, Jansen [50] analyzes the KQKR chess endgame and shows that speculative play can have an advantage over traditional minimax play. Iida et al. [46, 47] describe algorithms that incorporate opponent models into the player's decision procedure; the opponent's decision procedure is assumed to be minimax with a different evaluation function. Recently, Meyer et al. [67] described a system that is able to play through a process of anticipation. This is done by building and refining a model of the adversary's behavior in real time during the game. Their architecture is based on two classifier systems, one for modeling the opponent's strategy and one for implementing the player's strategy.

1.4 Thesis Overview

The main goal of this research is to investigate the problem of designing an agent that must interact with other agents in a multi-agent environment. We approach this problem from the point of view of a designer who looks for a strategy that enables the agent to interact autonomously with other agents, without the need for external control.

Unsupervised learning forces the agent to rely only on its own experience. Most unsupervised learning methods, such as reinforcement learning, demand too many examples for convergence: an effective interaction strategy will be acquired only after processing a large number of interaction examples. In this work we address a model-based learning approach that tries to reduce the number of interaction examples needed for adaptation by investing more computational resources in a deeper analysis of past interaction experience. The learning agent possesses models of the other agents and follows the best response against the adopted models. In the case of a failure in prediction, it updates the wrong model to agree with the counterexamples.

We investigate this approach in two domains. For the first one, interactions among the agents are represented as repeated games. For the second one, interactions are represented by Alternating Markov games. The objective of the agent in both domains is to look for an interaction strategy that maximizes its expected sum of rewards in the game.

Chapter 2 describes repeated games and a general architecture for a model-based learning strategy suitable for this framework. We restrict the agents' strategies to regular strategies [83], i.e., strategies that can be represented by deterministic finite automata. First, based on previous work [73], we show that finding the best-response strategy can be done efficiently given some common utility functions for repeated games. Second, we show how an adaptive agent can infer an opponent model based on its interaction experience.

The model-based approach gives priority to actions with the highest expected utility according to the currently accumulated knowledge. But such a policy does not consider the effect of the agent's behavior on the learning process, and it ignores the contribution of the agent's activity to the exploration of the opponent's strategy. The agent therefore must make a tradeoff between the wish to exploit its current knowledge and the wish to explore other alternatives, to improve its knowledge for better decisions in the future. In Chapter 3 we describe methods for incorporating exploration into model-based learning. At early stages of learning, the model-based agent sacrifices immediate rewards to explore the opponent's behavior. The resulting better model will then yield a better interaction strategy.


One problem with the existing exploration methods is that they do not take into consideration the risk involved in exploration. An exploratory action taken by the agent tests unfamiliar aspects of the other agent's behavior, which can yield a better model of the other agent. However, such an action also carries the risk of putting the agent into a much worse position. We present the lookahead-based exploration strategy that evaluates actions according to their expected utility, their expected contribution to the acquired knowledge, and the risk they carry. Instead of holding one model, the agent maintains a mixed opponent model, a distribution over a set of models that reflects its uncertainty about the opponent's strategy. Every action is evaluated according to its long-run contribution to the expected utility and to the knowledge regarding the opponent's strategy, expressed by the posterior probabilities over the set of models. Risky actions are detected by considering their expected outcome according to the alternative models of the opponent's behavior.

In Chapter 4 we show experimentally the superiority of a model-based agent over a non-adaptive agent and over a reinforcement-learning agent with no explicit model. The experiments also demonstrate the need for exploration in on-line learning and show the superiority of lookahead-based exploration over uninformed exploration methods.

Chapter 5 describes Alternating Markov games and the multi-model search framework. In this framework, the model-based agent possesses an opponent model which is a recursive structure consisting of an evaluation function of the opponent and a player model (held by the opponent). M*, a search algorithm described in this chapter, is a generalization of minimax that utilizes this kind of recursive structure as an opponent model.

In Chapter 6 we describe some methods for incorporating pruning into the search algorithm. We prove sufficient conditions on the opponent model that enable pruning and describe two multi-model pruning algorithms that utilize these conditions. We prove the correctness and optimality of the algorithms and provide an experimental study of their pruning power. Chapter 7 describes an algorithm for learning an opponent model and demonstrates how M*, combined with the learning algorithm, can be incorporated into a game-playing program.

Finally, Chapter 8 summarizes the main contributions of the thesis, discusses some advantagesand limitations of the suggested approach, and points out some questions for future research.


Chapter 2

Model-based Learning in Repeated Games

Game theory and decision theory provide the formal tools for research on interaction among rational agents [65, 12]. An interaction is formalized as a game and the interacting agents as players who take their actions according to the game rules. Each player has a set of game moves that stands for its alternatives, and a utility function that stands for its preferences over the game outcomes. Game theory is mainly concerned with the solution of the game: the moves selected by the players, given their private preferences and beliefs.

We are interested in a sequence of interactions among the agents. For such a sequence, in addition to considering the game outcome, the agents should also consider the effect of their own behavior on the behavior of the other agents. For example, a seller who considers decreasing its price while bargaining should also consider the effect of this discount on the buyer's expectations for similar behavior in future interactions.

The repetition of interactions also enables the players to adapt their behavior according to their past experience. In situations where the activity of one agent affects the utilities of the other agents, the optimal behavior must be conditioned on the expected behaviors of the other agents in the system. Since the behaviors of the other agents that play the game are unknown in advance, the agents should adapt to the others during the repeated game.

This chapter addresses a model-based learning strategy for repeated games. The interaction strategy adjusts the beliefs of the model-based agent about the behaviors of the other agents, and follows the best response against the updated models. In Section 2.1, we outline our basic framework for a model-based adaptive agent and describe a general architecture for a model-based strategy for repeated games. In general, inferring the model-based strategy is computationally hard. Section 2.2 adopts a common convention in game theory and restricts the set of strategies available to the opponent to regular strategies. In Section 2.3, we show methods for inferring a best-response strategy against regular opponents. In Section 2.4, we present an unsupervised algorithm for inferring a model of the opponent's strategy from its input/output behavior. Portions of this chapter have appeared earlier in [19, 18, 21].

2.1 Interaction as a Repeated Game

To formalize the notion of interacting agents we consider a framework where an encounter between two agents is represented as a two-player game and a sequence of encounters as a repeated game; both are tools of game theory.

Definition 1 A two-player game is a tuple $G = \langle R_1, R_2, u_1, u_2 \rangle$, where $R_1, R_2$ are finite sets of alternative moves for the players (called pure strategies), and $u_1, u_2 : R_1 \times R_2 \rightarrow \Re$ are utility functions that define the utility of a joint move $(r_1, r_2)$ for the players.

For example, the Prisoner's Dilemma (PD) is a two-player game, where $R_1 = R_2 = \{c, d\}$ and $u_1, u_2$ are described by the payoff matrix shown in Figure 2.1. $c$ (Cooperate) is usually considered as cooperative behavior and $d$ (Defect) is considered as aggressive behavior. Playing $d$ is a dominant strategy for both players; therefore, the expected outcome of a single game between rational players is $(d, d)$.

              II: c     II: d
    I: c      3, 3      0, 5
    I: d      5, 0      1, 1

Figure 2.1: The payoff matrix for the Prisoner's Dilemma game. Each entry lists the payoff to player I followed by the payoff to player II.

A sequence of encounters among agents is described as a repeated game, $G^\#$, based on the repetition of the game $G$ an indefinite number of times. $G$ will be called the stage game of $G^\#$. At any stage $t$ of the game, the players choose their moves, $(r_1^t, r_2^t) \in R_1 \times R_2$, simultaneously.

Definition 2 A history $h(t)$ of $G^\#$ is a finite sequence of joint moves chosen by the agents until the current stage of the game:
$$h(t) = \left( (r_1^0, r_2^0), (r_1^1, r_2^1), \ldots, (r_1^{t-1}, r_2^{t-1}) \right) \quad (2.1)$$
$\lambda$ denotes the empty history. $H(G^\#)$ is the set of all finite histories for $G^\#$. $h_i(t)$ is the sequence of actions in $h(t)$ performed by player $i$.

Definition 3 A strategy for player $i$, $s_i : H(G^\#) \rightarrow R_i$, $i \in \{1, 2\}$, is a function that takes a history and returns an action. $S_i$ is the set of all possible strategies for player $i$ in $G^\#$.

Definition 4 A pair of strategies $(s_1, s_2)$ defines a path, an infinite sequence of joint moves $g$, while playing the game $G^\#$:
$$g_{(s_1,s_2)}(0) = \lambda$$
$$g_{(s_1,s_2)}(t+1) = g_{(s_1,s_2)}(t) \,\|\, \left( s_1(g_{(s_1,s_2)}(t)),\, s_2(g_{(s_1,s_2)}(t)) \right) \quad (2.2)$$
$g_{(s_1,s_2)}(t)$ determines the history $h(t)$ for the repeated game played by $s_1, s_2$.

Definition 5 A two-player repeated game based on a stage game $G$ is a tuple $G^\# = \langle S_1, S_2, U_1, U_2 \rangle$. $S_1, S_2$ are sets of strategies for the players, and $U_1, U_2 : S_1 \times S_2 \rightarrow \Re$ are utility functions. $U_i$ defines the utility of the path $g_{(s_1,s_2)}$ for player $i$.

Definition 6 $s^{opt}(s_j, U_i)$ will be called the best response for player $i$ with respect to strategy $s_j$ and utility $U_i$ iff $\forall s \in S_i$, $U_i(s^{opt}(s_j, U_i), s_j) \geq U_i(s, s_j)$.


In this work we consider two common utility functions for repeated games. The first is the discounted-sum function:
$$U_i^{ds}(s_1, s_2) = (1 - \gamma_i) \sum_{t=0}^{\infty} \gamma_i^t\, u_i\!\left(s_1(g_{(s_1,s_2)}(t)),\, s_2(g_{(s_1,s_2)}(t))\right) \quad (2.3)$$
$0 \leq \gamma_i < 1$ is a discount factor for future payoffs of player $i$. It can also be thought of as player $i$'s estimate of the probability that the game will be allowed to continue after the current move. It is easy to show that $U^{ds}(s_1, s_2)$ converges for any $\gamma < 1$.

The second is the limit-of-the-means function:
$$U_i^{lm}(s_1, s_2) = \liminf_{k \to \infty} \frac{1}{k} \sum_{t=0}^{k} u_i\!\left(s_1(g_{(s_1,s_2)}(t)),\, s_2(g_{(s_1,s_2)}(t))\right). \quad (2.4)$$
We assume that the players' objective is to maximize their utility function for the repeated game.
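As a concrete illustration, the following is a minimal sketch (hypothetical helper names, not the thesis code) that approximates the two utility functions over a finite prefix of a game path, using the Figure 2.1 payoffs for player 1. The (1 - gamma) normalization follows Equation 2.3.

    # Finite-prefix approximations of U^ds (Eq. 2.3) and U^lm (Eq. 2.4).
    # `path` is a list of joint moves (r1, r2); u1(r1, r2) is player 1's stage payoff.

    def discounted_sum(path, u1, gamma):
        return (1 - gamma) * sum((gamma ** t) * u1(r1, r2)
                                 for t, (r1, r2) in enumerate(path))

    def limit_of_means(path, u1):
        return sum(u1(r1, r2) for r1, r2 in path) / len(path)

    # Example with the Prisoner's Dilemma payoffs of Figure 2.1:
    pd = {('c', 'c'): 3, ('c', 'd'): 0, ('d', 'c'): 5, ('d', 'd'): 1}
    u1 = lambda r1, r2: pd[(r1, r2)]
    mutual_cooperation = [('c', 'c')] * 100
    print(discounted_sum(mutual_cooperation, u1, 0.9),
          limit_of_means(mutual_cooperation, u1))   # both close to 3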

The Iterated Prisoner's Dilemma (IPD) is an example of a repeated game based on the PD game that has attracted significant attention in the game-theory literature. Tit-for-tat (TFT) is a simple, well-known strategy for IPD that has proven to be extremely successful in IPD tournaments [6]. It begins with cooperation and imitates the opponent's last action afterwards. The best response against TFT with respect to $U^{lm}$ is to play "always c" (all-c). The best response with respect to $U^{ds}$ depends on the discount parameter [6]:
$$s^{opt}(TFT, U^{ds}) = \begin{cases} \text{all-c} & \gamma \geq \frac{2}{3} \\ \text{all-d} & \gamma \leq \frac{1}{4} \\ \text{alternate between } c \text{ and } d & \text{otherwise.} \end{cases}$$

In most cases there is more than one best response strategy. For example, cooperation with TFTcan be achieved by the strategy all-c, by another TFT, or by many other cooperative strategies.
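The case split above can be checked numerically. The following sketch (my own illustration, assuming the Figure 2.1 payoffs and comparing only the three candidate replies named in the text) evaluates the normalized discounted sums of the candidates against TFT.

    # Normalized discounted sums of three replies to TFT (Figure 2.1 payoffs):
    #   all-c:     3 every stage
    #   all-d:     5 once, then mutual defection (1 forever)
    #   alternate: d, c, d, c, ...  ->  5, 0, 5, 0, ...
    def best_reply_to_tft(gamma):
        candidates = {
            'all-c':     3.0,
            'all-d':     5 * (1 - gamma) + gamma,
            'alternate': 5.0 / (1 + gamma),
        }
        return max(candidates, key=candidates.get)

    assert best_reply_to_tft(0.8) == 'all-c'      # gamma >= 2/3
    assert best_reply_to_tft(0.2) == 'all-d'      # gamma <= 1/4
    assert best_reply_to_tft(0.5) == 'alternate'  # otherwise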

One of the basic factors affecting the behavior of agents in MAS is the knowledge that they possess about each other. In this work we assume that each player is aware of the other player's actions, i.e., $R_1, R_2$ are common knowledge, while the players' preferences, $u_1, u_2$, are private. In such a framework, while the history of the game is common knowledge, each player predicts the future course of the game differently. The prediction of player $i$, $g_{(s_i, s_j)}$, is based on the player's strategy $s_i$ and on the player's belief about the opponent's strategy, $s_j$. $s_j$ will be called an opponent model.

Note that this framework of incomplete information is slightly different from the framework usually studied in game theory. Aumann and Maschler [4] study repeated games with incomplete information in which players lack information on their expected payoffs; they are uninformed about the entries of the stage-game matrix. The main difference from our framework is the treatment of the missing knowledge. The players deal with the missing information by holding a subjective distribution over various possible payoff matrices. The uninformed player holds a model of the opponent's utility function, where each possible game matrix in the distribution is associated with a different type of opponent behavior. In our framework, the uninformed player holds a model of the opponent's strategy instead of a model of its utility function. In Section 5.1 we use this alternative framework when dealing with Alternating Markov games.

How can a player best acquire a model of its opponent's strategy? One source of informationavailable for the player is the set of examples of the opponent's behavior based on the history ofthe game. Another possible source of information is observed games of the opponent with otheragents.


Definition 7 An example of the opponent's strategy is a pair $(h(t), r_j^t)$, where $h(t)$ is a history of the repeated game and $r_j^t$ is the action performed by the opponent at stage $t$ of the game. A set of examples $D_j$ will be called a sample of the opponent's strategy. A learning algorithm $L$ receives a sample of the opponent's strategy $D_j$ and returns an opponent model $s_j = L(D_j)$. For any example $(h, r_j) \in D_j$, we denote $r_j$ as $D_j(h)$. We say that a model $s_j$ is consistent with $D_j$ if for any example $(h, r_j) \in D_j$, $s_j(h) = D_j(h)$.

Note that any history of length $t$ provides a sample $E_j(h(t))$ of $t+1$ examples of the opponent's behavior, since any prefix of $h(t)$ is also a history of the game: $E_j(h(t)) = \{(h(k), r_j^k) \mid 0 \leq k \leq t\}$. Therefore, any sample of the opponent's strategy is a prefix-closed set of examples (a set of sequences is prefix-closed iff every prefix of every member of the set is also a member of the set).

Given a learning algorithm $L_i$ and a utility function $U_i$, we can define the strategy $s_i^{U_i, L_i}$ of a model-based learning agent as
$$s_i^{U_i, L_i}(h(t)) = s_i^{opt}\!\left(L_i(E_j(h(t))),\, U_i\right)(h(t)).$$

The above definition yields a model-based player (MB-agent) that adapts its strategy during the game. An MB-agent begins the game with an arbitrary opponent model $s_j^0$, finds the best response $s_i^0 = s^{opt}(s_j^0, U_i)$, and plays according to the best response, $r_i^0 = s_i^0(\lambda)$. At any stage $t$ of the game, the MB-agent acquires an updated opponent model by applying its learning algorithm $L_i$ to the current sample of the opponent's behavior, $s_j^t = L_i(E_j(h(t)))$. It then finds the best response against the current model, $s_i^t = s^{opt}(s_j^t, U_i)$, and plays according to the best response, $r_i^t = s_i^t(h(t))$. Figure 2.2 illustrates the general architecture of an on-line model-based learning agent for repeated games.

Figure 2.2: An architecture for a model-based learning agent in repeated games. The history h(t) is fed to the learning algorithm L_i, which produces the opponent model s_j^t; the best-response inference module then computes s_i^t, which selects the next action r_i^t, and the joint move (r_i^t, r_j^t) is appended to the history.

The efficiency of this adaptive strategy depends mainly on the two main processes involved: 1) finding the best response against a given model; 2) learning the opponent model. In the following sections we will deal with these processes in more detail.
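The on-line loop implied by this architecture can be summarized as follows. This is a minimal sketch under assumed callables (learn_model, best_response, and opponent_move are hypothetical stand-ins for L_i, the best-response procedure, and the opponent), not the thesis implementation.

    def run_mb_agent(initial_model, learn_model, best_response, opponent_move, stages):
        history = []            # list of joint moves (r_i, r_j)
        model = initial_model   # arbitrary initial opponent model s_j^0
        for t in range(stages):
            strategy = best_response(model)   # s_i^t = s^opt(s_j^t, U_i)
            r_i = strategy(history)           # play the best-response move
            r_j = opponent_move(history)      # observe the opponent's move
            history.append((r_i, r_j))
            model = learn_model(history)      # s_j^{t+1} = L_i(E_j(h(t+1)))
        return history, model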

2.2 Interaction with Regular Opponents

Generally, implementation of the model-based learning strategy is too complicated, and sometimes even impossible, for agents with bounded rationality (agents with limited computational resources).


For example, Knoblauch [56] shows a recursive strategy (a strategy that can be represented by a Turing machine) for IPD without a recursive best response.

There is an extensive line of research in game theory on the problem of playing repeated games against computationally bounded opponents [72, 34, 36]. See Kalai [52] for a survey on bounded rationality in repeated games. The typical approach is to assume that the opponent's strategy is a member of some natural class of computationally bounded strategies. A natural measure of the complexity of a given strategy is the amount of memory needed for its implementation. In this work we adopt the common restriction that the opponent uses regular strategies, i.e., strategies that can be represented by deterministic finite automata (DFA) [83, 1]. The complexity of a regular strategy can be measured by the number of states of the minimal automaton implementing the strategy.

A DFA (Moore machine) is defined as a tuple $M = (Q, \Sigma_{in}, q_0, \delta, \Sigma_{out}, F)$, where $Q$ is a non-empty finite set of states, $\Sigma_{in}$ is the machine input alphabet, $q_0$ is the initial state, and $\Sigma_{out}$ is the output alphabet. $\delta : Q \times \Sigma_{in} \rightarrow Q$ is a transition function. $\delta$ is extended to $Q \times \Sigma_{in}^*$ in the usual way:
$$\delta(q, \lambda) = q$$
$$\delta(q, s\sigma) = \delta(\delta(q, s), \sigma)$$
$F : Q \rightarrow \Sigma_{out}$ is the output function. $M(s) = F(\delta(q_0, s))$ is the output of $M$ for a string $s \in \Sigma_{in}^*$. $|M|$ denotes the number of states of $M$. A strategy for player $i$ against opponent $j$ is represented by a DFA $M_i$ where $\Sigma_{in} = R_j$ and $\Sigma_{out} = R_i$. Given a history $h(t)$, the move selected by $M_i$ is $M_i(h_j(t))$. For example, Figure 2.3 shows a DFA that implements the strategy TFT for the IPD game.

Figure 2.3: A DFA that implements the strategy TFT for the IPD game: a two-state machine whose states output c and d, with transitions that copy the opponent's last move.
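For concreteness, the following is a small sketch (an assumed representation, not the thesis code) of a Moore machine with `delta` and `output` dictionaries, instantiated with the TFT strategy of Figure 2.3.

    class DFA:
        def __init__(self, q0, delta, output):
            self.q0 = q0            # initial state
            self.delta = delta      # dict: (state, input symbol) -> state
            self.output = output    # dict: state -> output symbol (the function F)

        def run(self, s):
            """Return M(s) = F(delta(q0, s)) for an input sequence s."""
            q = self.q0
            for sym in s:
                q = self.delta[(q, sym)]
            return self.output[q]

    # TFT: start by cooperating, then repeat the opponent's last move.
    tft = DFA(q0='C',
              delta={('C', 'c'): 'C', ('C', 'd'): 'D',
                     ('D', 'c'): 'C', ('D', 'd'): 'D'},
              output={'C': 'c', 'D': 'd'})

    assert tft.run('') == 'c' and tft.run('cdd') == 'd'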

2.3 Inferring a Best-response Strategy against Regular Opponents

The deterministic nature of regular strategies reduces the difficulty in finding the best response. Papadimitriou and Tsitsiklis [73] prove that the best-response problem can be solved efficiently for any Markov decision process with respect to $U^{lm}$ and $U^{ds}$. In this section we reformulate their theorem for DFA and provide a constructive proof that presents efficient methods of finding the best-response strategy against any given DFA.

Theorem 1 Given a DFA opponent model $M_j$ at state $q_j$, there exists a best-response DFA $M_i^{opt}(\langle M_j, q_j \rangle, U_i)$ such that $|M_i^{opt}| \leq |M_j|$ with respect to $U_i = U_i^{ds}$ and $U_i = U_i^{lm}$. Moreover, $M_i^{opt}$ can be computed in time polynomial in $|M_j|$.

Proof. Let $M_j = (Q_j, R_i, q_0^j, \delta_j, R_j, F_j)$ and assume w.l.o.g. that $q_j = q_0^j$.


$U_i = U_i^{lm}$: Let $u_{max}$ be the maximal payoff for player $i$ in the stage game $G$, and let $n = |M_j|$. Consider the underlying directed graph of $M_j$, where each edge $(q_j, \delta_j(q_j, r_i))$ is associated with the non-negative weight $w(q_j, \delta_j(q_j, r_i)) = u_{max} - u_i(r_i, F_j(q_j))$. It is easy to show that the infinite-sum problem for computing $U_i^{lm}$ is equivalent to finding a cycle in the graph that can be reached from $q_0^j$ with a minimum mean of weights (the minimum mean-weight cycle problem). The best-response automaton will follow the path from $q_0^j$ to the minimum mean-weight cycle and will then force the game to remain on that cycle.

The following procedure, which is a version of the shortest-path problem for directed graphs, finds the minimum mean-weight cycle that can be reached from $q_0^j$ [27]:

1. Let $P_k(q)$ be the weight of the shortest path from $q_0^j$ to the state $q$ consisting of exactly $k$ edges. $P_0(q), \ldots, P_n(q)$ can be computed as follows:
$$P_0(q) = \begin{cases} 0 & q = q_0^j \\ \infty & \text{otherwise} \end{cases} \qquad P_k(q) = \min_{q' \in Q_j} \{ P_{k-1}(q') + w(q', q) \}$$

2. Return
$$\min_{q \in Q_j}\; \max_{0 \leq k < n} \frac{P_n(q) - P_k(q)}{n - k}$$

The algorithm returns the average weight of the minimum mean-weight cycle and can be easily modified to return the cycle itself. It is polynomial in the size of the graph, and this completes the proof for $U^{lm}$.

$U_i = U_i^{ds}$: In this case the best response can be found by dynamic programming. Let $W(q, r)$ be a matrix in which every entry $(q, r)$ is equal to the expected discounted sum of rewards that corresponds to performing action $r$ at state $q \in Q_j$. The matrix entries can be computed iteratively by summing the immediate expected reward of performing action $r$ with the expected discounted future rewards:
$$W(q_j, r_i) = u_i(r_i, F_j(q_j)) + \gamma_i \max_{r_i' \in R_i} W(\delta_j(q_j, r_i), r_i'). \quad (2.5)$$
The entries of $W$ can be initialized to arbitrary values; by repetitive computation of Equation 2.5, the table entries will eventually converge to a stable solution [11]. The best response for player $i$, given any state $q_j$ of the opponent model, is:
$$opt(q_j) = \arg\max_{r_i \in R_i} W(q_j, r_i). \quad (2.6)$$
The best-response DFA is $M_i^{opt}(\langle M_j, q_j \rangle, U_i) = (Q_i, R_j, q_0^i, \delta_i, R_i, F_i)$, where $Q_i = Q_j$, $q_0^i = q_j$, $F_i(q_i) = opt(q_i)$, and $\delta_i(q_i, F_j(q_i)) = \delta_j(q_i, F_i(q_i))$. $M_i^{opt}$ is always in a state parallel to that of $M_j$ and always reacts optimally against it.
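The two-step procedure of the $U^{lm}$ case is Karp's minimum mean-weight cycle computation, and it can be coded directly. The following is a sketch (a hypothetical helper, assuming the model's graph is given as `edges[q]`, a list of `(next_state, weight)` pairs), not the thesis implementation.

    import math

    def min_mean_cycle_value(states, edges, q0):
        # P[k][q]: weight of the shortest walk from q0 to q using exactly k edges.
        n = len(states)
        P = [{q: math.inf for q in states} for _ in range(n + 1)]
        P[0][q0] = 0.0
        for k in range(1, n + 1):
            for q_prev in states:
                if P[k - 1][q_prev] == math.inf:
                    continue
                for q_next, w in edges[q_prev]:
                    P[k][q_next] = min(P[k][q_next], P[k - 1][q_prev] + w)
        best = math.inf
        for q in states:
            if P[n][q] == math.inf:
                continue
            ratios = [(P[n][q] - P[k][q]) / (n - k)
                      for k in range(n) if P[k][q] != math.inf]
            if ratios:
                best = min(best, max(ratios))
        return best   # average weight of the minimum mean-weight cycle reachable from q0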

The above theorem shows how to find the best-response automaton efficiently for any opponent DFA. It is important to note that the best-response procedure can be extended to any fixed number of automata by finding the best response against the product automaton [38]. The best-response problem becomes non-polynomial when the opponent plays according to mixed strategies and the size of the set of automata is not known in advance (the opponent's automaton is chosen from a finite set of automata according to a given distribution) [8]. Furthermore, the problem of finding the best-response automaton with a bounded number of states is NP-complete [72].
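For the $U^{ds}$ case, the value iteration of Equation 2.5 is equally direct. Here is a minimal sketch under assumed data structures (`delta[(q, r)]` is the model's transition, `F[q]` its output, and `u(r_i, r_j)` the player's stage payoff); repeated sweeps converge for gamma < 1.

    def best_response_ds(states, actions, delta, F, u, gamma, sweeps=1000):
        # W[(q, r)]: expected discounted sum of performing r at model state q (Eq. 2.5)
        W = {(q, r): 0.0 for q in states for r in actions}
        for _ in range(sweeps):
            for q in states:
                for r in actions:
                    q_next = delta[(q, r)]
                    W[(q, r)] = u(r, F[q]) + gamma * max(W[(q_next, r2)] for r2 in actions)
        # Eq. 2.6: the best-response action at each model state
        return {q: max(actions, key=lambda r: W[(q, r)]) for q in states}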


2.4 Learning Models of Regular Opponents

In the last section we have shown algorithms for finding the best-response strategy against a given opponent automaton. The opponent DFA model can be either supplied by an external source or learned on-line by the agent. This section describes algorithms for acquiring opponent DFA models based on interaction experience.

2.4.1 Introduction

Under the assumption that the opponent's strategy can be modeled as a DFA, and according to the Occam's Razor assumption that smaller models might have better predictive power, the MB-agent should infer the smallest DFA that is consistent with the sample of the opponent's past behavior in order to predict the opponent's actions in the future.

Finding the smallest finite automaton consistent with a given sample has been shown to be NP-complete [42, 2]. It has also been shown that the minimal consistent automaton cannot be approximated by any polynomial-time algorithm [76].

Thus, passive modeling of a given automaton from an arbitrary sample seems to be infeasible. Trakhtenbrot and Barzdin [102] describe a polynomial-time algorithm for constructing the smallest DFA consistent with a uniform complete training set; such a training set includes all possible strings up to a given length.

Angluin [3] deals with an arbitrary sample and describes the L* algorithm, which efficiently infers a minimal automaton model using a minimal adequate teacher, an oracle that answers membership and equivalence queries. The computational time is polynomial in the number of states of the automaton and in the length of the longest counterexample supplied by the teacher.

Another scenario was studied by Rivest and Schapire [79]. Their procedure simulates iterated interactions of a robot with an unknown environment, and is based on L*. Instead of an adequate teacher that answers queries, the learner is permitted to experiment with the environment (machine). When the learner is requested by L* to simulate a membership query for a specific input, it operates the machine and observes the result. If the model's prediction is different from the actual behavior of the machine, the learner treats it as a counterexample. To experiment with the machine, the learner is required to have the ability to reset it. In the case of an unresettable machine, Rivest and Schapire's algorithm uses "homing sequences" instead, special sequences of actions that, according to the machine's outputs, determine the final state of the machine. Given an input parameter $\epsilon > 0$, their procedure approximately identifies a minimal consistent DFA, with probability at least $1 - \epsilon$, in time polynomial in the machine's size, the length of the longest counterexample, and the logarithm of $\frac{1}{\epsilon}$.

L* maintains an observation table $(S, E, T)$ for representing the target DFA. $S \subseteq \Sigma_{in}^*$ is a prefix-closed set of strings. Each element $s$ of $S$ represents the state $\delta(q_0, s)$. $E \subseteq \Sigma_{in}^*$ is a suffix-closed set of strings called tests. $T$ is a two-dimensional table with one row for each element of $S \cup S\Sigma$, where $S\Sigma = \{s\sigma \mid s \in S, \sigma \in \Sigma_{in}\}$, and one column for each element of $E$. $row(s)$ denotes the table row associated with $s$. The table entries $T(s, e)$ are members of $\Sigma_{out}$ and record the output of the machine for the strings $se$. Table rows are partitioned into equivalence classes, $C(s) = \{row(s') \mid row(s') = row(s)\}$, where each class is associated with one of the model states. A DFA $M$ is consistent with an observation table $(S, E, T)$ iff for any entry $(s, e)$ in $T$, $M(se) = T(s, e)$.


Finding some legal assignment for the table holes (entries not supported by the sample) is easy. For example, the assignment that inserts the same output value into all the holes must be legal. We call it the trivial assignment. The problem becomes much harder if we look for a legal assignment that yields a closed and consistent table $(S, E, T)$. We call it the optimal assignment.

Theorem 2 (Gold) The problem of finding an optimal assignment for an observation table that covers a given sample is NP-hard.

Following Gold's results, an exhaustive search for an optimal assignment for a given table is too hard. L* overcomes this complexity problem by submitting membership queries to the teacher to fill the table holes. It then arranges the table to be closed and consistent and constructs the consistent model $M(S, E, T)$. It then submits an equivalence query, and the teacher either approves the model or returns a counterexample. L* builds a new table that covers the new example and repeats the process.

2.4.2 Unsupervised learning of DFA

The L* algorithm constructs an optimal assignment for a given table by consulting the teacher. Such an approach is not suitable for a repeated game against an autonomous self-interested agent. We propose to deal with this problem by considering unsupervised approaches ("unsupervised learning" usually denotes learning with unclassified examples; we use the term to point out that a teacher is not available). During encounters with the opponent, the adaptive agent holds a model consistent with the opponent's past behavior, and exploits the model to predict its behavior in the future. When a new example arrives, it can be a supporting example or a counterexample. In the first case, the model does not need any change. In the second case, the agent should extend the model to agree with the new counterexample.

We have developed an algorithm named US-L* (unsupervised L*) that extends the model according to the following three guiding principles:

- Consistency: The new model must be consistent with the given sample.

- Compactness: A smaller model might have better predictive power.

- Stability: The new model should be as similar as possible to the previous model.

First, the algorithm constructs a new observation table that covers the data, including the new counterexample. Table construction requires a teacher for answering membership queries. In the absence of such a teacher, the agent consults its previous model, following the stability principle. After that, it arranges the table to make it closed and consistent, and constructs a new model consistent with the new table.

US-L* maintains the same observation table as L* does. At the beginning, the algorithm inserts all the prefixes of the examples into $S$, and constructs $S\Sigma$ to include all their extensions. $E$ is initialized to include the empty test $\lambda$. Entries of the table are filled as follows: when an entry $(s, e)$ is supported by a past example, it is assigned the example's output value and it is marked as a permanent entry. When a table entry is not supported by a past example, it is assigned an output value predicted by the previous model, and it is marked as a hole entry.

The algorithm then arranges the table to make it consistent. In the case of an inconsistency, there are two $S$-rows, $s_1$ and $s_2$, belonging to the same equivalence class, but their $\sigma$-extensions do not agree (i.e., there are $\sigma \in \Sigma_{in}$ and $e \in E$ such that $T(s_1\sigma, e) \neq T(s_2\sigma, e)$). The inconsistency is solved by adding the test $\sigma e$ into $E$, an extension that separates $row(s_1)$ and $row(s_2)$ into two different equivalence classes and yields the addition of at least one new state to the model.

Next, the algorithm arranges the table to become closed, exactly as L* does. When the table is not closed, there is $s \in S\Sigma$ without any equal row in $S$. US-L* moves $s$ from $S\Sigma$ into $S$ and, for each $\sigma \in \Sigma_{in}$, adds $s\sigma$ into $S\Sigma$. Figure 2.5 shows the pseudo-code of the algorithm.

Algorithm: US-L*(D, M, t)
  D: a set of past examples of the target DFA.
  M: the current model consistent with D.
  t = (s, σ): a new example of the machine's behavior

  D ← D ∪ {t}
  if D(s) ≠ M(s)   {t is a counterexample}
    Init (S, E, T):
      S ← all prefixes of D
      ∀s ∈ S and σ ∈ Σ_in: if sσ ∉ S, SΣ ← SΣ ∪ {sσ}
      E ← {λ}
      ∀s ∈ S ∪ SΣ, T(s, λ) ← Query(s, λ)
    Consistency:
      While not Consistent(S, E, T)
        find two equal rows s1, s2 ∈ S, σ ∈ Σ_in, e ∈ E, such that T(s1σ, e) ≠ T(s2σ, e)
        E ← E ∪ {σe}
        ∀s ∈ S ∪ SΣ, T(s, σe) ← Query(s, σe)
    Closing:
      While not Closed(S, E, T)
        find s ∈ SΣ such that ∀s' ∈ S, row(s') ≠ row(s)
        move s into S
        ∀σ ∈ Σ_in:
          SΣ ← SΣ ∪ {sσ}
          ∀e ∈ E, T(sσ, e) ← Query(sσ, e)
  return M(S, E, T)

Query(s, e):
  if (s, e) is supported by D
    mark (s, e) as a permanent entry, return D(se)
  else
    mark (s, e) as a hole entry
    if (s, e) has a tied entry (s', e')   {s'e' = se}
      return T(s', e')
    else
      return M(se)

Figure 2.5: US-L*: an unsupervised learning algorithm for DFA.

Figure 2.6 describes an example of a learning session of the algorithm. Assume that $\Sigma_{in} = \{a, b\}$ and $\Sigma_{out} = \{0, 1\}$. The current model, described in Figure 2.6(a), is consistent with $D$. When a counterexample $t = (abb, 1)$ arrives, the algorithm initializes the observation table shown in Figure 2.6(b). The table is inconsistent. One inconsistency is $row(\lambda) = row(ab)$ while $row(b) \neq row(abb)$. This inconsistency is solved by adding the test $b$ into $E$; see Figure 2.6(c). A new inconsistency is introduced by $row(\lambda) = row(a)$ while $row(b) \neq row(ab)$. This inconsistency is solved by adding the test $bb$ to distinguish between the two rows; see Figure 2.6(d). Now the table is consistent and closed. Figure 2.6(e) shows the new model $M(S, E, T)$, returned by US-L*, that is consistent with $D \cup \{t\}$.

Figure 2.6: A learning session of US-L* with D = {(λ,0), (a,0), (ab,0)} and counterexample t = (abb,1). Holes are marked by squares. [Figure omitted: the initial model (a), the observation tables (b)-(d), and the resulting model (e).]

Theorem 3 Let $D$ be a set of examples of the machine's behavior, $M$ a DFA consistent with $D$, and $t$ a new example. Then US-L*(D, M, t) eventually terminates and outputs a model consistent with $D \cup \{t\}$, with size bounded by $|M| + |t|$. Moreover, if $k$ is the size of the set of all prefixes of the examples in $D \cup \{t\}$, then the total running time, and the size of the observation table used by US-L*, are bounded by a polynomial in $k$ and $|M|$.

To prove the theorem we need the following two lemmas:

Lemma 4 The consistency loop terminates after at most $|M| + |t|$ iterations and outputs a consistent table.

Proof. Define $row(S) = \{row(s) \mid s \in S\}$. First, we shall bound the number of equivalence classes in $row(S)$. The new counterexample $t$ can change the status of at most $|t|$ entries in $row(S)$ to become permanent and not supported by $M$. The number of equivalence classes of rows without changed entries is bounded by $|M|$, because any such row is supported by $M$ and is associated with one of $M$'s states. The number of equivalence classes in $row(S)$ can increase to at most $|M| + |t|$, because $M$ is consistent with all unchanged entries in $row(S)$, and there are at most $|t|$ rows with an entry not supported by $M$. Hence, the number of equivalence classes in $row(S)$ is bounded by $|M| + |t|$.

During the consistency loop, at each iteration we add a test into $E$. Adding a test increases the number of equivalence classes in $row(S)$ by at least one. Therefore, the number of iterations is bounded by $|M| + |t|$. This bound also holds for the size of $E$ and for the length of the tests in $E$. Termination occurs only when the table is consistent; therefore, the loop terminates and outputs a consistent table.

Lemma 5 The closing loop at the end of the algorithm terminates after at most $|M|$ iterations, does not change the consistency of the table, and outputs a closed table.

Proof. At the beginning of and during the closing loop, all entries in $row(S\Sigma)$ are holes and therefore are filled by the model $M$. Therefore, the number of equivalence classes in $row(S\Sigma)$ is bounded by $|M|$, because any row in $S\Sigma$ is associated with one of the states of $M$. At each iteration one row is moved from $row(S\Sigma)$ into $row(S)$. Therefore, the number of iterations is bounded by $|M|$. Termination occurs only when the table is closed, so the loop terminates and outputs a closed table. The consistency of the table is not changed during the closing loop, because any row that is moved into $row(S)$ is distinct from any other row in $row(S)$; thus, an inconsistency cannot be added to the table.

Now we can prove the correctness of the algorithm.

Proof of Theorem 3. If $t$ is a supporting example, the proof is trivial. If $t$ is a counterexample, the algorithm constructs $(S, E, T)$ to cover $D \cup \{t\}$. During table construction, any permanent entry is filled with the value of its supporting example, and any hole entry is filled with an old tied entry or by the current model. This filling strategy assures the legality of the assignment. All operations during the consistency and closing loops are polynomial in $k$ and $|M|$. The two previous lemmas show that the consistency and closing loops terminate and output a closed and consistent table in time bounded by a polynomial in $k$ and $|M|$. The size of the new model learned by the algorithm is determined by the number of equivalence classes in $S$; therefore, it is bounded by $|M| + |t|$. The size of $E$ is also bounded by $|M| + |t|$. The size of $S \cup S\Sigma$ is bounded by a polynomial in $k$ and $|M|$. Therefore, the table size is also bounded by a polynomial in $k$ and $|M|$.

2.4.3 Looking for a better hole-assignment

At the beginning of this section we outlined three principles that serve as guidelines for US-L*: consistency, compactness, and stability. The algorithm guarantees that the resulting model is consistent, and the hole-assignment policy that fills holes by using the previous model supports stability. However, Theorem 3 only guarantees that the size of the learned model is at most the size of the given sample. This result is not satisfying: in the worst case no generalization is made by the algorithm. In this subsection we introduce a modified algorithm that promotes compactness over stability by trying to avoid extensions of the table as much as possible.

When the modified algorithm arranges the table to make it consistent, it first attempts to change the hole assignments to solve an inconsistency instead of adding new tests. When $T(s_1\sigma, e) \neq T(s_2\sigma, e)$, if one entry is a hole and one is permanent, then the hole entry gets the output value of the permanent entry. When both are hole entries, the longer one gets the output value of the shorter one. Changing the value of a hole entry causes all its tied entries to get the same value, preserving the legality of the assignment. In order to prevent an infinite loop, any changed entry is marked and cannot be changed again. A new test is added only if both entries are permanent or both entries were already changed. Figure 2.7 shows the modified version of the consistency loop of the algorithm.

Consistency:
  While not Consistent(S, E, T)
    find two equal rows s1, s2 ∈ S, σ ∈ Σ_in, e ∈ E, such that T(s1σ, e) ≠ T(s2σ, e)
    if both (s1σ, e) and (s2σ, e) are permanent
       or both have been changed before   {we must distinguish between rows s1 and s2}
      E ← E ∪ {σe}
      ∀s ∈ S ∪ SΣ, T(s, σe) ← Query(s, σe)
    else
      if one entry is a hole which was not changed before (assume it is (s2σ, e))
         or both entries are holes which were not changed before (assume s1 is the shorter)
        T(s2σ, e) (and its tied entries) ← T(s1σ, e)
        mark (s2σ, e) (and its tied entries) as changed

Figure 2.7: The modified consistency loop that tries to solve the inconsistency of the table by changing the hole assignment prior to adding tests.

Figure 2.8 describes an example of a learning session of the modified algorithm with the same input as in Figure 2.6. The current model $M$, described in Figure 2.8(a), is consistent with $D$. When a counterexample $t = (abb, 1)$ arrives, the algorithm initializes the observation table shown in Figure 2.8(b). The table is inconsistent. One inconsistency is $row(\lambda) = row(ab)$ while $row(b) \neq row(abb)$. This inconsistency is solved by changing the hole value of $(b, \lambda)$ (Figure 2.8(c)). Another inconsistency is $row(a) = row(ab)$ while $row(ab) \neq row(abb)$. This inconsistency cannot be solved by changing hole values; therefore the algorithm adds the test $b$ into $E$ to distinguish between the two rows (Figure 2.8(e)). Now the table is consistent and closed. Figure 2.8(f) shows the new consistent model $M(S, E, T)$.

Theorem 6 Let $k$ be the size of the set of all prefixes of the examples in $D \cup \{t\}$. The total running time of the modified US-L*, and the size of the observation table used by the algorithm, are bounded by a polynomial in the sample size $k$ and the model size $|M|$.

Proof. During the modified consistency loop there are two kinds of iterations. For 'adding-test' iterations, any addition of a test distinguishes between at least two rows in $row(S)$. There are at most $k$ classes in $row(S)$; therefore, the number of 'adding-test' iterations is bounded by $k$. For 'changing-hole' iterations, where an inconsistency is solved by changing the hole assignment, since any hole can be changed only once, the number of such iterations is bounded by the table size $|S \cup S\Sigma| \cdot |E| \leq k^2(1 + |\Sigma_{in}|)$. The total number of iterations of the modified consistency loop is bounded by the sum of the two bounds, $k^2(1 + |\Sigma_{in}|) + k$.

For the closing loop, all entries of $row(S\Sigma)$ were filled by the old model or by tied entries that were changed during the consistency loop. There are at most $k^2(1 + |\Sigma_{in}|)$ changed entries and each one has at most $2k$ tied entries. Therefore, there are at most $2k^3(1 + |\Sigma_{in}|)$ rows in $row(S\Sigma)$ with at least one changed entry. Rows of $row(S\Sigma)$ without changed entries were filled by the old model; hence, at most $|M|$ distinct such rows can be added into $row(S)$ during the loop. The number of iterations can therefore be bounded by counting the number of distinct rows that might be added to $row(S)$: $2k^3(1 + |\Sigma_{in}|) + |M|$.

The consistency loop and the closing loop are both bounded by a polynomial in $k$ and $|M|$, and this completes the proof.

Figure 2.8: A learning session of the modified US-L* on the same input as Figure 2.6 (D = {(λ,0), (a,0), (ab,0)}, t = (abb,1)). [Figure omitted: panels (a)-(f) showing the initial model, the observation tables, and the resulting model.]

2.4.4 Iterative US-L*

When US-L* changes a hole value to solve an inconsistency of the observation table, the changed hole and all its tied entries are marked and cannot be changed again, to prevent infinite loops. Without this limitation, an inconsistency that was solved before could appear again while solving another inconsistency. However, there are many situations where re-changing hole values might save extensions of the table. For example, when two equal rows that include changed holes become unequal after adding a test into $E$, all holes belonging to these rows can be changed again in order to solve other inconsistencies of the table.

To reduce the size of the models generated by the algorithm, we modified US-L* to receive a limit parameter that specifies the maximal number of times a hole entry can be changed. Based on this modified algorithm, we developed an iterative version of US-L*, called IT-US-L*, that calls US-L* with an increasing limit parameter. The algorithm stops when the allotted computational resources are exhausted or when it sees no improvement for several iterations. Figure 2.9 shows the pseudo-code of the iterative algorithm.


Algorithm: IT-US-L*(D, M, t)
  D: a set of past examples of the target DFA.
  M: the current model consistent with D.
  t = (s, σ): a new example of the machine's behavior

  i ← 0, M_{-1} ← M
  Repeat
    M_i ← US-L*(D, M_{i-1}, t, i)
    i ← i + 1
  Until no improvement in model size for several iterations or
        computational resources exceeded
  return M_i

Figure 2.9: An iterative version of US-L*. The extra parameter i determines a limit on the maximal number of changes for each hole in the table.

Theorem 6, which shows the efficiency of the modified version of US-L*, can be easily extended to the iterative version. The bound on the number of iterations of the closing loop is not changed for any $i$. The bound on the number of iterations of the consistency loop can increase up to $k^2(1 + |\Sigma_{in}|) \cdot i + k$.

2.5 Summary

This chapter presents a model-based approach for learning an interaction strategy. Interaction is modeled by the game-theoretic concept of a repeated game. We present an architecture for a model-based on-line learner. The learner accumulates the history of the repeated game. The history is given to the learning procedure, which generates a consistent model. The model is then given to a procedure that infers a best-response strategy, which is used to decide on the action that should be taken. This process is repeated throughout the game.

Inferring best-response strategies is computationally hard. Therefore, a computationally bounded learner should limit the class of strategies available to its opponent. This chapter considers regular opponents, for which a best-response strategy can be efficiently computed. Learning a minimal DFA model without a teacher was proved to be hard. We present an unsupervised algorithm, US-L*, based on Angluin's L* algorithm. The algorithm efficiently maintains a model consistent with its past examples. When a new counterexample arrives, it tries to extend the model in a minimal fashion.

A major difficulty associated with the learning model concerns experimentation. The model-based strategy may leave some aspects of the opponent's behavior unexplored. It gives priority to actions with the highest expected utility, does not experiment with other alternatives, and may get stuck in sub-optimal solutions. The next chapter deals with this difficulty and describes methods for exploring the opponent's behavior.


Chapter 3

Exploring the Opponent's Strategy

The model-based strategy described in the previous chapter gives priority to actions with the highest expected utility according to the current opponent model. But such behavior does not consider the effect of the player's actions on the learning process, and it ignores the contribution of the player's activity to the exploration of the opponent's strategy. To interact effectively, the agent should explore the opponent's strategy so that it can plan its future activity.

The goal of this chapter is to develop exploration strategies for model-based agents. We first show how to incorporate RL exploration methods into model-based learning. One problem with the existing exploration methods is that they do not take into consideration the risk involved in exploration. An exploratory action taken by the agent tests unfamiliar aspects of the other agent's behavior, which can yield a better model of the other agent. However, such an action also carries the risk of putting the agent into a much worse position. We will present the lookahead-based exploration strategy that evaluates actions according to their expected utility, their expected contribution to the acquired knowledge, and the risk they carry. Instead of holding one model, the agent maintains a mixed opponent model, a distribution over a set of models that reflects its uncertainty about the opponent's strategy. Every action is evaluated according to its long-run contribution to the expected utility and to the knowledge regarding the opponent's strategy, expressed by the posterior probabilities over the set of models. Risky actions are detected by considering their expected outcome according to the alternative models of the opponent's behavior. Portions of this chapter have appeared earlier in [20, 22, 23].

3.1 Exploration versus Exploitation

Repeated interactions with another agent can be treated as a sequence of decisions that the agent must make under incomplete information. The tendency to minimize losses resulting from the agent's ignorance about the opponent's behavior requires resolving uncertainty as early as possible. Hence, in addition to accumulating rewards, the agent should also minimize the uncertainty about the opponent's behavior.

In essence, every action has two kinds of expected outcomes:

1. The expected reward according to the current knowledge held by the agent.

2. The expected influence on the acquired knowledge, and hence on future rewards expected to be received due to better planning.


The agent therefore must make a tradeoff between the wish to exploit its current knowledge and the wish to explore other alternatives, to improve its knowledge for better decisions in the future. This tradeoff is known in decision theory as the exploration-versus-exploitation dilemma. Agents that engage in exploration to the exclusion of exploitation are likely to find that they suffer the cost of experimentation without gaining the benefits of the accumulated knowledge. Conversely, agents that engage in exploitation to the exclusion of exploration are likely to find themselves trapped in sub-optimal behavior.

A good example of the above dilemma is the multi-armed bandit problem [10, 51]. A machine with k levers pays some amount of money each time a lever is pulled, independently, according to a fixed unknown distribution. The problem is to develop a strategy that gains the maximum payoff over time by choosing which lever to pull, based on previous experience of lever pulls and their associated payoffs. For this problem, exploitation corresponds to choosing the estimated best arm according to the current knowledge. Exploration corresponds to choosing another arm with the aim of making the estimates more accurate, and hence making better decisions in the future.
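The dilemma is easy to reproduce in a toy simulation. The following sketch is my own illustration (assuming Gaussian per-arm payoffs, which is not specified in the text): each pull updates a running estimate of the arm's mean payoff, and the strategy must decide between the currently best-looking arm and the others.

    import random

    def play_bandit(true_means, pulls=1000, epsilon=0.1):
        k = len(true_means)
        counts, estimates, total = [0] * k, [0.0] * k, 0.0
        for _ in range(pulls):
            if random.random() < epsilon:                 # explore: try another arm
                arm = random.randrange(k)
            else:                                         # exploit: best estimate so far
                arm = max(range(k), key=lambda a: estimates[a])
            reward = random.gauss(true_means[arm], 1.0)   # assumed noisy payoff
            counts[arm] += 1
            estimates[arm] += (reward - estimates[arm]) / counts[arm]
            total += reward
        return total, estimates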

Actions taken by the agent, together with the opponent's responses, are used as examples by the model-based learner. The "exploratory value" of an example can be evaluated by predicting the expected value of the revised model that results from processing the new example. A consistent example will strengthen the belief in the current model. A counterexample will lead to the construction of a new, better model by the learner.

In supervised learning, it is the teacher's role to provide counterexamples that guide the learning process. In the absence of an assisting teacher, the agent must search for counterexamples by itself. The model-based learner may encounter counterexamples by chance during the interaction process, as long as it possesses a model that is not equivalent to the opponent's strategy. But such an approach might converge to sub-optimal behavior. The deterministic nature of the model-based strategy decreases the likelihood of observing counterexamples, since following the same behavior repeatedly prevents exploration of unknown aspects of the opponent's strategy.

The left part of Figure 3.1 shows an example of an opponent's strategy for IPD. The opponent will defect as long as the player does, but will cooperate forever after one cooperation of the player. The best response against this strategy is "play c then all-d", which yields a utility of $\frac{5\gamma}{1-\gamma}$. If the player uses the model shown in the right part of the figure, then the best response corresponding to this model is "all-d", yielding a utility of $\frac{1}{1-\gamma}$, which is suboptimal for any $\gamma \geq \frac{1}{5}$. A model-based player holding this model will repeatedly defect, preventing it from exploring action c and observing counterexamples. Hence, the wrong model will never be corrected, and the player will be stuck with a sub-optimal strategy.

Figure 3.1: An example of an opponent model in a local minimum. (Left) An opponent's strategy. (Right) An opponent model. The model dictates playing "all-d" while the actual best response is "play c then all-d". [Figure omitted: the two automata.]

The above example demonstrates that it is sometimes better to play sub-optimally in order to explore the opponent's behavior. Action c has low utility according to the current model, but since it has not been tried yet, it is expected either to strengthen the belief in the current model, if the expectation is realized, or to provide a counterexample and therefore a better model. Therefore, action c has high expected exploratory value. Action d, on the other hand, has high expected utility, but has been tried many times already, and therefore has low expected exploratory value.

In the following section we modify the model-based learner to take both exploration and exploitation into consideration.

3.2 Exploration Strategies for Model-based Learning

The balance between exploration and exploitation is one of the main issues in learning control. In recent years, a variety of exploration strategies have been proposed for reinforcement-learning agents; for a comprehensive survey see [101]. In this section we describe some exploration methods for active learning and show how to adopt these methods for model-based strategies.

3.2.1 Undirected strategies

Undirected strategies are based on incorporating randomness into the decision procedure of the learning agent. The key idea is to associate a positive probability with every state-action pair. The agent randomly selects an action according to some given distribution, and therefore no action is neglected forever.

Random exploration is a simple undirected strategy that gives priority to the optimal action according to the current model but also gives positive probability to the other alternatives. If there are $n$ alternative actions, an action $r$ will be randomly chosen according to the following probability distribution:
$$\Pr(r) = \begin{cases} 1 - \epsilon & r \text{ is optimal} \\ \frac{\epsilon}{n-1} & \text{otherwise.} \end{cases}$$

The exploration parameter $0 \leq \epsilon < 1$ determines the weight given to exploration. As $\epsilon$ increases, the agent performs more sub-optimal actions and increases the chance of receiving counterexamples.
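The rule above is straightforward to implement; here is a minimal sketch (a hypothetical helper, not the thesis code) that plays the model's best response with probability 1 - epsilon and otherwise picks one of the remaining actions uniformly.

    import random

    def random_exploration(actions, optimal_action, epsilon=0.1):
        if random.random() < 1 - epsilon or len(actions) == 1:
            return optimal_action
        others = [a for a in actions if a != optimal_action]
        return random.choice(others)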

The uniform distribution used by random exploration associates an equal probability with every non-optimal action, without any utility consideration. A utility-based exploration strategy makes choices according to a distribution determined by the expected utility of the actions: the better an action is expected to be, the more likely it is to be selected.

Boltzmann exploration [97] is a utility-based strategy that reduces the tendency for exploration with time. It is based on the assumption that the current model improves as learning progresses. Boltzmann exploration assigns a positive probability to every possible action according to its expected utility and according to a decreasing parameter $T$ called a temperature. Assume that the agent has to choose among $n$ actions, $r_1, \ldots, r_n$, with expected utilities $U_1, \ldots, U_n$ according to its current knowledge. The Boltzmann distribution assigns a positive probability to every possible action:

\[ \Pr(r_i) = \frac{e^{U_i/T}}{\sum_{j=1}^{n} e^{U_j/T}}. \]

Actions with high utility are associated with higher probability. The temperature T decreases with the stage of the game t. Therefore, as learning progresses the distribution becomes more concentrated on high-utility actions, and the exploration tendency of the agent decreases. A common temperature function is T = α^t, where α < 1 is the exploration parameter. As α increases, the temperature decays more slowly, and so does the agent's tendency to explore.
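As a concrete illustration of these two undirected rules, the following sketch (Python; the action-to-utility table and helper names are assumptions of the sketch, and the temperature schedule T = 5·α^t with α = 0.999 is the one used later in the experiments) implements both selection procedures.

    import math
    import random

    def random_exploration(utilities, epsilon):
        """Epsilon-greedy selection: the optimal action (under the current
        opponent model) gets probability 1-epsilon, the others share epsilon."""
        best = max(utilities, key=utilities.get)
        others = [a for a in utilities if a != best]
        if others and random.random() < epsilon:
            return random.choice(others)       # explore a sub-optimal action
        return best                            # exploit the current model

    def boltzmann_exploration(utilities, stage, alpha=0.999, t0=5.0):
        """Boltzmann selection with a decaying temperature T = t0 * alpha**stage."""
        temperature = t0 * alpha ** stage
        weights = {a: math.exp(u / temperature) for a, u in utilities.items()}
        total = sum(weights.values())
        threshold, acc = random.random() * total, 0.0
        for action, weight in weights.items():
            acc += weight
            if acc >= threshold:
                return action
        return max(utilities, key=utilities.get)   # numerical fallback

Both rules assign every action a positive probability, so no action is neglected forever.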


In the model-based framework, the expected utility of an action r is computed by summing the instant expected reward of executing r and the discounted sum of rewards expected from playing the best response against the current model. Let ⟨M_j^t, q_j^t⟩ = L_i(h(t)) be a regular opponent model M_j^t at state q_j^t, inferred by the learning algorithm L_i given the history h(t). The expected utility of action r is defined to be:

\[ U_i(\langle M_j^t, q_j^t\rangle, r) = u_i\left(r,\, M_j^t(h(t))\right) + \gamma_i\, U_i^{ds}\left(s^{opt}(\langle M_j^t, q_j^{t+1}\rangle, U_i^{ds}),\ \langle M_j^t, q_j^{t+1}\rangle\right). \]

The first element in the sum stands for the immediate expected utility from playing r by the player and M_j^t(h(t)) by the opponent. U_i^{ds} is the expected discounted sum of rewards for player i that follows from continuing to play optimally after performing r. q_j^{t+1} = δ_j^t(q_j^t, r) is the new expected state of the opponent model following action r, and s^{opt}(⟨M_j^t, q_j^{t+1}⟩, U_i^{ds}) is the best response against the opponent model.

U_i can be computed efficiently, since any game between two automata converges to a finite cycle, and the discounted sum of rewards along the game-path between the opponent model and the best response can be computed by using the following procedure. Let g(⟨M_i^t, q_i^{t+1}⟩, ⟨M_j^t, q_j^{t+1}⟩) be the game-path between the two automata. This path induces an infinite sequence of states, P, in the opponent's automaton, that converges to a cycle:

\[ P = (q^0, q^1, \ldots, q^{k-1}, q^k, \ldots, q^{k+n-1}, q^k, \ldots). \]

P can be easily found by a simulation of a game played between the two automata. Let r^l be the player's action that changes the model state from q^l to q^{l+1}. The expected discounted sum of rewards for player i is:

\[ U_i^{ds}\left(\langle M_i^t, q_i^{t+1}\rangle, \langle M_j^t, q_j^{t+1}\rangle\right) = \sum_{l=0}^{k-1} \gamma_i^{l}\, u_i(r^l, F_j(q^l)) + \frac{\gamma_i^{k}}{1-\gamma_i^{n}} \sum_{l=k}^{k+n-1} \gamma_i^{l-k}\, u_i(r^l, F_j(q^l)). \]  (3.1)

The first element is the discounted sum of rewards attained while following the path leading to the cycle. The second element is the geometric sum of the infinite run along the cycle.
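For illustration, the sketch below (Python) computes this quantity by simulating the joint run of the two automata until the joint state repeats and then applying Equation 3.1; the dictionary-based DFA representation and the PD payoff table (3, 0, 5, 1, as in the figures of this chapter) are assumptions of the sketch.

    def pd_payoff(a_i, a_j):
        """Player i's Prisoner's Dilemma stage reward (values as in the figures)."""
        return {('c', 'c'): 3, ('c', 'd'): 0, ('d', 'c'): 5, ('d', 'd'): 1}[(a_i, a_j)]

    def discounted_sum(player, opponent, gamma):
        """Equation 3.1: discounted sum of rewards along the game-path of two DFAs.
        Each DFA is a dict with 'start', 'out' (state -> action) and
        'delta' ((state, other_action) -> state)."""
        qi, qj = player['start'], opponent['start']
        seen, rewards = {}, []                 # joint state -> index on the path
        while (qi, qj) not in seen:
            seen[(qi, qj)] = len(rewards)
            ai, aj = player['out'][qi], opponent['out'][qj]
            rewards.append(pd_payoff(ai, aj))
            qi, qj = player['delta'][qi, aj], opponent['delta'][qj, ai]
        k = seen[(qi, qj)]                     # first state of the cycle
        n = len(rewards) - k                   # cycle length
        prefix = sum(gamma ** l * r for l, r in enumerate(rewards[:k]))
        cycle = sum(gamma ** (l - k) * r
                    for l, r in enumerate(rewards[k:], start=k))
        return prefix + (gamma ** k / (1 - gamma ** n)) * cycle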

We can now use Equation 3.1 to define the probability assigned by Boltzmann exploration to action r, given the current model ⟨M_j^t, q_j^t⟩:

\[ \Pr(r) = \frac{e^{U_i(\langle M_j^t, q_j^t\rangle, r)/T}}{\sum_{r' \in R_i} e^{U_i(\langle M_j^t, q_j^t\rangle, r')/T}}. \]  (3.2)

3.2.2 Directed strategies

For undirected exploration strategies, the exploration policy of the agent is determined in advance and does not depend on the actual interactions with which the agent is involved. Directed strategies attempt to explore the opponent's behavior more efficiently by using statistics based on the learning experience of the agent. These methods compute an exploration bonus for each possible action and incorporate this bonus into the decision procedure of the agent. The bonus estimates the expected contribution of the action to exploration.


Several statistics have been proposed for determining the exploration bonus. Sato [88] uses action counts, where higher exploration bonuses are given to actions that have been chosen less frequently. Sutton [97] suggests recency-based exploration, an exploration bonus based on the time that has passed since the action was last taken. Kaelbling [51] suggests interval estimation, an exploration bonus based on the upper bound of the confidence interval computed for the expected utility of the possible actions. Moore & Atkeson [69] propose the prioritized-sweeping exploration strategy, where states found to be unfamiliar are connected to a fictitious state with high utility. The agent is thereby encouraged to investigate unfamiliar areas of the environment.

Directed strategies can be incorporated directly into model-based agents by traversing the history of the game and accumulating the above statistics. For example, the history can be used to compute an age parameter for every state-action pair. Recency-based exploration can be implemented by computing an exploration bonus proportional to the age.

Let E_i(s_j^t, r) be the exploration bonus computed for an action r, according to the current model s_j^t and the history h(t). The value for each action is computed by a linear combination of the expected utility of the given action, U_i, and the exploration bonus, E_i:

\[ V_i(s_j^t, r) = (1-\epsilon)\,\frac{U_i(s_j^t, r)}{\sum_{r' \in R_i} U_i(s_j^t, r')} + \epsilon\,\frac{E_i(s_j^t, r)}{\sum_{r' \in R_i} E_i(s_j^t, r')}. \]  (3.3)

As for undirected methods, ε is the exploration parameter that determines the ratio between exploration and exploitation. Thrun [101] deals with the question of combining exploration and exploitation; he shows that a dynamic combination, e.g., changing the tradeoff parameter ε during learning, may have an advantage over a static combination.

To summarize, at any stage t of the game, an exploring MB-agent updates the opponent model s_j^t = L_i(h(t)). It then finds the best response against the model, s_i^t = s^{opt}(s_j^t, U_i). When using an undirected exploration method, it computes Pr(r) for every action r ∈ R_i and randomly selects an action according to this distribution. When using a directed method, it computes the exploration bonus for every action r based on the history statistics, and then selects the action with the maximal combined utility.
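As a minimal sketch of this decision cycle for a directed agent (Python; the history format, the recency statistic, and the weight eps are assumptions of the sketch, with the bonus sqrt(age) used again in the experiments of Chapter 4):

    import math

    def recency_bonus(history, actions):
        """Exploration bonus sqrt(age), where age is the number of stages since
        the player last performed each action (recency-based exploration)."""
        last = {a: -1 for a in actions}
        for stage, (r_i, _r_j) in enumerate(history):
            last[r_i] = stage
        t = len(history)
        return {a: math.sqrt(t - last[a]) for a in actions}

    def combined_value(utilities, bonuses, eps):
        """Equation 3.3: a linear combination of normalized utility and bonus."""
        u_sum = sum(utilities.values()) or 1.0
        e_sum = sum(bonuses.values()) or 1.0
        return {a: (1 - eps) * utilities[a] / u_sum + eps * bonuses[a] / e_sum
                for a in utilities}

A directed MB-agent would compute the utilities from the best response against its current model, add the bonuses, and play the action with the maximal combined value.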

3.3 Lookahead-based Exploration

The exploration methods described in the previous section were developed to handle the problem of getting stuck in local minima. An exploring agent is willing to perform sub-optimal actions in order to acquire a better model of the opponent, yielding a better utility in the long run. It is quite possible, however, that the exploratory action will indeed yield a better model, but will also lead the agent to a poor state where even optimal play yields low utility. Sometimes knowledge is too expensive: taking a step into the dark can lead you to the sought treasure, or into a deep chasm. While falling into the chasm, the improved model of the world is not likely to help you much.

For example, the left part of Figure 3.2 describes a strategy for the IPD called Grim. A Grim player cooperates as long as its opponent cooperates, but never forgives defection: after a single defection of its opponent it reacts by repeatedly defecting. Assume that an exploring agent holds the model described in the right part of the figure. Defection has lower utility than cooperation according to the model; therefore, a non-exploring agent will always cooperate, yielding a utility of 3/(1-γ). An exploring agent will eventually try d and acquire a perfect model of Grim. However, the exploring action d takes the agent into the "defection sink" without any opportunity to get out, yielding a utility of (5-4γ)/(1-γ), which is lower than the utility of cooperation for any γ ≥ 1/2. Thus, exploration indeed yielded better knowledge, but the cost of acquiring this knowledge is too high.


Figure 3.2: Left: The Grim strategy for the IPD. Right: the current model held by the agent. An exploratory action against Grim (d) will be followed by falling into the "defection sink" without any opportunity to get out.

The above example clearly demonstrates that exploratory behavior requires a better mechanism for predicting the risk involved in taking sub-optimal actions. One of the problems with the exploration mechanisms described in Section 3.2 is the use of a utility function that assumes a stationary opponent model throughout the expected course of the game. This assumption is not rational for a learning agent who continuously modifies its opponent model. In order to develop a risk-sensitive exploration strategy, we need a mechanism that allows the agent to take into consideration the expected revision of its beliefs while computing the expected utility. Such a mechanism requires a method for representing the agent's uncertainty regarding its opponent's strategy.

In the rest of this section we describe a risk-sensitive exploration methodology. We first describe mixed strategies for representing uncertain opponent models, and an almost-best-response procedure against such strategies. We then show a learning mechanism for acquiring uncertain models.

3.3.1 Mixed strategies

The model-based agent described in the previous chapter maintains a regular model of the opponent's behavior. The regular model is incapable of representing the uncertainty of the learning agent regarding the model's prediction. To represent such uncertainty we apply a stronger mechanism for modeling the opponent's strategy: the game-theoretic concept of a mixed strategy.

Definition 9  A mixed strategy for a repeated game G# is a pair (S, π) : H(G#) → Δ(R_j) that maps the set of game histories, H(G#), into the set of distributions over the player's actions, Δ(R_j). S = {s_1, ..., s_k} is a finite set of (pure) strategies for G# called the set of support (SOS). π = (π_1, ..., π_k) is a probability distribution over SOS called the belief distribution, where π_l ≥ 0 for 1 ≤ l ≤ k, and Σ_{l=1}^{k} π_l = 1. A regular mixed strategy, (⟨M, q⟩, π), is a mixed strategy whose set of support includes k automata, M = (M_1, ..., M_k), in states q = (q_1, ..., q_k).

An agent uses a mixed strategy (S, π) by randomly drawing a strategy s_l from the set S, according to the belief distribution π, and performing action s_l(h(t)).

Mixed strategies can be used as opponent models in two different ways. In the first, we assume that the opponent applies a mixed strategy, i.e., it randomly chooses one of the strategies from the set of support at any stage of the game. Gilboa [38] shows how to play against n regular opponents by solving the best-response problem against the product automaton of the opponents' DFAs. A similar method can be used to find the best response against a regular mixed strategy. First, construct the product automaton of the automata belonging to the set of support, with an output function that is probabilistic according to the given distribution. Second, find the best-response automaton against the probabilistic product automaton. The problem of finding the best response against a probabilistic action automaton (PAA) is equivalent to the best-response problem against a deterministic automaton [36].

The above interpretation still assumes that the agent does not modify its opponent model throughout the game-path while computing the utility, and is therefore not suitable for the belief-revision framework. The second interpretation considers the opponent's strategy to be one of the models belonging to the set of support, and the belief distribution to reflect the subjective beliefs of the player in the alternative opponent models. This interpretation is suitable for the learning framework studied in this work. By choosing a specific action, the agent forces the opponent to respond. The opponent's response enables a revision of the belief distribution over the set of the opponent models. Following the opponent's response, r_j^t, the beliefs on all models that do not expect to perform r_j^t should be reduced to zero. The beliefs on all models that expect to output r_j^t should be increased.

Let (S, π) be the current mixed strategy, and let Pr(r_j) = Σ_{l=1}^{k} [π_l | s_l(h(t)) = r_j] be the probability of performing action r_j by the opponent at the current stage of the game. The posterior belief, π^{r_j}, can be computed directly by Bayes' law for every 1 ≤ l ≤ k:

\[ \pi_l^{r_j} = \begin{cases} 0 & s_l(h(t)) \neq r_j \\ \pi_l / \Pr(r_j) & \text{otherwise.} \end{cases} \]  (3.4)

Two useful propositions are implicit in the Bayesian belief-revision process:

• The confidence of the agent in its opponent models is reflected by its prior belief distribution. The more confident the agent is in one of its models, that is, the more concentrated the belief distribution is on one of the models in SOS, the more the posterior distribution will resemble the prior distribution for any given response. This is clear in the limit: an agent with absolute prior confidence, which associates a probability of 1.0 with one of the models, e.g., s_1, can learn nothing from new evidence. From Equation 3.4, if the prior probabilities π_l are all zero for every s_l ≠ s_1, the corresponding posterior probabilities will also be zero.

• The more "surprising" the opponent's response is, the bigger its impact on the posterior belief. In terms of Equation 3.4, a surprising response is associated with a low probability Pr(r_j). The smaller this probability is, the more the posterior belief diverges from the prior belief.
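Both observations can be seen directly in a small sketch of the revision rule (Python; the list-based representation of the set of support is an assumption of the sketch):

    def revise_beliefs(beliefs, predictions, observed):
        """Equation 3.4: zero the belief of every model whose prediction differs
        from the observed opponent action, and renormalize the rest.
        beliefs[l] is the prior of model l; predictions[l] is s_l(h(t))."""
        pr_observed = sum(b for b, p in zip(beliefs, predictions) if p == observed)
        if pr_observed == 0.0:
            raise ValueError("no model in the set of support is consistent")
        return [b / pr_observed if p == observed else 0.0
                for b, p in zip(beliefs, predictions)]

    # Example: revise_beliefs([0.5, 0.5], ['c', 'd'], 'd')  returns  [0.0, 1.0]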

We can now define the utility of the expected game-path between a learning strategy and a mixed model, while considering the modified beliefs throughout the path:

\[ U_i^{ds}(s_i, (S, \pi)) = \sum_{r_j \in R_j} \Pr(r_j)\left[ u_i(s_i(h(t)), r_j) + \gamma_i\, U_i^{ds}(s_i, (S, \pi^{r_j})) \right]. \]

Ben-Porath [8] shows that under this interpretation, the best-response problem against a regular mixed strategy is NP-complete. He also shows that the problem becomes polynomial in the size of the product automaton when the size of the set of support is fixed. However, the best-response procedure described by Ben-Porath is impractical, since the product automaton is extremely large even for a small set of support. In the next subsection we describe an alternative approach that is more suitable for a computationally bounded agent. We present an efficient algorithm that returns an almost-best-response automaton against a regular mixed model. The level of approximation can be determined in advance, according to the computational resources available to the agent at any stage of the game.


3.3.2 An almost-best response strategy against a regular mixed model

In this subsection we develop an algorithm that returns an almost-best-response automaton against a regular mixed model. The strategy returned by the algorithm enables the agent to (almost) optimally balance between exploration and exploitation throughout the game-path expected to be played. We also show that this strategy can be computed efficiently.

Given an opponent's strategy s_j and an approximation parameter ε, an ε-best response strategy with respect to s_j guarantees the player a utility that is smaller than the maximal possible utility against s_j by at most ε.

Definition 10  A strategy s_ε^{opt}(s_j, U_i) will be called an ε-best response for player i with respect to strategy s_j and utility function U_i, iff for every s ∈ S_i, U_i(s_ε^{opt}(s_j, U_i), s_j) ≥ U_i(s, s_j) - ε.

Let u_max be the maximal value of the stage game G, and let U_max = u_max/(1-γ) be the maximal utility in the repeated game G#. The algorithm receives a regular mixed model, (⟨M, q⟩, π), and a depth parameter d that is determined by the approximation parameter ε, d ≥ ln(ε/U_max)/ln(γ). The set of support includes k automata, M_l = (Q_l, R_i, q_l^0, δ_l, R_j, F_l), l = 1, ..., k, in states q = (q_1, ..., q_k), and a belief distribution π = (π_1, ..., π_k) with π_l > 0.¹ The algorithm returns a pair of an ε-best response automaton and the expected utility of the game against the given mixed model.

The algorithm classifies the states of the mixed strategy into certain states, where all the models predict the same output, and uncertain states, where at least two models disagree on the opponent's current action. For uncertain states, the algorithm splits the models of SOS into disjoint sets according to their expected output, M^{r_j} = {M_l ∈ M | F_l(q_l) = r_j}. Any opponent action reduces the number of models in the corresponding set of support, since the beliefs in all models that are not consistent with this action are reduced to zero. For each pair (r_i, r_j) of a player's action and an opponent's action, we construct a posterior mixed strategy, (⟨M^{r_j}, q^{r_i}⟩, π^{r_j}). The posterior belief, π^{r_j}, over M^{r_j} can be computed using Equation 3.4. The player's action, r_i, determines the new state of the posterior mixed model, q_l^{r_i} = δ_l(q_l, r_i), 1 ≤ l ≤ k.

The algorithm calls itself recursively with a reduced depth limit. The recursion stops when the search reaches the depth limit, or when the set of support includes only one model, for which the problem is reduced to the best-response problem against a single automaton.

Certain states do not reveal information to the player, since all the models predict the same output. Therefore, the set of support and the belief distribution cannot be modified. For such states, the models' current states are modified according to the player's action and the best response is found recursively with a reduced depth limit.

Finally, when the recursion terminates at depth d, the algorithm returns a one-state automaton, A_0, that outputs one of the player's actions for any possible history.

Denote the automaton returned by the recursive call for the pair (r_i, r_j) by A^{r_i,r_j}, and the expected payoff of the game to be played by U^{r_i,r_j}. Let r_i^* be the action that maximizes the expected utility of performing action r_i at the given state of the mixed model:

\[ r_i^* = \arg\max_{r_i} \sum_{r_j \in R_j} \Pr(r_j)\left[ u_i(r_i, r_j) + \gamma_i\, U^{r_i,r_j} \right]. \]

The ε-best response automaton begins with r_i^* at the initial state and then plays A^{r_i^*,r_j}, according to the opponent's action r_j.

¹We ignore models with a belief of zero.


Note that the automaton described above provides a d-step plan for the player for exploring and exploiting the mixed model. The plan optimizes the player's behavior by considering the expected utility as well as the information expected to be revealed throughout the alternative game-paths. This is in contrast to the infinite plan returned by the best-response procedure against a single automaton, which does not direct the agent's behavior when the opponent does not play as predicted. Figure 3.3 gives pseudo-code for the algorithm.

Procedure ε-BR((⟨M, q⟩, π), d)
  k ← |M|
  if (d = 0) or (k = 0) then return ⟨A_0, 0⟩
  if (k = 1) then return PureBR(⟨M, q⟩)
  else
    if uncertain(q)
      for each r_j ∈ R_j
        Pr(r_j) ← Σ_{l=1}^{k} [π_l | F_l(q_l) = r_j]
        M^{r_j} ← {M_l ∈ M | F_l(q_l) = r_j}
        for l = 1, ..., k
          if F_l(q_l) = r_j then π_l^{r_j} ← π_l / Pr(r_j)
          else π_l^{r_j} ← 0
        for each r_i ∈ R_i
          q_l^{r_i} ← δ_l(q_l, r_i), 1 ≤ l ≤ k
          ⟨A^{r_i,r_j}, U^{r_i,r_j}⟩ ← ε-BR((⟨M^{r_j}, q^{r_i}⟩, π^{r_j}), d-1)
      U ← max_{r_i} Σ_{r_j} Pr(r_j)[u_i(r_i, r_j) + γ_i U^{r_i,r_j}]
      r_i^* ← argmax_{r_i} Σ_{r_j} Pr(r_j)[u_i(r_i, r_j) + γ_i U^{r_i,r_j}]
      A ← (a DFA that begins with r_i^* and plays A^{r_i^*,r_j} according to r_j)
    else /* q is certain, i.e., all models predict the same action r_j^* */
      for each r_i ∈ R_i
        q_l^{r_i} ← δ_l(q_l, r_i), 1 ≤ l ≤ k
        ⟨A^{r_i,r_j^*}, U^{r_i,r_j^*}⟩ ← ε-BR((⟨M, q^{r_i}⟩, π), d-1)
      U ← max_{r_i} [u_i(r_i, r_j^*) + γ_i U^{r_i,r_j^*}]
      r_i^* ← argmax_{r_i} [u_i(r_i, r_j^*) + γ_i U^{r_i,r_j^*}]
      A ← (a DFA that begins with r_i^* and plays A^{r_i^*,r_j^*} according to r_j)
  return (A, U)

Figure 3.3: The ε-BR algorithm, which returns an ε-best response automaton against a regular mixed model.
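To complement the pseudo-code, the following simplified sketch (Python) computes only the expected utility and the first action of the depth-limited plan for the IPD; it reuses pd_payoff from the earlier sketch, assumes strictly positive beliefs, and, for brevity, returns a zero leaf value both at the depth limit and when a single model remains, where the exact algorithm would call the best-response procedure of the previous chapter.

    def lookahead_value(models, states, beliefs, depth, gamma):
        """A simplified reading of Figure 3.3: expected utility of the best
        depth-limited plan against a regular mixed model (IPD actions 'c', 'd')."""
        if depth == 0 or len(models) <= 1:
            return None, 0.0                   # PureBR would be used here
        predictions = [m['out'][q] for m, q in zip(models, states)]
        best_action, best_value = None, float('-inf')
        for r_i in ('c', 'd'):
            value = 0.0
            for r_j in set(predictions):       # a single r_j when the state is certain
                pr = sum(b for b, p in zip(beliefs, predictions) if p == r_j)
                kept = [(m, m['delta'][q, r_i], b / pr)
                        for m, q, p, b in zip(models, states, predictions, beliefs)
                        if p == r_j]
                ms, qs, pis = zip(*kept)
                _, future = lookahead_value(list(ms), list(qs), list(pis),
                                            depth - 1, gamma)
                value += pr * (pd_payoff(r_i, r_j) + gamma * future)
            if value > best_value:
                best_action, best_value = r_i, value
        return best_action, best_value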

Figure 3.4 demonstrates how the ε-best response strategy can avoid falling into the defection sink when playing the IPD against Grim. The top part of Figure 3.4 shows two models held by the exploring agent after t mutual cooperations in the IPD game. M0 is TFT, which predicts that the opponent will defect after defection but assumes a possible withdrawal by cooperation. M1 is the Grim strategy. The bottom part of the figure shows the computation of the best response. The player's actions are marked by solid lines and the opponent's actions are marked by dashed lines. To save space, duplicated subtrees are drawn only once. Action c has no exploration benefit, since all the models predict cooperation of the opponent after c. The utility of cooperation is U(c) = 3 + 5γ + 2γ³/(1-γ). The utility of defection is U(d) = 5 + 2γ²/(1-γ). Hence, c is preferred over d for any γ > 0.5. Note that the exploration strategy returned by the algorithm is "c then all-d". When the algorithm searches deeper, the returned strategy will postpone defection to later stages of the game-path.




Figure 3.4: Top: The two models belonging to the mixed model after t mutual cooperations in the IPD game. Bottom: The search tree spanned by ε-BR. The player's actions are marked by solid lines; the opponent's actions are marked by dashed lines. To save space, duplicated subtrees are drawn only once. Action c is preferred over d for any γ > 0.5.

The following theorem proves that algorithm ε-BR returns an ε-best response strategy against the given mixed strategy, with respect to U^{ds}.

Theorem 7  Let ε be a positive number, let ε' = ε/U_max, and let d ≥ ln(ε')/ln(γ). Let (⟨M, q⟩, π) be a regular mixed strategy. The algorithm ε-BR((⟨M, q⟩, π), d) returns an ε-best response automaton against (⟨M, q⟩, π) with respect to U_i^{ds}. The computation time is polynomial in 1/ε' and in the size of the maximal automaton belonging to the set of support.

Proof. The proof is by induction on the depth d. For d = 0 the proof is trivial, since ε ≥ U_max; hence, any strategy is an ε-best response. For the induction step, if the number of models is k = 1, then the algorithm returns the best response against a single DFA, which is also an ε-best response. For k > 1, assume correctness for d-1. Let s be the strategy returned by the algorithm and let s' be an arbitrary strategy. We denote the expected utility of s returned by a search to depth d by U_d, and the expected utility of s' by U'_d. We also denote by ε_d the approximation parameter corresponding to the given depth d. Note that from the relation between ε and d, it follows that γ ε_{d-1} = ε_d. If the current state is certain, i.e., all models predict the opponent to perform r_j^*, the utility returned by the algorithm is:

\[ U_d = \max_{r_i} [u_i(r_i, r_j^*) + \gamma_i U_{d-1}] \geq \max_{r_i} [u_i(r_i, r_j^*) + \gamma_i (U'_{d-1} - \epsilon_{d-1})] = \max_{r_i} [u_i(r_i, r_j^*) + \gamma_i U'_{d-1}] - \epsilon_d \geq U'_d - \epsilon_d, \]

where the first inequality follows from the induction hypothesis. A similar computation shows the claim for uncertain states.

For computing the complexity of the algorithm, note that the maximal number of computations is bounded by the number of leaves in the search tree, (|R_i| |R_j|)^d = O(poly(1/ε')), where poly() is a polynomial function. Each such computation is polynomial in the size of the maximal automaton.


3.3.3 Learning a regular mixed model

Describing the opponent model by a mixed strategy requires a learning procedure for acquiring alternative models for the set of support, SOS, and a computational procedure for determining the belief distribution over SOS. Ideally, the set of support should include all the models consistent with the given history, but such a set of strategies is infinite.

For restricting the set of support we concentrate on a subset of models consistent with the given history that differ in predicting the expected opponent's action at the next stage of the game. We use the learning algorithm of the model-based strategy, L_i, for constructing these models. By concatenating all the possible pairs of actions, (r_i, r_j), to the given history, and by applying the learning algorithm L_i to the new expanded histories, we acquire models that are consistent with the given history and predict the opponent's next action to be r_j at stage t+1 of the game. The player's action at stage t+1, r_i, does not affect the learning algorithm, since the opponent's action at stage t+1 is affected only by the player's previous actions. Hence, we can acquire at most |R_j| different models for the set of support.

Definition 11  Given a history h(t) and a learning algorithm L_i, the set of support, SOS_1(L_i, h(t)), is defined to be the set of models returned by the algorithm L_i applied to the history h(t) concatenated with one more joint action (r_i, r_j):

\[ SOS_1(L_i, h(t)) = \{ M \mid M = L_i(h(t) \cdot (r_i, r_j)),\ r_i \in R_i,\ r_j \in R_j \}. \]

After acquiring the different models for the set of support, the agent should determine its belief distribution over the set. Since all models are consistent with the given history, each one can serve as an opponent model. However, some models seem more reasonable than others. The agent can choose any way of computing its subjective beliefs; the only restriction is that its beliefs be non-negative and sum to one. According to Occam's razor, a reasonable belief probability for a given model should be in inverse relation to its size, i.e., the smaller the model, the larger its associated belief. The beliefs should also be in direct relation to the coverage of the models: the smaller the number of edges of the model that have been tried by the agent during the given history, the smaller the agent's belief in the given model:

\[ \pi_j \propto \frac{cover(h(t), M_j)}{|M_j|}. \]
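A minimal sketch of this construction and of the Occam-style prior (Python; learn, cover, and model_size stand for the learning algorithm L_i, the coverage measure, and |M|, and are assumptions of the sketch):

    def sos1(learn, history, player_action, opponent_actions):
        """Definition 11: run the learner on the history extended by one joint
        action per possible opponent response (the player's action is irrelevant)."""
        models = []
        for r_j in opponent_actions:
            model = learn(history + [(player_action, r_j)])
            if model not in models:            # keep only distinct models
                models.append(model)
        return models

    def occam_beliefs(models, history, cover, model_size):
        """Beliefs proportional to cover(h, M) / |M|, normalized to sum to one."""
        scores = [cover(history, m) / model_size(m) for m in models]
        total = sum(scores)
        return [s / total for s in scores]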

Figure 3.5 shows an example of a mixed strategy acquired after a history of t mutual defections in the IPD game. M0 is a one-state model that predicts consecutive defections by the opponent. M1 is a (t+1)-state model that predicts consecutive cooperations by the opponent after the t defections. Note that the belief in M1 is much smaller than the belief in M0. For large t, it is not reasonable to predict cooperation in the future after t consecutive defections.

Modeling the opponent's strategy by a mixed strategy enables the agent to quantify its uncertainty about the opponent's behavior and to make better decisions under uncertainty. The strategies acquired for the set of support were inferred by predicting different responses of the opponent to the given history. For testing the influence of the agent's behavior on the opponent's behavior, we should search some stages deeper, testing the effects of the agent's actions on the opponent's expected responses.

When we search d stages forward, we consider all possible expanded histories of length t+d, h(t)·(r_i^t, r_j^t)^d.



Figure 3.5: A mixed strategy acquired after t mutual defections in the IPD game. M0 predicts future defection by the opponent; M1 predicts cooperation. The beliefs are computed in relation to the model's size and the model's cover according to the game history.

By concatenating all the possible sequences of d joint actions (r_i, r_j) to the given history and applying the learning algorithm to the expanded histories, we acquire models that are consistent with the history h(t) and that differ in the opponent responses they predict for the player's action sequences. Note that the player's action at stage t+d does not affect the learning algorithm, since the opponent's action at stage t+d is affected only by the player's previous actions. Hence, we can acquire at most |R_i|^{d-1} |R_j|^d different models for the set of support.

Definition 12  Given a history h(t) and a learning algorithm L_i, the set of support, SOS_d(L_i, h(t)), is defined to be the set of models returned by the algorithm L_i applied to the history h(t) concatenated with d more joint actions, (r_i, r_j)^d:

\[ SOS_d(L_i, h(t)) = \{ M \mid M = L_i(h(t) \cdot (r_i, r_j)^d),\ r_i \in R_i,\ r_j \in R_j \}. \]

To summarize, for exploring the opponent's strategy using a mixed model, the agent first searches d stages forward to collect different opponent models for the set of support. It then infers a belief distribution over this set. Following that, it finds the ε-best response against the mixed model and performs the sequence of actions dictated by this strategy. By doing so, the agent rationally balances between exploration and exploitation, and reduces the risk involved in performing sub-optimal actions. This may carry additional computational cost, but the investment may be profitable when the risk involved in exploration is high.

3.4 Related Work

Sequential decision tasks with incomplete information have long been studied in decision theory and artificial intelligence. In all these contexts, an agent plans its actions while learning about the environment in order to achieve its goals. Such an agent should be able to deal with the tradeoff between exploration and exploitation of the accumulated information.

Most of the exploration strategies developed in reinforcement learning are of a heuristic nature. Recently, some utility-based strategies have been suggested that guide exploration by considering the effects of a given plan on the reduction of the agent's uncertainty about its model of the world. Dayan and Sejnowski [28] claim that the agent's uncertainty should drive its exploration behavior. They describe a method based on the certainty-equivalence approximation, which uses the mean values of the state variables, instead of the variables themselves, in expressions that determine the utility of the agent's actions. Optimal experiment design [31] is concerned with the design of experiments that are expected to minimize the variances of the parameterized model, and therefore to maximize the confidence in the given model. When applying this method to sequential decision making, we estimate how the addition of a new training example is expected to change the computed variance. Cohn [26] applies this method for selecting examples to train an artificial neural network. Karakoulas [54] presents a probabilistic algorithm that applies a similar statistical selection procedure to decide how much exploration is needed for selecting a plan.

The lookahead-based exploration method described in this work also drives exploration according to the player's uncertainty about its opponent model, but it considers the effect of exploration on the agent's position as well. The knowledge and utility accumulated throughout the expected game-path are evaluated together and allow the agent to (almost) optimally control its behavior.

Model-based learning of interaction strategies in repeated games has received a lot of attention in the game-theory literature, especially in the context of the expected outcome of a repeated game played by learning players. Kalai & Lehrer [53] describe a repeated game among Bayesian players. Each starts with subjective beliefs about the individual strategies used by each of its opponents. It then uses these beliefs to compute its own optimal strategy. In addition, the players update their subjective beliefs by a Bayesian updating rule throughout the game. They show that under some constraints, a game between Bayesian players will converge to a Nash equilibrium. Their work does not deal with exploration, but mentions that in a rational-choice model, the optimal determination of when and how to experiment is difficult.

The above model of learning implicitly assumes that the players are at least capable of solving the problem of finding the optimal response given the strategies of the other players. However, the best-response problem is intractable in the general case. Gilboa & Samet [39] deal with bounded regular players. They describe a model-based learning strategy for repeated games that learns the best response against any regular strategy. Their procedure enumerates the set of all automata and chooses the current opponent model to be the first automaton in the sequence that is consistent with the current history. Exploration is achieved by designing a sequence of actions that distinguishes between the current model and the next consistent automaton in the enumeration. The risk involved in exploration is bypassed by assuming that the opponents' strategies are limited to strongly connected automata, where there are no "sinks" and there is always an opportunity to recover. For such automata, the learning algorithm is guaranteed to converge to the best response in the limit. This learning procedure is based on exhaustive search in the space of automata and is therefore impractical for computationally bounded agents.

Fortnow & Whang [34] show that for any learning algorithm there is an adversarial regular strategy for which the learning process converges to the best response only after at least an exponential number of stages. One way of dealing with this complexity problem is to limit the space of strategies available to the opponent [35, 80]. Mor et al. [70] follow this paradigm and show that for a limited class of regular strategies, the best-response automaton can be learned efficiently. In these methods, exploration is achieved by a random walk, that is, by incorporating randomness into the decision procedure of the agent. Such learning methods embark on long exploration sequences during the course of the game. The high cost of the exploration sequences diminishes in infinite games for the limit-of-the-mean utility function, which measures only asymptotic performance. However, these methods may fail for the discounted-sum utility function, which also takes immediate rewards into account.


3.5 Summary

This chapter studies exploration methods for model-based learning in multi-agent systems. Exploration is essential for a learning agent while adapting to the other agents, for exploring better alternatives and for avoiding being stuck in sub-optimal behavior. Several exploration strategies, undirected and directed, have been developed for the reinforcement-learning paradigm. Section 3.2 describes ways to incorporate these methods into model-based learning.

There are two issues that should be considered when dealing with exploration. The first is the balance between exploration and exploitation. The second is the risk involved in exploration. The undirected and directed exploration methods do not offer a rational agent a way to consider both factors when interacting with others. The lookahead-based exploration strategy presented in this work deals with these two issues. The opponent's strategy is represented by a mixed model, a distribution over a set of strategies. Every action is evaluated according to its long-run contribution to the expected utility and to the knowledge regarding the opponent's strategy. Risky actions are detected by considering their expected outcome according to the alternative models of the opponent's behavior.

The lookahead-based exploration method is computationally expensive. An important assumption of this work is that it is rational for the agent to invest in computation in order to save interaction. The lookahead-based method allows a bounded-rational agent to control the cost of computation by adjusting the lookahead depth.

To conclude, an agent that interacts with other agents in a MAS can benefit significantly from adapting to the others by exploring their behavior. The model-based strategy, combined with exploration methods, can serve adaptive agents in multi-agent systems. The next chapter experiments with the model-based strategy and tests its capabilities in different environments.


Chapter 4

Experimentation: On-line Learning in

Repeated Games

We conducted a set of experiments to test the capabilities of a model-based learner in repeated games. The first experiment simulates a scenario where the agent interacts with other unknown agents in a common environment; the opponents are represented by random automata. The second experiment compares the model-based approach with the reinforcement-learning approach, the third tests the capabilities of the different exploration methods for repeated games, and the fourth tests the MB-agent against non-random automata.

4.1 Experimentation Methodology

Each experiment consists of 100 repeated IPD games of length 400 between the tested agent and randomly generated opponents. 100 random opponent automata were generated by choosing a random transition function and a random output function¹ (a sketch of such a generator appears after the list of measures below). The random machines were minimized by taking out all unreachable states and by using the DFA-minimization algorithm [44]. The agent's performance was measured using the following dependent variables:

1. Relative utility: Since the actual opponent strategy ⟨M_j, q_j^t⟩ is known to the experimenter, the expected utility of a model ⟨M_j^t, q_j^t⟩ can be computed by the infinite discounted sum of rewards of the game between the opponent automaton and the best response against the given model:

   \[ U_{exp}(\langle M_j^t, q_j^t\rangle) = U_i^{ds}\left( s^{opt}(\langle M_j^t, q_j^t\rangle),\ \langle M_j, q_j^t\rangle \right). \]

   The computation of U_exp can be done efficiently according to Equation 3.1. Instead of using absolute utilities, we use the relative utility, which is the ratio between the expected utility of the current model and the expected utility of the best possible model, the actual opponent DFA:

   \[ U_r(\langle M_j^t, q_j^t\rangle) = \frac{U_{exp}(\langle M_j^t, q_j^t\rangle)}{U_{exp}(\langle M_j, q_j^t\rangle)}. \]

¹There are other methods for generating random automata. Lang [61] describes a generator for random DFAs where the main input parameter is the machine's depth, i.e., the maximal length between two machine states.


                        Relative Utility   Av. Cumulative Reward       Model size
                                           Agent         Opponent
  TFT                   0.6  (0.3)         2.24 (0.05)   2.24 (0.05)   2 (0)
  MB-Agent              0.86 (0.95)        3.37 (0.33)   0.65 (0.06)   12.21 (18.87)
  Exploring MB-Agent    0.96 (0.53)        3.62 (0.36)   1.03 (0.1)    72.20 (123.1)

Table 4.1: The cumulative reward, relative utility, and model size attained by the agents after 400 stages of the PD game. The results are averaged over 100 trials. The standard deviation is given in parentheses.

2. Average cumulative reward: The cumulative reward of the game divided by the number of stage games:

   \[ \frac{\sum_{k=0}^{t-1} u_i(r_i^k, r_j^k)}{t}. \]

3. Model size: The number of model states.
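As referenced above, a sketch of the opponent generator (Python; only the random construction and the removal of unreachable states are shown, while the full DFA-minimization step [44] is omitted):

    import random

    def random_dfa(num_states, inputs=('c', 'd'), outputs=('c', 'd'), seed=None):
        """A random IPD opponent: a random output per state and a random
        transition per (state, input); state 0 is the initial state."""
        rng = random.Random(seed)
        out = {q: rng.choice(outputs) for q in range(num_states)}
        delta = {(q, a): rng.randrange(num_states)
                 for q in range(num_states) for a in inputs}
        return {'start': 0, 'out': out, 'delta': delta}

    def reachable_states(dfa, inputs=('c', 'd')):
        """States reachable from the initial state; unreachable ones are dropped
        before minimization."""
        frontier, seen = [dfa['start']], {dfa['start']}
        while frontier:
            q = frontier.pop()
            for a in inputs:
                nxt = dfa['delta'][q, a]
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append(nxt)
        return seen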

4.2 On-line Learning of Random Opponents

The goal of the first experiment is to compare the performance of three strategies: non-adaptive, adaptive without exploration, and adaptive with undirected exploration. The learning parameters used by the MB-agent are γ_i = 0.9 and a temperature function T = 5·α^t with α = 0.999 for the Boltzmann exploration strategy. A set of 100 repeated IPD games of length 400 was conducted with each of the agents playing against randomly generated opponents of size 20.

The results obtained are shown in Table 4.1. The adaptive players achieved significantly better results than the non-adaptive TFT player. TFT achieves only 2.24 points, which is the average payoff of the PD game (its random opponents achieved similar results). The adaptive players, which start with no previous knowledge, achieve much more: 3.62 points by the exploring agent and 3.37 by the non-exploring agent. It is also clear that the exploring model-based agent performed better than the non-exploring model-based agent.

Figure 4.1 shows the relative utility of the inferred model and the cumulative reward as a function of the game stage, averaged over the 100 trials. The graphs show that both adaptive players converge quite quickly to very successful opponent models. The exploring agent manages to learn models with an average relative utility of 0.96. The non-exploring one manages to learn models with a utility of 0.86.

The graph of the average cumulative reward highlights an interesting phenomenon. At early stages of the game, the exploring agent pays for the sub-optimal decisions induced by its "curiosity". However, the better model generated by the exploring agent pays off in later stages of the game, and the exploring strategy clearly outperforms the non-exploring strategy.

Figure 4.2 shows the average size of the models learned by the MB-agent during the game. It is interesting to note that the average model size levels off quite quickly for both agents. The non-exploring agent infers much smaller models than those inferred by the exploring agent. The reason is that the non-exploring agent is trapped in local minima with sub-models that are smaller but less useful.



Figure 4.1: On-line learning: (Left) The average cumulative reward attained by the MB-agent during the game. (Right) The average relative utility of the inferred models during the repeated game.


Figure 4.2: The average size of the inferred models during the game


The results for the MB-agents raise the question of how much exploration is needed during interaction. The answer depends on the "greed" of the learning agent: the more weight the agent gives to future payoffs, the more resources it should spend on exploration. This weight is expressed by the discount parameter γ. We repeated the last experiment for various values of γ. For each γ we tried several values of the exploration parameter α used by the Boltzmann exploration strategy, and recorded the one for which the agent achieved the best performance. The left part of Figure 4.3 shows the best α for each γ. As expected, as γ increases, it is better to invest more in exploration by using a larger α.


Figure 4.3: Left: The best exploration parameter for various discount parameters. Right: On-line learning of random automata of various sizes.

The right part of Figure 4.3 shows the effect of increasing the complexity of the opponent strategy on the learning process. The three curves show the average results attained by an MB-agent with Boltzmann exploration with α = 0.99, against 100 random automata of sizes 20, 40, and 60. The results show that increasing the size of the automata slows the rate of adaptation. Learning effective models for complex automata demands more examples, and hence the learning process converges more slowly.

4.3 Model-based Learning vs. Reinforcement Learning

In the second experiment we compared model-based learning with reinforcement learning (RL). RL is based on the idea that the tendency to produce an action should be reinforced if it produces favorable results, and weakened if it produces unfavorable results. Q-learning [104] is a well-known RL algorithm that works by estimating the values of all state-action pairs. The Q-value of the pair (s, r), where s is the current state and r is the current action, is estimated on the basis of experience:

\[ Q(s, r) \leftarrow (1-\beta)\, Q(s, r) + \beta\left( u + \gamma \max_{r'} Q(s', r') \right), \]

where u is the current reward, s' is the new world state, β < 1 is the learning rate, and γ is the discount parameter. The algorithm is guaranteed to converge to the true Q-values when the world is Markovian and stationary and no state-action pair is neglected forever. The last condition can be achieved by using any exploration strategy that guarantees a positive probability for any state-action pair at any stage of the learning process. In the following experiment the Q-agent uses the Boltzmann distribution, which selects an action according to the probability:

\[ \Pr(r) = \frac{e^{Q(s,r)/T}}{\sum_{r' \in R_i} e^{Q(s,r')/T}}. \]

For repeated games, an entire history is needed to represent a game state. In such a framework, any state is visited only once and no generalization can be made. A possible alternative is to use a fixed window of previous moves to represent a state. For the iterated PD game and a window of width w, the number of states to be handled is 2^{2w}. For example, for a window of length 1 the game states are {cc, cd, dc, dd}. Too wide a window yields a table too sparse to allow convergence of the learning process in practical time. Too narrow a window can cause perceptual aliasing, i.e., different states can appear identical and therefore be represented by the same state.
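A sketch of the Q-agent used in this comparison (Python; the window state, the Boltzmann selection, and the parameter values follow the description above, while the class structure and names are assumptions of the sketch):

    import math
    import random
    from collections import defaultdict

    class WindowQAgent:
        """Tabular Q-learning over a window of the last w joint actions."""

        def __init__(self, w=2, gamma=0.9, actions=('c', 'd')):
            self.w, self.gamma, self.actions = w, gamma, actions
            self.q = defaultdict(float)        # (state, action) -> estimated value
            self.window = ()                   # the last w joint actions

        def select(self, temperature):
            """Boltzmann action selection over the Q-values of the current state."""
            weights = [math.exp(self.q[self.window, a] / temperature)
                       for a in self.actions]
            return random.choices(self.actions, weights=weights)[0]

        def update(self, my_action, opp_action, reward, beta):
            """Q(s,r) <- (1-beta) Q(s,r) + beta (u + gamma max_r' Q(s',r'))."""
            new_state = (self.window + ((my_action, opp_action),))[-self.w:]
            target = reward + self.gamma * max(self.q[new_state, a]
                                               for a in self.actions)
            key = (self.window, my_action)
            self.q[key] = (1 - beta) * self.q[key] + beta * target
            self.window = new_state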

Q-learning was tried in repeated games against Tit-for-tat (TFT) by Sandholm and Crites [87]. The Q-agent succeeded in learning an optimal strategy against TFT using a window of length one, but it needed about 100,000 iterations for convergence. Similar results were obtained by us.

We repeated the previous on-line learning experiments, comparing a Q-agent with an MB-agent using the discounted-sum utility function. The two agents used the same discount parameter, γ_i = 0.9, and a Boltzmann exploration strategy with temperature function T = 50·0.999^t. Figure 4.4 shows the average reward of both agents. They both started with random strategies and played 400 stages of PD games against 100 random opponents of size 20. The Q-agent used a window of width 2 and a learning rate β = 0.1·0.999^t. The results were averaged over the 100 trials.


Figure 4.4: The average cumulative reward of the MB-agent and the Q-agent while playing against random automata during the repeated PD game.

It is quite clear that the model-based agent significantly outperforms the Q-agent. The Q-agent fails to learn a reasonable strategy using the given resources. Increasing the width of the window and changing the learning parameters did not help. The Q-agent managed to achieve results comparable to those achieved by the MB-agent in 400 stages only when the length of the game was increased to 40,000. Therefore, while the Q-agent can achieve results similar to those of the MB-agent, it requires significantly more resources (a factor of 100 in the above experiment).


4.4 The Contribution of Exploration to the Agent's Performance

In the following experiment we compare the main exploration methods studied in this work:

1. Boltzmann exploration: An undirected MB-agent with incorporated Boltzmann exploration using a temperature function T = 5·α^t.

2. Recency-based exploration: A directed MB-agent using recency-based exploration [97]. For every action r, it counts the number of stages that have passed since the last time r was taken, τ(s_j^t, r), where s_j^t is the current opponent model. It then computes an exploration bonus E_i(s_j^t, r) = √τ(s_j^t, r) and measures the utility of action r according to Equation 3.3.

3. Combined exploration: A strategy that uses a combination of the directed and undirected strategies: it uses the recency-based exploration strategy but decreases the exploration parameter with time, ε_t = 0.1·0.99^t.

4. Lookahead-based exploration: The ε-BR algorithm with a depth limit of 2, combined with a 2-lookahead learning algorithm.


Figure 4.5: The average cumulative reward of different exploration strategies while playing the IPD against 100 random automata of size 60.

The results for opponent automata of size 60 are shown in Figure 4.5. When the experiment was performed with automata of size 20, the directed and undirected methods showed comparable performance. However, when the size of the automata was increased to 60, the directed strategies outperformed the undirected one. The exploration behavior of the undirected strategy is sensitive only to the length of the history and not to its content; therefore, it does not modify its exploration behavior when playing against different opponents. The directed strategy, on the other hand, utilizes the content of the history to modify its exploration behavior. This is the reason why it outperforms the undirected method when the complexity of the opponent increases. The directed method, however, does not reduce its exploration tendency with time. The combined strategy enjoys the advantages of both methods and outperforms them both. The 2-lookahead-based exploration strategy significantly outperforms the other exploration methods. This advantage comes at the price of higher computational cost; due to this higher cost, we performed IPD games of length 200 instead of 400.


              TFT       Pavlov    Grim      TF2T      Nydegger
  MB-agent:   549:549   590:315   170:335   759:329   578:583

Table 4.2: The cumulative rewards attained by the MB-agent after 200 stages of the PD game against five different opponents.

4.5 Experiments against Non-random Opponents

In all the experiments described above, the MB-agent was tested against randomly generated opponents. In this experiment we test the MB-strategy against non-random opponents that were hand-crafted specifically for the IPD game. We repeated the famous tournament organized by Axelrod [6] for the iterated PD game. In the original tournament, fifteen attendees (strategies) competed in a round-robin tournament, where every interaction was based on 200 repetitions of the PD game. The participating strategies were sent as computer programs by different researchers around the world. Most programs were designed to basically cooperate; they differed mainly in the way they dealt with defection by their rivals. For this experiment we allowed only deterministic regular strategies to participate. Figure 4.6 shows the models and the best responses learned by the MB-agent for five different opponents after 200 stages of the PD game. Table 4.2 shows the cumulative rewards attained by the players. The MB-agent learned its opponents on-line during the 200 iterations using IT-US-L* and Boltzmann exploration, with the temperature function T = 50·0.999^t. The best-response strategies were computed according to the discounted-sum utility function with γ = 0.9.


Figure 4.6: The models learned by an MB-agent after 200 iterations of the PD game. The best-response cycles are highlighted.

Tit-for-tat (TFT): Cooperates at the first iteration and then repeats the opponent's previous action. The best response is all-c (for any γ ≥ 2/3). Only a few iterations are needed for the MB-agent to converge to the best response.


Pavlov: Also known as Win-stay, lose-shift. Pavlov changes its action in response to a failure. The best response is all-d, with an average utility of 3 for the player (and 1/2 point for Pavlov). Only a few iterations are needed for convergence.

Grim: Begins with cooperation but never forgives defection. The MB-agent succeeds in finding the right model for Grim. The best response against Grim is playing "all-c". However, the exploring agent falls into the "defection sink" and is not able to get back to the cooperation cycle. This problem is common to all exploring strategies. Grim is a typical example of a strategy that is problematic for on-line adaptive players.

Tit-for-two-tat (TF2T): Cooperates for the first two stages; defects only after two consecutive defections of its opponent. The best response is to defect and then to ask for forgiveness (cooperate). The MB-agent converges to the right model and to the best response after a few stage games.

Nydegger: Behaves like TFT at the beginning and then behaves according to the three previous joint actions of both players. The best response, all-c, was found after about 100 iterations. The model given in the figure was learned after 200 iterations. It encapsulates many features of Nydegger; for example, it cooperates after 3 mutual defections (see the paths 2-5-6-9, 4-5-6-9, 11-5-6-9).

The MB-agent succeeded very well against the strategies described above, except for Grim, where it failed. The adaptive strategy begins without any knowledge about the game and learns to cooperate with players like TFT and Nydegger and to exploit players like TF2T and Pavlov. To deal with Grim, a more sophisticated exploration strategy is needed, such as lookahead-based exploration.


Chapter 5

Model-based Interaction Strategies

for Alternating Markov Games

In this chapter we study Markov games, also known as stochastic games [92]. In these games, the results of the agents' interaction depend on the current state of the world. The agents' joint actions are followed by a game outcome and by a change of the world state. The game rules change repeatedly as a function of the current state of the world.

For repeated games we restricted the players' strategies to mappings from histories to actions. In Markov games, we restrict the players' strategies to mappings from states to actions. This Markovian restriction is based on the assumption that players do not distinguish between different histories that lead to the same state. For stationary opponents, Markov games become identical to Markov decision processes and an optimal interaction strategy can be found by dynamic programming. However, even simple games suffer from an extremely large state space, which makes the best-response problem impractical.

In Markov games, as in repeated games, the knowledge the agent lacks is its opponent's strategy. In the absence of such knowledge, a model-based strategy is needed. We consider a more restricted family of Markov games, called Alternating Markov games (AMG) [63], for which the agents alternate their actions. These games generalize two-player board games such as chess and checkers, problem domains that have dominated AI research ever since its early days.

Littman [63] studies two-player zero-sum AMG and describes minimax-Q, a reinforcement learning algorithm that converges to an optimal strategy for the game. The convergence proof is based on the convergence of Q-learning [104] and demands that each state-action pair be tried infinitely many times. Moreover, the algorithm is restricted by the assumption that the opponent uses exactly the same utility function as the player, with a negative sign. The zero-sum assumption also implicitly assumes that the opponent is a minimax player. In the following we generalize the framework to non-zero-sum AMG and allow the use of any decision procedure as an opponent model.

Section 5.1 formalizes alternating Markov games and presents them as a generalization of two-player board games. Section 5.2 describes model-based strategies for AMG and outlines related work dealing with model-based strategies in game playing. In Section 5.3 we describe the basic model-based search algorithm for AMG. Section 5.4 describes an algorithm for using uncertain opponent models. Finally, Section 5.5 concludes. Portions of this chapter have appeared earlier in [16].


5.1 Alternating Markov Games

In this section we describe two-player alternating Markov games, where deterministic control of the state transitions alternates between the agent and its opponent. This framework includes standard board games such as chess, checkers, and tic-tac-toe, but also captures more complex games in which rewards are issued throughout the interaction and do not necessarily sum to zero. Moreover, the identity of the player in control is part of the state description; hence, control does not necessarily change hands after every action.

In general, a two-player alternating Markov game is represented by a state graph in which agents alternate their activity, trying to achieve goal states and to accumulate rewards. It can be characterized as follows:

• A set of world states. Each state specifies the active agent who takes control; the other agent is passive at this state. In addition, each state specifies rewards for the agents. For zero-sum games, the rewards sum to zero. For non-zero-sum games the rewards are independent.

• An initial state, a special state in which the game begins (it also specifies the starting agent).

• A successor function that returns all states accessible for the active agent from a given state.¹

• A set of absorbing states where the game ends. For convenience we assume that any action executed from an absorbing state returns the game to the initial state. This convention allows us to deal with a sequence of alternating Markov games.

We assume that the agents' knowledge is as in the repeated-game framework. All the information regarding the state graph, including the successor function, the set of states and rewards, and the set of absorbing states, is common knowledge, while the players' preferences are private. Each agent has a private utility function, usually called an evaluation function, that formulates its preferences over states of the world.

It is easy to show that AMG generalize two-player board games. In chess, for example, the absorbing states are all states where one of the players captures the opponent's king, or states where the game cannot continue further. A zero reward is associated with any non-absorbing state and with draw states. Win positions are rewarded by +1 for the winner and by −1 for the loser. In general, the utility functions used by the agents estimate the probability of achieving a win state from the evaluated state. For non-zero-sum games, the evaluation functions should estimate the sum of rewards which can be achieved until the game ends.

Definition 13 A two-player alternating Markov game is a tuple ⟨S, s0, σ, A, r1, r2, u1, u2⟩, where S is a set of world states, s0 is the initial state, σ : S → 2^S is the successor function, A ⊆ S is a set of absorbing states, r1, r2 : S → ℝ are the reward functions over the states, and u1, u2 are the private evaluation functions of the agents.

Definition 14 A strategy for alternating Markov games, π_i : S → S, is a function that maps states to actions. A pair of strategies (π_1, π_2) defines a path – an infinite sequence of alternating moves while playing the game.

¹ Usually, Markov games refer to games with probabilistic state transitions. In this work we deal only with deterministic successor functions.


For convenience we assume that a strategy returns the state achieved by executing the chosen action. We also assume that the agents' objective is to maximize their cumulative rewards attained throughout the game.

5.2 Multi-model Search

Assume that you do not have a model of your opponent's strategy. How should you play? For a small state graph, an optimal strategy can be found by searching the entire game tree. At nodes where it is the player's turn to play (MAX nodes), the procedure chooses the successor with the maximal value. At nodes where it is the opponent's turn to play (MIN nodes), the procedure chooses the successor with the minimal value. This procedure, known as minimax search, finds the optimal move by backward induction: it traverses the entire game tree in a depth-first manner and finds an optimal path with the maximum sum of rewards.

In general, game trees are too big to be traversed entirely by computationally bounded agents. A common alternative is to search to a limited depth and to treat states at the horizon of the search as absorbing states, estimating their utility by an evaluation function.

The search procedure described above, known as the minimax algorithm [91], has served as the basic decision procedure for zero-sum games since the early days of computer science. Minimax's basic assumption is that the player has no knowledge about the opponent's decision procedure. Minimax therefore assumes that the opponent will select the alternative which is the worst from the player's point of view.

But what if the player does have some knowledge about its opponent? Such knowledge may be acquired through accumulated experience in playing against the opponent, or may be supplied by some external source. In chess, for example, simple book-learning methods may reveal important knowledge about the opponent's playing style, such as preferring bishops over knights, avoiding piece exchanges, etc.

How can such knowledge be exploited? Jansen [49] proposes speculative play, a playing strategy that speculates on non-optimal play by the opponent, and describes two situations where this approach can be beneficial. One is a swindle position, in which the player has reason to believe that the opponent will underestimate a good move and will therefore play a poorer move instead. The second is a trap position, in which the player expects the opponent to overestimate, and therefore play, a bad move. In a later work, Jansen [50] analyzes the KQKR chess endgame and shows that speculative play can have an advantage over traditional minimax play. Jansen [50] also proposed future research on "asymmetric search" algorithms that use two different evaluation functions, one belonging to the player and the other to its opponent. Korf [59] outlined a general direction for such an algorithm. Iida et al. [46, 47] describe algorithms that incorporate opponent models into adversary search. The opponent's decision procedure is assumed to be minimax with a different evaluation function.

In this chapter we introduce multi-model adversary search, which incorporates recursive opponent models, where the opponent model includes (recursively) a model of the player. Minimax can be viewed as a procedure that simulates the opponent's decision procedure to predict its move. The implicit opponent model used by minimax is a minimax procedure using the same evaluation function as the player. We generalize the minimax approach by allowing the use of any decision procedure as an opponent model. One possible candidate for such a model is minimax with an evaluation function that is different from the player's. A more general approach defines an opponent model to be a recursive structure consisting of a function of the opponent and the player model (held by the opponent).



Figure 5.1: The computation of the M value of state a. Squares represent states where it is the player's turn to play. Circles represent states where it is the opponent's turn to play. The opponent model φ is used to obtain the move selected by the opponent, while the static evaluation function is used to evaluate the resulting state.

M*, the main algorithm described in this chapter, is a generalization of minimax that utilizes this kind of recursive structure as an opponent model.

5.3 Incorporating the Opponent Model into the Search

We start the development of our framework by assuming that the player has access to an oracle that specifies the opponent's decision for each state of the game. Later, we will replace the oracle with specific algorithms.

Let S be a set of game states. Let σ : S → 2^S be the successor function. Let φ : S → S be an opponent model that specifies the opponent's selected move for each state. The M value of a position s ∈ S, given a depth limit d, a static evaluation function f : S → ℝ, and an arbitrary opponent model φ, is defined as follows:

    M(s, d, f, φ) =  f(s)                                      if d = 0
                     max_{s' ∈ σ(s)} f(s')                     if d = 1
                     max_{s' ∈ σ(s)} M(φ(s'), d−2, f, φ)       if d > 1

The M value of a position s is obtained by generating the successors of s and applying φ to each of them to obtain the opponent's response. The M value of each of the selected states is then evaluated (recursively), and the maximum value is returned. The M value of a position s with zero depth is defined to be its static value.

Figure 5.1 illustrates the computation of M. At each level of the tree associated with the opponent, the player exploits the opponent model, φ, to obtain the opponent's response. For node b, φ returns node d. The player then evaluates node b by calling M to evaluate node d. M evaluates node d by evaluating its two leaves using the evaluation function f, and returns the maximal value, 6. Therefore, node b is worth 6 for the player. Similar reasoning shows that node e, and hence node c, is worth 8 for the player. The M value of the root is 8 and the player selects the move leading to node c.
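To make the definition concrete, here is a minimal executable sketch (my own, not the thesis code); `successors` plays the role of σ, `f` is the static evaluation function, and `phi` is the opponent-model oracle:

    # Sketch of the M value defined above.
    def m_value(s, d, f, phi, successors):
        if d == 0:
            return f(s)
        if d == 1:
            return max(f(s1) for s1 in successors(s))
        # d > 1: ask the model phi for the opponent's reply to each of our moves,
        # then evaluate the resulting position recursively with depth d - 2.
        return max(m_value(phi(s1), d - 2, f, phi, successors)
                   for s1 in successors(s))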

Let MM(s, d, f) be the minimax value of a state s returned by a minimax search to depth d using the evaluation function f. Minimax can be defined as a special case of M, since it applies itself as an opponent model, using −f as the evaluation function and a depth limit smaller by one. We denote the minimax value of a state s by M^0_{⟨f⟩,d}(s):

    M^0_{⟨f⟩,d}(s) = M(s, d, f, M^0_{⟨−f⟩,d−1}) = MM(s, d, f).

A natural candidate to serve as the model φ for computing M is M^0 with a given static evaluation function f_0. We define the M^1 value of a position s to be:

    M^1_{⟨f_1,f_0⟩,d}(s) = M(s, d, f_1, M^0_{⟨f_0⟩,d−1}).

The M^1 value of a position s is computed by simulating the opponent's minimax search to depth d−1 to obtain its selected move, and evaluating the selected move recursively with depth d−2.


Figure 5.2: The computation of the M^1 value of position a with depth limit 3. Squares represent nodes where it is the player's turn to play. Circles represent nodes where it is the opponent's turn to play. Part (a) shows the two calls to minimax, using f0, for determining the opponent's choices. Note that the opponent is a maximizer. Part (b) shows the recursive calls of M^1, using f1, for evaluating the opponent's choices.

Figure 5.2 illustrates the evaluation of M^1 with depth limit 3. Part (a) shows the two calls to minimax to simulate the opponent's search. Note that the opponent is a maximizer. Part (b) of the figure shows the two recursive calls to M^1 applied to the states selected by the opponent. Note that while the opponent's minimax procedure assumes that in node e the player will select node k, the player actually prefers node j.

Finally, we can generalize and define the M^n value of a game state as follows:

    M^n_{⟨f_n,…,f_0⟩,d}(s) = M(s, d, f_n, M^{n−1}_{⟨f_{n−1},…,f_0⟩,d−1}).

Thus, M^1_{⟨f_1,f_0⟩,d} uses minimax, M^0_{⟨f_0⟩,d−1}, as an opponent model; M^2_{⟨f_2,f_1,f_0⟩,d} uses M^1_{⟨f_1,f_0⟩,d−1} as an opponent model, and so on.

5.3.1 The M* algorithm

In this section we describe the M* algorithm for computing the M^n value of a game state. One of the arguments of the M* algorithm is a structure called a player, which includes information about both the player's evaluation function and its opponent model.

Definition 15 A player is a pair defined as follows:

1. Given an evaluation function f_0, P^0 = (f_0, ⊥) is a zero-level player.

2. Given an evaluation function f_n and an (n−1)-level player O^{n−1}, P^n = (f_n, O^{n−1}) is an n-level player. O^{n−1} is called the opponent model.

Thus, a zero-level player, (f_0, ⊥), is one that does not model its opponent. A one-level player, (f_1, (f_0, ⊥)), is one that has a model of its opponent, but assumes that its opponent is a zero-level player. A two-level player, (f_2, (f_1, (f_0, ⊥))), is one that uses a function f_2 and has an opponent model, (f_1, (f_0, ⊥)). The opponent model uses a function f_1 and has a player model, (f_0, ⊥). The recursive definition of a player is in the spirit of the Recursive Modeling Method of Gmytrasiewicz, Durfee and Wehe [40].
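As a concrete illustration (a sketch with names of my own choosing, not the thesis implementation), the recursive player structure can be built from plain pairs, with None standing for ⊥:

    # A player is a pair (evaluation function, opponent model); None plays
    # the role of the "no model" symbol.
    def make_player(f, opponent_model=None):
        return (f, opponent_model)

    f0 = lambda s: 0.0    # placeholder evaluation functions
    f1 = lambda s: 0.0
    f2 = lambda s: 0.0

    P0 = make_player(f0)         # zero-level player: (f0, None)
    P1 = make_player(f1, P0)     # one-level player:  (f1, (f0, None))
    P2 = make_player(f2, P1)     # two-level player:  (f2, (f1, (f0, None)))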

The M* algorithm returns the M^n value of a game state. It receives a game position, a depth limit, and a player, and outputs the M^n value of the position and the move selected by the player. The algorithm generates the successors and simulates the opponent's search for each of them to determine its choice. The simulation is performed by applying the algorithm recursively with the opponent model as the player. The player then evaluates each of its moves by applying the algorithm recursively on each of the opponent's selections, using its own function.


Figure 5.3: The set of recursive calls generated by calling M*(a, 3, (f2, (f1, f0))). Each call is written next to the node it is called from. (a) The player (f2, (f1, f0)) calls its model (f1, f0), which calls its model of the player (f0). The moves selected by (f0) are highlighted. (b) The model (f1, f0) evaluates the states selected by (f0) using evaluation function f1. (c) The player evaluates the states selected by its model using f2.

Figure 5.3 shows an example of a game tree searched by M*(a, 3, (f2, (f1, f0)))². The recursive calls applied to each node are listed next to the node. The moves selected by the various models are highlighted.

The player, (f2, (f1, f0)), simulates its opponent's search from nodes b and c. The opponent model, (f1, f0), simulates the player by using its own model of the player, (f0), from nodes d and e. At node d the model of the player used by the opponent, (f0), selects node h (Figure 5.3(a)). The opponent then applies its f1 function to node h and concludes that node h, and therefore node d, are worth −6 (Figure 5.3(b)). The opponent then applies the player's model (f0) to node e, concludes that the player selects node j, applies its own function (f1) to node j, and decides that node j, and therefore node e, are worth −8. Therefore, the opponent model, when applied to node b, selects the move that leads to node d. The player then evaluates node d using its own criterion (f2). It applies M* to node d and concludes that node d, and therefore node b, are worth 8 (Figure 5.3(c)). Simulation of the opponent from node c yields the selection of node g. The player then evaluates g according to its own strategy and finds that it is worth 10 (the value of node n). Note that when the opponent simulates the player from node g, it wrongly assumes that the player selects node o. Therefore, the player selects the move that leads to c, with a value of 10. Using a regular minimax search with f2 would have resulted in selecting the move that leads to node b, with a value of 7. The formal listing of the M* algorithm is shown in Figure 5.4.

² We use (f2, (f1, f0)) as a shortcut for (f2, (f1, (f0, ⊥))).

Procedure M*(s, d, P^n)
    if d = 0 then return ⟨⊥, f_n(s)⟩
    else
        v^s_n ← −∞
        for each s' ∈ σ(s)
            if d = 1 then v^{s'}_n ← f_n(s')
            else
                ⟨s'', v^{s'}_{n−1}⟩ ← M*(s', d−1, O^{n−1})
                ⟨s''', v^{s'}_n⟩ ← M*(s'', d−2, P^n)
            if v^{s'}_n > v^s_n then ⟨b, v^s_n⟩ ← ⟨s', v^{s'}_n⟩
        return ⟨b, v^s_n⟩

Figure 5.4: The M* algorithm.

The M^n value of a state is well defined when the modeling level of the player, n, is smaller than the depth of search d. When n reduces to zero, we use M^0 (minimax) as an opponent model. However, note that M* does not contain such a call. To handle the case where n < d, we replace the zero-level player, (f0, ⊥), by the (d−n)-level player (f0, (−f0, (f0, …, ⊥))), which is equivalent to a minimax player.
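For readers who prefer executable code, the following sketch (mine, not the thesis listing; players are the (function, model) pairs illustrated earlier and `successors` is an assumed move generator) mirrors the two recursive calls of M* and expands a zero-level player into its minimax-equivalent form as just described:

    # Sketch of M*: returns (selected successor, M^n value) for the given player.
    def m_star(s, d, player, successors):
        f_n, opp = player
        if d == 0 or not successors(s):
            return None, f_n(s)
        if opp is None and d > 1:
            # Expand the zero-level player (f0, None) into (f0, (-f0, (f0, ...))),
            # which is equivalent to a minimax player.
            opp = (lambda x: -f_n(x), player)
        best_move, best_val = None, float("-inf")
        for s1 in successors(s):
            if d == 1:
                val = f_n(s1)
            else:
                s2, _ = m_star(s1, d - 1, opp, successors)          # opponent's reply
                if s2 is None:                                      # s1 had no successors
                    val = f_n(s1)
                else:
                    _, val = m_star(s2, d - 2, player, successors)  # our value of it
            if val > best_val:
                best_move, best_val = s1, val
        return best_move, best_val

With a zero-level player this should reduce to plain minimax; with higher-level players it follows the M^n recursion of the previous section.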

5.3.2 A one-pass version of M*

The M* algorithm performs multiple expansions of parts of the search tree. We have developed a directional version of the algorithm [75], called M*_1p, that expands the tree only once. The algorithm expands the search tree in the same manner as minimax; however, node values are propagated up the tree differently. Whereas minimax propagates only one value, M*_1p propagates n+1 values, (v_n, …, v_0). The value v_i represents the merit of the current node for the i-level model. M*_1p passes values associated with the player and values associated with the opponent in different manners. For a given node, we denote all the models associated with the player whose turn it is to play as active models, and all the models associated with the other player as passive models. For values associated with the active player, v_i receives the maximal v_i value among the node's children. For values associated with the passive player, v_i receives the v_i value of the child that gave the maximal value to v_{i−1}.
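The propagation rule can be written compactly as follows (an illustrative sketch, not the thesis listing; `active` is an assumed set of the model indices on the move at the current node, and each child vector is ordered ⟨v_n, …, v_0⟩):

    # Sketch of the M*_1p vector-propagation rule.
    def propagate(child_vectors, active, n):
        v = [float("-inf")] * (n + 1)          # v[k] holds the value of model (n - k)
        for vp in child_vectors:
            for j in active:                   # each active model j maximizes its own value ...
                k = n - j                      # position of model j in the vector
                if vp[k] > v[k]:
                    v[k] = vp[k]
                    if j < n:                  # ... and the passive model j+1 inherits the
                        v[k - 1] = vp[k - 1]   # value of the same child
        return v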


Figure 5.5: The value vectors propagated by M*_1p. This is the same tree as the one shown in Figure 5.3.

Figure 5.5 shows an example of a tree spanned by M*_1p. This is the same tree as the one shown in Figure 5.3. Let us look at node e to understand the way the algorithm works. Three players evaluate node e: the player, the opponent model, and the opponent's model of the player. The opponent's model of the player evaluates all the successors using evaluation function f0 and selects state j, with a value of 10. The opponent model knows that it has no effect on the decision taken at node e, since it is the player's turn to move. It therefore assigns node e the value of j, the state selected by the player model, using its own evaluation function f1 (f1(j) = −8). The player actually prefers node k, with the higher f2 value (f2(k) = 7). Thus, the vector propagated from node e is ⟨7, −8, 10⟩. Note that the values in the vectors correspond to the results of the recursive calls in Figure 5.3. Figure 5.6 lists the M*_1p algorithm.

5.3.3 Properties of M*

Correctness and complexity

The following theorem proves that both M* and M*_1p return the M^n value of a game state.

Theorem 8 Let P^n be an n-level player. Let ⟨b, v^s_n⟩ = M*(s, d, P^n) and let ⟨v_n, …, v_0⟩ = M*_1p(s, d, P^n). Then v^s_n = v_n = M^n_{(P^n,d)}(s).

Proof. M* implements the definition of M^n; therefore it is clear that it returns the M^n value of state s. We will show that M*_1p returns the same value as M* by induction on the depth d. For d = 0 and d = 1 the proof is immediate. Assume that M* and M*_1p return the same value for depth ≤ d − 1.


Procedure M*_1p(s, d, P^n)
    if d = 0 then return ⟨f_n(s), …, f_0(s)⟩
    else
        v ← ⟨−∞, …, −∞⟩
        for each s' ∈ σ(s)
            v' ← M*_1p(s', d−1, P^n)
            for each active model j
                if v'_j > v_j then
                    v_j ← v'_j
                    if j < n then v_{j+1} ← v'_{j+1}
        return ⟨v_n, …, v_0⟩

Figure 5.6: M*_1p: a version of the M* algorithm that performs only one pass over the search tree.

1. For each successor s', M* first determines the state s'' selected by the opponent: ⟨s'', v^{s'}_{n−1}⟩ = M*(s', d−1, O^{n−1}). By the induction hypothesis,

    v^{s'}_{n−1} = M*_1p(s', d−1, O^{n−1}) = v_{n−1}(s'),

where v_i(s) denotes the v_i value returned by M*_1p for position s.

2. v_{n−1}(s') = v^{s'}_{n−1} = v^{s''}_{n−1} = v_{n−1}(s'') (the last equality by induction).

3. M* determines the player's value of each successor s' by calling itself on s'': ⟨s''', v^{s'}_n⟩ = M*(s'', d−2, P^n). By the induction hypothesis,

    v^{s'}_n = M*_1p(s'', d−2, P^n) = v_n(s'').

4. M*_1p assigns to each successor s' the v_n value associated with the state with the maximal v_{n−1} value. Since s'' determines v_{n−1}(s'), v_n(s') = v_n(s'').

5. The value returned by M*(s, d, P^n) is

    v^s_n = max_{s' ∈ σ(s)} v^{s'}_n = max_{s' ∈ σ(s)} v_n(s'') = max_{s' ∈ σ(s)} v_n(s') = M*_1p(s, d, P^n).

It seems as though M*_1p should always be preferred over M* because it expands fewer nodes: M*_1p expands each node in the tree only once, while M* re-expands many nodes. However, while M*_1p performs fewer expansions than M*, it may perform more evaluations. For example, for the tree shown in Figure 5.3 and Figure 5.5, M* performs 9 expansions and 16 evaluations, while M*_1p performs 7 expansions and 24 evaluations. The following lemma counts the number of evaluations of terminal positions performed by M* and M*_1p while searching a uniform game tree.

Lemma 9
1. The number of evaluations performed by M* while searching a uniform tree with branching factor b and depth d is

    T(b, d) = (α_b^{d+1} − β_b^{d+1}) / √(b² + 4b),        (5.1)

where α_b = (b + √(b² + 4b))/2 and β_b = (b − √(b² + 4b))/2.³ The asymptotic branching factor of M* is b* = α_b. For large b, b* converges to b + 1, i.e., the number of evaluations can be approximated by (b + 1)^d.

2. The number of evaluations performed by M*_1p is⁴ (d + 1)·b^d.

Proof. The number of terminal positions examined by M* while searching a uniform tree can be established by the following recurrence:

    T(b, d) = 1                                   if d = 0
              b                                   if d = 1
              b · [T(b, d−1) + T(b, d−2)]         otherwise.

For d = 0 and d = 1 the proof is immediate. Assume its correctness for depths less than d. Then

    T(b, d) = b · [T(b, d−1) + T(b, d−2)]
            = b · [ (α_b^d − β_b^d)/√(b² + 4b) + (α_b^{d−1} − β_b^{d−1})/√(b² + 4b) ]
            = (b/√(b² + 4b)) · [ (α_b + 1)·α_b^{d−1} − (β_b + 1)·β_b^{d−1} ].

α_b and β_b are both solutions of the equation x² = b(x + 1); hence b(α_b + 1) = α_b² and b(β_b + 1) = β_b². Substituting these two equalities into the last expression completes the proof of the first statement.

For the second statement, it is easy to show by induction on d that α_b^{d−1} ≤ T(b, d) ≤ α_b^d. Therefore α_b^{(d−1)/d} ≤ T(b, d)^{1/d} ≤ α_b, and

    b* = lim_{d→∞} [ (α_b^{d+1} − β_b^{d+1}) / √(b² + 4b) ]^{1/d} = α_b.

The third statement follows from the inequalities

    b + 1 − 1/b ≤ b* = (b + √(b² + 4b))/2 ≤ b + 1.

For M*_1p, any leaf of the game tree is examined only once, but is evaluated d + 1 times.
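As a quick numerical sanity check of Equation 5.1 (my own snippet, not part of the thesis), the recurrence and the closed form can be compared directly:

    # Compare the recurrence for T(b, d) with the closed form (5.1).
    import math

    def T_rec(b, d):
        if d == 0:
            return 1
        if d == 1:
            return b
        return b * (T_rec(b, d - 1) + T_rec(b, d - 2))

    def T_closed(b, d):
        root = math.sqrt(b * b + 4 * b)
        alpha = (b + root) / 2
        beta = (b - root) / 2
        return (alpha ** (d + 1) - beta ** (d + 1)) / root

    for b, d in [(2, 6), (3, 5), (5, 4)]:
        print(b, d, T_rec(b, d), round(T_closed(b, d), 3))   # the two values agree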

The lemma implies that M* and M*_1p each has an advantage. If it is more important to reduce the number of node expansions than the number of evaluations, then one should use M*_1p; otherwise, M* should be used. For example, for d = 10, b = 30, and a one-level player, M* expands 1.3 times more nodes than M*_1p, but M*_1p performs 1.5 times more calls to evaluation functions. Note that when the set of evaluation functions consists of the same features (perhaps with different weights), the overhead of multiple evaluations is reduced significantly.

³ For b = 1 we get the Binet formula for the Fibonacci sequence.
⁴ When the modeling level n is less than d, the number of evaluations performed by M*_1p is (n + 1)·b^d (the last d − n evaluations are f_0(s) with alternating signs).


The relationship between M* and minimax

We have already mentioned that M* is a generalization of minimax. It is easy to see that when using a player P^n = (f, (−f, (f, (−f, …)))), M* returns the minimax value of a state s, i.e., M*(s, d, P^n) = MM(s, d, f). The following theorem shows that for an arbitrary player, M* always returns a value greater than or equal to the minimax value, when both algorithms search to the same depth and use the same player's function f_n.

Theorem 10 Let P^n = (f_n, O^{n−1}) be a player. Then MM(s, d, f_n) ≤ M*(s, d, P^n) for any opponent model O^{n−1}.

Proof. We will prove by induction on the depth of search that this property holds for every node in the tree traversed by the two algorithms. For convenience, we will prove it for M*_1p. For d = 0, for any player P^n = (f_n, O^{n−1}),

    MM(s, 0, f_n) = f_n(s) = M*_1p(s, 0, P^n).

Let us assume correctness for any depth k ≤ d. We will prove the theorem for depth d + 1. If s is a player's node:

    MM(s, d+1, f_n) = max_{s' ∈ σ(s)} {MM(s', d, f_n)}
                    ≤ max_{s' ∈ σ(s)} {M*_1p(s', d, P^n)}        (by induction)
                    = M*_1p(s, d+1, P^n).

If s is an opponent's node, let s'_m be the successor with the maximal value according to the opponent model. Then

    MM(s, d+1, f_n) = min_{s' ∈ σ(s)} {MM(s', d, f_n)} ≤ MM(s'_m, d, f_n)
                    ≤ M*_1p(s'_m, d, P^n)        (by induction)
                    = M*_1p(s, d+1, P^n).

Minimax, which assumes no knowledge about the opponent, takes a pessimistic approach to the opponent's choices. The above theorem implies that when the player possesses an opponent model that reliably predicts its choices, M* will lead the player to positions with higher static values according to its evaluation function. Whether such a course of the game also increases the playing ability depends on the quality of the player's evaluation function.

The relationship between M* and Max^n

The Max^n algorithm [66] also propagates vectors of values up the game tree. However, M*_1p and Max^n are targeted at different goals. Max^n is an extension of minimax that can handle n players, while M*_1p is an extension of two-player minimax that can handle n modeling levels. It is tempting to claim that M*_1p is a special case of Max^n where each model stands for one player, but such a claim is incorrect. In Max^n, at each level of the search tree, only one player has the active role of selecting a move; the other, passive, players pass on the values associated with the move selected by the active player. In M*_1p, every second model is associated with the active player; therefore, half of the models select their moves, and the passive players pass on the values associated with the selection of their models.


Max^n:  At node e, it is the turn of p0.
M*_1p:  At node e, it is the player's turn to play.

Max^n:  p0 uses function f0 to evaluate states.
M*_1p:  The opponent's model of the player, (f0, ⊥), uses function f0 to evaluate states.

Max^n:  p0 selects state j (with a value of 10).
M*_1p:  The opponent's model of the player selects state j (with a value of 10).

Max^n:  p1 evaluates node j using f1 (f1(j) = −8) and propagates this value to e.
M*_1p:  The opponent model, (f1, f0), evaluates node j using f1 (f1(j) = −8) and propagates the value to e.

Max^n:  p2 knows that it has no effect on the decision in e.
M*_1p:  The player knows that it is its turn to move in e; therefore it can decide what move will actually be taken.

Max^n:  p2 evaluates node j using f2 (f2(j) = 4) and propagates this value to e.
M*_1p:  The player evaluates both j and k using f2, decides that k is a better node and that it will actually select k, and propagates f2(k) = 7 to node e.

Max^n:  The vector propagated up is ⟨4, −8, 10⟩.
M*_1p:  The vector propagated up is ⟨7, −8, 10⟩.

Table 5.1: The difference in behavior between Max^n and M*_1p at node e in Figure 5.5. The Max^n entries specify the steps taken by Max^n, using 3 players: p2 with evaluation function f2, p1 with evaluation function f1, and p0 with evaluation function f0. The M*_1p entries show the steps taken by M*_1p using the player P^2 = (f2, (f1, f0)).

For this reason, the two algorithms propagate values up the tree differently. While all the values of a vector in Max^n come from the same leaf, the values of a vector in M*_1p may come from different leaves. To illustrate the difference, let us look at Figure 5.5. Table 5.1 shows the steps taken by each of the algorithms at node e. The Max^n entries specify the steps taken by Max^n, using 3 players: p2 with evaluation function f2, p1 with evaluation function f1, and p0 with evaluation function f0. The M*_1p entries show the steps taken by M*_1p using the player P^2 = (f2, (f1, f0)).

In spite of the difference between the two algorithms, it is easy to see that M* can simulate Max^n, for the case of a two-player search, by using the player (f1, (f2, (f1, (f2, …)))). A close look at the algorithm for multi-player search outlined by Korf [59] reveals that it is essentially identical to Max^n, where the models serve as the players in the Max^n algorithm. This approach allows only one model to make a decision at each level of the tree, and ignores the ability of all the models associated with the active player to make their own decisions.

5.4 Using Uncertain Models

In the previous sections we assumed that the player is certain about its opponent model. It is more realistic, however, to assume that the player is uncertain about its opponent model, especially when the model is acquired by learning. In this section we generalize M* to a new algorithm, M*_ε, that can handle uncertain models.

There are several possibilities for representing model uncertainty. In this work we assume that the player possesses an upper bound on the distance between the model's evaluation function and the actual opponent's function. An uncertain player is now defined as P^n_ε = ((f_n, ε_n), O^{n−1}_ε), where O^{n−1}_ε is also an uncertain player and ε_n is an upper bound on the distance between the n-level modeling function and the actual opponent's function. Each evaluation function that belongs to an uncertain player (the player's function, its model's function, its model's model's function, etc.) has an associated bound on its error. Since the highest-level player is certain about its own function, the error bound associated with the top-level function will be zero.

Thus, a one-level uncertain player P^1_ε = ((f_1, 0), (f_0, ε_0)) assumes that the opponent possesses a function f̂_0 that satisfies

    ∀x [ f_0(x) − ε_0 ≤ f̂_0(x) ≤ f_0(x) + ε_0 ].

We define the error bounds associated with the model's functions to apply to the range determined by the opponent's error bounds. For example, a player ((f_2, 0), ((f_1, ε_1), (f_0, ε_0))) assumes that its opponent is ((f̂_1, 0), (f̂_0, ε̂_0)), where

    ∀x [ f_1(x) − ε_1 ≤ f̂_1(x) ≤ f_1(x) + ε_1 ]
    ∀x [ f̂_0(x) − ε̂_0, f̂_0(x) + ε̂_0 ] ⊆ [ f_0(x) − ε_0, f_0(x) + ε_0 ].
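In code, an uncertain player could simply carry an error bound next to each function (an illustrative sketch with assumed names, not the thesis representation):

    # An uncertain player is a pair ((f, eps), opponent_model); the top-level
    # bound is zero because the player is certain of its own function.
    def uncertain_player(f, eps, opponent_model=None):
        return ((f, eps), opponent_model)

    f0 = lambda s: 0.0      # placeholder evaluation functions
    f1 = lambda s: 0.0

    P0_eps = uncertain_player(f0, 0.5)           # opponent model, known up to +/- 0.5
    P1_eps = uncertain_player(f1, 0.0, P0_eps)   # the player itself, error bound 0

    def leaf_range(player, s):
        (f, eps), _ = player
        return (f(s) - eps, f(s) + eps)          # interval used at the leaves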

The input of the M*_ε algorithm is the same as that of the M* algorithm, but with an uncertain player that fits our extended definition. The output, however, is different. Instead of returning a state and a value, the new algorithm returns a set of states and a range of values. Theorem 11 shows that for every state in the set of states returned by the algorithm, there is a certain player, associated with a set of functions satisfying the error constraints, for which M* would have returned a state and a value belonging to that range.

The algorithm generates all the successors and calls itself recursively for each successor, with the opponent model as the player, to determine the set of moves available to the opponent. For each state in such a set, the player calls itself recursively to determine the range of values that the state is worth to the player. Since the player is uncertain about the opponent's selection, it takes the union of these ranges as the range of values that the successor is worth. At this stage, the player has an associated range of values for each of its alternative moves. The lower bound of the range returned by the algorithm is the maximal minimum of all these ranges; this is because even with the worst possible set of functions satisfying the error constraints, the player is still guaranteed at least the maximal minimum. The upper bound is the maximal maximum of all the ranges, since none of the states can have a value which exceeds this maximum. The algorithm returns the set of all states that can possibly have a maximal value and therefore can be selected; these are all the states whose maximal value falls within the above computed range.

The algorithm returns a set of states and a range of values. In order to select a move, a player can employ various selection strategies. A natural candidate is maximin [65, pages 275-326], a strategy that selects the state with the maximal minimum.
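The two range-combination steps and the maximin selection described above can be sketched with a few helpers (again my own illustration, not the thesis code):

    # Interval bookkeeping used by M*_eps (sketch).
    def union_range(ranges):
        # Range of a successor: union of the ranges of the states the opponent
        # might select (minimal lower bound, maximal upper bound).
        return (min(lo for lo, hi in ranges), max(hi for lo, hi in ranges))

    def node_range(successor_ranges):
        # Range of the node: maximal minimum and maximal maximum of its successors.
        return (max(lo for lo, hi in successor_ranges),
                max(hi for lo, hi in successor_ranges))

    def maximin_choice(successor_ranges):
        # Cautious selection: pick the successor whose worst case is best.
        return max(range(len(successor_ranges)),
                   key=lambda i: successor_ranges[i][0])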

Figure 5.7 shows an example of two calls to M*_ε on the same search tree, one with ε = 1 and one with ε = 0.5. In the case of ε = 1, M*_ε selects the same move as minimax does. In the case of ε = 0.5, M*_ε selects the same move as M* does. Notice that, in the left tree of Figure 5.7, the opponent selects a move at node b. The player knows that the value the opponent assigns to node d is in the range [−1, 1], that the value for node e is in the range [7, 9], and that the value for f is in the range [9, 11]. There are no circumstances under which the opponent will select node d, since its maximum is smaller than the minimum of e and the minimum of f.



Figure 5.7: An example of the sequence of calls produced by M*_ε. The left figure shows the calls for the case of ε = 1, while the right one is for the case of ε = 0.5.

It is possible that node e will be selected (if f0(e) = 9 and f0(f) = 9). There is also a possibility that node f will be selected (if f0(f) > 9). Therefore, the player concludes that the opponent may select either node e or node f. The minimal possible value that can be assigned to node b by the opponent is 9; the maximal possible value is 11. Therefore, node b is worth a value in the range [9, 11] for the opponent. Node e is worth 0 for the player and node f is worth 5; node b, therefore, is worth a value in the range [0, 5] for the player. Similar reasoning shows that node c is worth [7, 9] to the opponent and, since only node h can be selected by the opponent, node c is worth [2, 2] for the player. A cautious decision procedure, such as maximin, would prefer node c, which is worth [2, 2], over node b, which is worth [0, 5]. Figure 5.8 shows the M*_ε algorithm.

Procedure M*_ε(s, d, P^n_ε)
    if d = 0 then return ⟨⊥, [f_n(s) − ε_n, f_n(s) + ε_n]⟩
    else
        for each s' ∈ σ(s)
            if d = 1 then s'_range ← [f_n(s') − ε_n, f_n(s') + ε_n]
            else
                ⟨s''_b, s''_range⟩ ← M*_ε(s', d−1, O^{n−1}_ε)
                for each b ∈ s''_b
                    ⟨s'''_b, s'''_range⟩ ← M*_ε(b, d−2, P^n_ε)
                    s'_ranges ← s'_ranges ∪ {s'''_range}
                s'_range ← [min(i), max(j)] over all [i, j] ∈ s'_ranges
            s_ranges ← s_ranges ∪ {s'_range}
        [s_min, s_max] ← [max(i), max(j)] over all [i, j] ∈ s_ranges
        s_b ← {s' ∈ σ(s) | s'_max ≥ s_min}
        return ⟨s_b, [s_min, s_max]⟩

Figure 5.8: The M*_ε algorithm.

There are other adversary-search algorithms that return a range of values as M*_ε does. However, these algorithms were designed for different purposes. The B* algorithm [9] returns a range of values due to uncertainty associated with the player's evaluation function. Nevertheless, unlike the M* algorithm, B* adopts the basic zero-sum assumption and propagates values in the same manner as minimax does.

It is easy to see that M*_ε is a generalization of M*: calling M*_ε with all error bounds equal to zero⁵ reduces it to M*. Furthermore, we can prove a stronger relationship between the two algorithms. The following theorem states that calling M* with a certain player whose functions fall within the error bounds of an uncertain player returns a value that is within the range returned by M*_ε, and a move that is a member of the set of moves returned by M*_ε for the uncertain player.

Theorem 11 Let P^n_ε be an uncertain player and let ⟨B, [i, j]⟩ = M*_ε(pos, d, P^n_ε). Let P^n_c be a certain player consisting of arbitrary functions that satisfy the error constraints of P^n_ε. Let ⟨b, v⟩ = M*(pos, d, P^n_c). Then v ∈ [i, j] and b ∈ B.

Proof. By induction on d, the depth of the search tree. For d = 0 and d = 1 the proof follows immediately. Assume that the theorem is true for k < d. According to the first inductive assumption, for any s ∈ σ(pos), M*(s, d−1, O^{n−1}_c) returns a state b that belongs to B, the group of states returned by M*_ε(s, d−1, O^{n−1}_ε). According to the second inductive assumption, for any b ∈ B, M*(b, d−2, P^n_c) returns a value v in [i, j], the range returned by M*_ε(b, d−2, P^n_ε). For any s ∈ σ(pos), M*_ε returns a range [a1, a2] such that a1 is the minimal value of all the ranges of states in B, and a2 is the maximal value of all these ranges. Therefore v, the M* value of the b that is associated with s, belongs to the range that M*_ε associates with s. Finally, M* returns the maximal value among the successor values; this value belongs to the range returned by M*_ε, and all states associated with this M* value will belong to the group of states returned by M*_ε.

It is interesting to note that if we call M*_ε with an uncertain player with infinite error bounds, and use the maximin selection strategy, we actually get a minimax search. Whenever M*_ε simulates the opponent's search, all successor states will be returned, since none of them can be excluded. The player will then compute the values for these states and will select the one with the maximal minimum value, exactly as minimax does. Thus, while in Section 5.3 we interpreted minimax as a player who assumes that its opponent uses its own function (with opposite sign), here we interpret minimax as a player who has no model of its opponent and is therefore totally uncertain about its reactions.

5.5 Summary

This chapter describes multi-model search, a model-based strategy for alternating Markov games that can incorporate opponent models into the decision procedure. An opponent model is a recursive structure consisting of an evaluation function and the player model held by the opponent. The M* algorithm simulates the opponent's search to determine its expected decision for the next move, and then evaluates the resulting state by recursively searching its associated subtree using the player's own strategy.

In our framework, the opponent model is represented by its evaluation function and by its player model. In previous work [17] we described extensions of M* that can handle a model of the opponent's depth of search. This extension is especially useful against weaker opponents, allowing, for example, the setting of traps. The opponent model can be extended to include other components related to the opponent's playing strategy, such as open-game and end-game strategies, time-allocation strategy, etc.

⁵ In fact, the version of M* obtained as a specialization of M*_ε has an advantage over the original M*. In the original M*, we did not handle the case where a node has successors with equal values. M*_ε called with zero error bounds will correctly assume the worst case for opponent nodes with more than one possible outcome, while M* would have unjustifiably returned the first.

One of the most important techniques in searching game trees is αβ pruning [57]. Pruning is impossible in the general case for multi-model search, since the zero-sum assumption does not hold. In the next chapter we shall explore the possibility of incorporating pruning techniques into M*. We will show that pruning is allowed given a bound on the sum of the player's and the model's utility functions.


Chapter 6

Adding Pruning to Multi-Model

Search

One of the most significant improvements to minimax is the αβ pruning technique, which can reduce the effective branching factor, b, of the tree traversed by the algorithm to approximately √b [57]. In the previous chapter we introduced multi-model search, a generalization of minimax that can incorporate recursive opponent models, where the opponent model includes (recursively) a model of the player. In this chapter we study the possibility of adding pruning to multi-model search. We prove sufficient conditions on the opponent model that enable pruning and describe two multi-model pruning algorithms using opponent models that satisfy these conditions. We prove correctness and optimality of the algorithms and provide an experimental study of their pruning power.

Section 6.1 describes simple pruning, a simple modification of M* that reduces the average branching factor significantly. Section 6.2 outlines a sufficient condition that enables pruning, and Sections 6.3 and 6.4 exploit this condition to develop pruning algorithms for multi-model search. In Section 6.5 we study the average-case performance of the algorithms on artificial game trees, and Section 6.6 concludes. Portions of this chapter have appeared earlier in [24].

6.1 Simple Pruning

For most practical cases the modeling level of the player will be smaller than the depth of the search tree. In such cases M* simulates many zero-level searches at inner nodes of the tree. We have already mentioned that a zero-level search is identical to a minimax search. An obvious improvement to M* is to simulate these searches by calling αβ. We call such a modification simple pruning. A similar pruning method for the one-pass version of one-level search was presented by Iida (β-pruning) [48].
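The idea can be sketched as follows (my own code, not the thesis listing): a standard αβ routine like the one below can be called wherever M* would otherwise simulate a zero-level (minimax) search.

    # Standard alpha-beta search used as a drop-in for zero-level simulations (sketch).
    def alphabeta(s, d, f, successors, alpha=float("-inf"), beta=float("inf"),
                  maximizing=True):
        if d == 0 or not successors(s):
            return f(s)
        if maximizing:
            val = float("-inf")
            for s1 in successors(s):
                val = max(val, alphabeta(s1, d - 1, f, successors, alpha, beta, False))
                alpha = max(alpha, val)
                if alpha >= beta:
                    break            # beta cutoff
            return val
        else:
            val = float("inf")
            for s1 in successors(s):
                val = min(val, alphabeta(s1, d - 1, f, successors, alpha, beta, True))
                beta = min(beta, val)
                if beta <= alpha:
                    break            # alpha cutoff
            return val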

Lemma 12 Assume that M* is modified to simulate a zero-level search by an αβ search. Let P^n be an n-level player. An upper bound on the number of evaluations performed by the modified algorithm while traversing a uniform tree with branching factor b and depth d > n is (b/(b−1))^n · b^d. A lower bound on this value is ⌊d/2⌋^n · (b^{⌊(d+n)/2⌋} + b^{⌈(d+n)/2⌉}).

Proof. Let F_n(b, d) be the number of evaluations performed by the modified M* on a uniform tree with depth d > n and branching factor b. Let F_0(b, d) be the number of positions evaluated by αβ on such a tree. F_1(b, d) can be established recursively:

    F_1(b, d) = 1                                    if d = 0
                b                                    if d = 1
                b · [F_0(b, d−1) + F_1(b, d−2)]      otherwise.        (6.1)

Upper and lower bounds for F_0(b, d) are b^d and b^{⌊d/2⌋} + b^{⌈d/2⌉} − 1, respectively [57]. Repeatedly substituting the upper bound in Equation 6.1, we get F_1(b, d) ≤ (b/(b−1))·b^d. An upper bound for F_n(b, d) can be derived inductively from the upper bound for F_{n−1}(b, d):

    F_n(b, d) ≤ (b/(b−1))^n · b^d.

For the lower bound, repeatedly substituting the lower bound for F_0(b, d) in Equation 6.1, we get

    F_1(b, d) ≥ ⌊d/2⌋ · (b^{⌊(d+1)/2⌋} + b^{⌈(d+1)/2⌉}) − (b/(b−1)) · (b^{⌊d/2⌋} − 1).

A lower bound for a greater modeling level can be approximated by ⌊d/2⌋^n · (b^{⌊(d+n)/2⌋} + b^{⌈(d+n)/2⌉}).

The upper bound for simple pruning, (b/(b−1))^n · b^d, is much lower than the number of evaluations performed by M*, which is approximately (b + 1)^d. Furthermore, the modified M* can achieve its lower bound by ordering the successors of the nodes in the zero-level searches according to the best-case order for αβ.

6.2 A Sufficient Condition for Pruning

The simple pruning algorithm prunes only zero-level searches. In this section we discuss the possibility of pruning by higher-level searches.

The αβ algorithm exploits the strong dependency between the value of a node for the player and its value for the opponent to avoid visiting branches that cannot change the minimax value of the tree. For example, when searching for the minimax value of the tree shown in the left part of Figure 6.1 using f1, after visiting node f the player concludes that the value of node c for the opponent is greater than −4. Therefore, it is worth at most 4 for the player. Since the value of node b for the player is 8, there is no point in further exploration of node c, and node g can be pruned.

M* cannot infer such constraints on the value of a node for the player based on the node's value for the opponent. For example, in Figure 6.1 (left), knowing that the opponent will have at least −5 for node c does not have any implications for the value of node c for the player. Actually, node g determines the M^1 value of node c and of the entire tree, and therefore pruning is prohibited.

A similar situation arises in Max^n. Luckhardt and Irani [66] conclude that pruning is impossible in Max^n without further restrictions on the players' evaluation functions. Korf [60] showed that shallow pruning for Max^n is possible if we assume an upper bound on the sum of the players' functions, and a lower bound on every player's function. We use similar restrictions to enable pruning in multi-model search.

The basic assumption used by the original αβ algorithm is that f1(s) + f0(s) = 0 for any game position (the zero-sum assumption). This equality holds for any terminal position in zero-sum games, and it seems to be true for non-terminal positions in zero-sum AMG. The non-zero-sum alternating Markov game framework assumes that terminal positions may be evaluated differently by the two players.


Figure 6.1: Left: An example of a search tree where αβ, using f1, would have pruned node g. However, such pruning would change the M^1 value of the tree for the player P^1 = (f1, (f0, ⊥)). V0 and V1 represent the M^0 and M^1 values of the inner nodes of the tree, respectively. Right: Pruning is possible given a bound on the sum of the functions, |f1 + f0| ≤ 2.

Figure 6.1 (left) shows an example where f1(g) + f0(g) = 15, an unlikely situation since the merit of node g is high for both players. We shall restrict the framework to bound-sum AMG. If we assume that the players' functions are indeed not too far apart, we can relax the zero-sum assumption to |f1(s) + f0(s)| ≤ B, where B ≥ 0 is any positive bound (the bound-sum assumption). In such a case, although the player's value is not the direct opposite of the opponent's value, we can infer a bound on the player's value based on the opponent's value and B.

Figure 6.1 (right) shows a search tree similar to the left tree in the figure, with one difference: every leaf l satisfies the bound constraint |f1(l) + f0(l)| ≤ 2. In this case, after searching node f, the player can infer that node c is worth at least −5 for the opponent and therefore at most 7 for the player. The player already has a value of 8 for node a. Thus, node g can be pruned.

For bound-sum AMG, the bound can be used in the context of the M* algorithm to determine a bound on f_i + f_{i−1} for any game position and therefore for any leaf of the tree. But to enable pruning at any node of the tree, we first need to determine bounds on the sum v_i + v_{i−1} for inner nodes, i.e., how these sum-bounds are propagated up the search tree.

Lemma 13 Let s be a node in a game tree and let s_1, …, s_k be its successors. If there exist non-negative bounds B_1, …, B_n such that for each successor s_j of s and for each model i, |M^i(s_j) + M^{i−1}(s_j)| ≤ B_i, then |M^i(s) + M^{i−1}(s)| ≤ B_i + 2B_{i−1}.

Proof. Assume M^i(s) = M^i(s_j) and M^{i−1}(s) = M^{i−1}(s_k). If j = k, M^i(s) and M^{i−1}(s) were propagated from the same successor; therefore |M^i(s) + M^{i−1}(s)| ≤ B_i ≤ B_i + 2B_{i−1}. If j ≠ k, s is a node where i is an active model and i−1 is a passive model. Therefore M^{i−1}(s) and M^{i−2}(s) were propagated from the same successor s_k. Since i−2 is an active model at s, M^{i−2}(s_j) ≤ M^{i−2}(s_k). Hence M^{i−1}(s_k) + M^{i−2}(s_j) ≤ M^{i−1}(s_k) + M^{i−2}(s_k) ≤ B_{i−1}. It is easy to show that for any successor s_j, |M^i(s_j) − M^{i−2}(s_j)| ≤ B_i + B_{i−1}. Summing up the two inequalities for s_j, we get M^i(s) + M^{i−1}(s) = M^i(s_j) + M^{i−1}(s_k) ≤ B_i + 2B_{i−1}.

For the other side of the inequality, M^i(s) + M^{i−1}(s) = M^i(s_j) + M^{i−1}(s_k) ≥ M^i(s_k) + M^{i−1}(s_k) ≥ −B_i ≥ −B_i − 2B_{i−1}.

According to the lemma, the sum-bounds for the nodes of the tree at depth d depend only on the sum-bounds for the nodes at depth d + 1. Let B^d_n be the bound on |M^n(s) + M^{n−1}(s)|, where s is a node at depth d. The bounds for any depth d and any model j can be computed recursively before initiating the search:

    B^d_j = B_j                              if d = 0
            B^{d−1}_j                        if j is passive at depth d
            B^{d−1}_j + 2·B^{d−1}_{j−1}      otherwise.        (6.2)
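Equation 6.2 can be precomputed before the search begins; a possible sketch (my own, with assumed conventions: `base[j]` holds B_j, `base[0]` is the implicit zero bound of the zero-level model's minimax opponent, and `is_active(j, d)` is an assumed predicate telling whether model j is active at depth d):

    # Precompute the sum-bound table B[d][j] of Equation 6.2 (sketch).
    def sum_bounds(base, max_depth, is_active):
        n = len(base) - 1                      # models are indexed 1..n; index 0 is implicit
        B = [[0.0] * (n + 1) for _ in range(max_depth + 1)]
        B[0] = list(base)                      # depth 0: the leaf bounds themselves
        for d in range(1, max_depth + 1):
            for j in range(1, n + 1):
                if is_active(j, d):
                    B[d][j] = B[d - 1][j] + 2 * B[d - 1][j - 1]
                else:
                    B[d][j] = B[d - 1][j]
        return B

With base[0] = 0, the bound of a one-level player stays constant with depth, as noted later in the text.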

Figure 6.2 illustrates the propagation of the sum-bounds up the tree. The active models are P^2 and P^0. P^0 (and therefore P^1) propagates its value from node e; P^2 propagates its value from node d. The theoretical sum-bound computed by Lemma 13 is B^1_2 = B^0_2 + 2B^0_1 = 3. Indeed, at node b, V2 + V1 = 2.9. To violate the sum-bound of 3 at node b, either V2 should be greater than 10.1 or V1 smaller than −7.0. But f2(d) > 10.1 implies f1(d) < −9.1 and f0(d) > 8.1; in such a case, P^0 (and therefore P^1) would have selected node d. Similar reasoning applies for f0(e) < −7.


Figure 6.2: An example of the propagation of the sum-bounds up the search tree. The active models are P^2 and P^0. P^0 (and therefore P^1) propagates its value from node e; P^2 propagates its value from node d. The theoretical sum-bound computed by the bound lemma is B^1_2 = B^0_2 + 2B^0_1 = 3. Indeed, at node b, V2 + V1 = 2.9.

Finding a bound on the sum of the evaluation functions is an easy task for most practical evaluation functions [59]. Unfortunately, the sum-bounds increase with the distance from the leaves and reduce the amount of pruning. Therefore, using loose bounds will probably prohibit pruning. Note, however, that for a one-level player the opponent model is minimax, with a sum-bound of zero. Hence, the sum-bound for a one-level player does not increase with depth.

6.3 αβ*: A Pruning Version of M*

Based on Lemma 13, we have developed an algorithm, αβ*, that searches only necessary branches of the tree, assuming given bounds on the sums of the M^n values of the tree nodes. αβ* takes as input a position s, a depth limit d, an n-level player P^n, a lower bound α, and an upper bound β, and returns the M^n value of s. The formal listing of αβ* is shown in Figure 6.3.

6.3.1 The αβ* algorithm

The αβ* algorithm works similarly to M*, but, in addition, it computes lower and upper cutoff bounds for the recursive calls and makes pruning decisions:

1. Let s be the current position. The first recursive call computes the opponent's selection for the successor s'. The upper bound for this call is B^{d−1}_n − α, since if v_{n−1}(s') ≥ B^{d−1}_n − α ≥ B^{d−1}_n − v_n(s), then v_n(s) ≥ B^{d−1}_n − v_{n−1}(s') ≥ v_n(s'), and s' cannot affect the M^n value of s. For the lower bound, according to Lemma 13, −B^{d−1}_n − v_n(s') ≤ v_{n−1}(s'). Since v_n(s') ≤ v_n(s) ≤ β, the lower bound for the call is −B^{d−1}_n − β.

2. The value returned by this call, v_{n−1}(s'), allows the player to update its lower bound α on the M^n value of s, since v_n(s) ≥ v_n(s') ≥ −B^{d−1}_n − v_{n−1}(s'). At this point, if α ≥ β, then the player's value for this branch is higher than its upper bound and there is no reason to search further.

3. The second recursive call computes the M^n value of the state selected by the opponent, s''. The bounds for this call can be computed similarly to the computation for the first call, using the opponent's value v_{n−1}(s') = v_{n−1}(s''). By Lemma 13, −B^{d−2}_n − v_{n−1}(s') ≤ v_n(s'') ≤ B^{d−2}_n − v_{n−1}(s'). In addition, node s'' inherits α and β from s. Since we are interested in the tightest bounds, the bounds for this call are α_2 = max(α, −B^{d−2}_n − v_{n−1}(s')) and β_2 = min(β, B^{d−2}_n − v_{n−1}(s')). Obviously, if α_2 ≥ β_2, we avoid the second recursive call.

4. The rest of the algorithm is identical to αβ. The value returned from the second call is used to update the lower bound α and, if α ≥ β, there is no reason to continue searching other successors of s.

Procedure αβ*(s, d, P^n, α, β)
    if d = 0 then return ⟨⊥, f_n(s)⟩
    else
        v_n ← α
        for each s' ∈ σ(s)
            ⟨s'', v_{n−1}⟩ ← αβ*(s', d−1, O^{n−1}, −B^{d−1}_n − β, B^{d−1}_n − α)
            α ← max(α, −B^{d−1}_n − v_{n−1})
            if α ≥ β then return ⟨s', α⟩
            if d = 1 then
                ⟨s''', v'_n⟩ ← ⟨s', f_n(s')⟩
            else
                α_2 ← max(α, −B^{d−2}_n − v_{n−1})
                β_2 ← min(β, B^{d−2}_n − v_{n−1})
                if α_2 < β_2 then
                    ⟨s''', v'_n⟩ ← αβ*(s'', d−2, P^n, α_2, β_2)
                else v'_n ← α_2
            if v'_n > v_n then
                ⟨b, v_n⟩ ← ⟨s', v'_n⟩
                α ← max(α, v_n)
                if α ≥ β then return ⟨b, α⟩
        return ⟨b, v_n⟩

Figure 6.3: The αβ* algorithm.

6.3.2 Correctness of αβ*


Figure 6.4: Three types of pruning applied by αβ*. The arrows indicate the nodes which affect the bounds for pruning. (Left) After computing the decision of the opponent for successor s'. (Middle) Before making the second recursive call. (Right) After computing the player's value for the opponent's decision.

Theorem 14 Let P^n be an n-level player. Let B_n, ..., B_1 be non-negative numbers such that for every game position s and n ≥ j ≥ 1, |f_j(s) + f_{j-1}(s)| ≤ B_j. Then

αβ*(s, d, P^n, −∞, +∞) = M*(s, d, P^n) = M^n_{(⟨f_n,...,f_0⟩, d)}(s).

Proof. αβ* works similarly to M* but, in addition, it updates cutoff values and makes pruning decisions. For proving that αβ* returns the M^n value, it is sufficient to show that any node pruned by αβ* can have no effect on the M^n value of its ancestors. There are three types of pruning applied by αβ*, demonstrated by Figure 6.4.

1. After computing the decision of the opponent for successor s', ⟨s'', v_{n-1}(s')⟩: In this case αβ* prunes when the lower bound exceeds the upper bound, β ≤ α = −B_n^{d-1} − v_{n-1}(s') ≤ −B_n^{d-2} − v_{n-1}(s'') ≤ v_n(s'') ≤ v_n(s). Since v_n(s) ≥ β = B_{n+1}^d − α_p, where α_p is the lower bound on the M^{n+1} value of p, the parent of node s, α_p ≥ B_{n+1}^d − v_n(s) ≥ v_{n+1}(s). Therefore, node s cannot affect the M^{n+1} value of its parent and can be pruned.

2. Before calling the second recursive call, i.e., after computing α_2 and β_2 and α_2 ≥ β_2. There are two possibilities:

(a) α_2 = −B_n^{d-2} − v_{n-1}(s') and β_2 = β. In this case v_n(s) ≥ v_n(s'') ≥ −B_n^{d-2} − v_{n-1}(s'') = −B_n^{d-2} − v_{n-1}(s') = α_2 ≥ β_2 = β, and as in the former case, v_n(s) ≥ β and s cannot affect its parent value.

(b) α_2 = α and β_2 = B_n^{d-2} − v_{n-1}(s'). In this case α = α_2 ≥ β_2 = B_n^{d-2} − v_{n-1}(s') = B_n^{d-2} − v_{n-1}(s'') ≥ v_n(s''). Since v_n(s) ≥ α ≥ v_n(s''), the value of s'' for the player is too low and cannot affect v_n(s).

3. After computing the player's value for the opponent's decision, ⟨s''', v_n(s'')⟩: In this case αβ* prunes if α = v_n(s'') ≥ β. As in the former case, v_n(s) ≥ v_n(s'') ≥ β and node s cannot affect the M^{n+1} value of its parent.

Since only nodes that cannot affect their ancestors are pruned, αβ* returns the M^n value of s.



We have already mentioned that M* reduces to minimax when the player uses its own function with a negative sign as a model for its opponent. This is identical to the case of using zero bounds. The following corollary establishes the behavior of αβ* when all bounds are zero.

Corollary 1 For zero sum-bounds, αβ* reduces to αβ.

Proof. When all the sum-bounds are zero, the n-level player is reduced to P^n = (f_n, (−f_n, (... ⊥)) ...). The recursive call of αβ* to simulate the opponent search is αβ*(s', d−1, O^{n-1}, −β, −α), exactly as αβ calls itself (in the NEG-MAX version of the algorithm [75]). The second recursive call will not take place since for any successor s', α_2 ≥ −v_{n-1}(s') ≥ β_2, and −v_{n-1}(s') will be returned as the player's value for this successor. The rest of the algorithm is identical to αβ.

The effectiveness of αβ* depends on the sum-bounds. For tighter sum-bounds, the player's functions become more similar to its model's functions (but with an opposite sign). The amount of pruning increases up to the point where they use the same functions, in which case αβ* prunes as much as αβ. For loose sum-bounds, the player's functions can be less similar to the opponent's functions and the amount of pruning decreases. For infinite sum-bounds, αβ* performs only simple pruning, since only the zero-level searches can still prune.

Knuth and Moore [57] have shown that for any uniform game-tree there is an order of search that guarantees that αβ performs b^{⌊d/2⌋} + b^{⌈d/2⌉} − 1 evaluations (the lower bound for computing the minimax value of a tree). Is there such a best order for αβ*? In the general case, the answer is negative. Assume that a player P^d traverses a tree T to depth d. Assume that for each model j, B_j > |max_{s∈leaves(T)} f_j(s) + max_{s∈leaves(T)} f_{j-1}(s)|. For such trees, αβ* must perform all evaluations done by M* since, at any intermediate stage of the search, there is a possibility for expanding a better leaf and therefore all branches must be searched.

When the modeling level of the player, P^n = (f_n, (..., f_0)), is smaller than the depth of search, d, we replace the n-level player by a d-level player, adding a tail of d−n functions with alternating signs, P^d = (f_n, (..., f_0, −f_0, ..., f_0)). In such a case, all the d−n deeper sum-bounds are zero and hence all these d−n deeper searches are reduced to αβ according to Corollary 1. Therefore, αβ* performs all the pruning done by the simple pruning algorithm when using the same player P^n. In addition, αβ* may prune more using the sum-bounds for the higher-level models. Thus, (⌊d/2⌋)^n (b^{⌊(d+n)/2⌋} + b^{⌈(d+n)/2⌉}), the best case for simple pruning, is an upper bound on the best-case performance of αβ*.

Knuth and Moore have also shown the optimality of αβ in the sense that any other directional algorithm that searches for the minimax value of a tree in the same order as αβ does must evaluate any leaf evaluated by αβ. M*, and its pruning version αβ*, are not directional algorithms. However, M*_1p is directional. In the following section we describe a pruning version of M*_1p and show its optimality.

6.4 αβ*_1p: A Pruning Version of M*_1p

In this section we describe αβ*_1p, an algorithm that incorporates pruning into M*_1p. αβ*_1p takes as input a position s, a depth limit d, an n-level player P^n, and for each model i, a lower bound α_i and an upper bound β_i. It returns the M^n value of s.



Procedure αβ*_1p(s, d, P^n, ⟨α_n, ..., α_0⟩, ⟨β_n, ..., β_0⟩)
  if d = 0 then return ⟨f_n(s), ..., f_0(s)⟩
  else
    v ← ⟨−∞, ..., −∞⟩
    α'_n ← α_n
    β'_n ← B_n^d − α_{n-1}
    for each 0 ≤ j < n
      if α_{j+1} < β_{j+1} then
        α'_j ← −B_{j+1}^d − β_{j+1}
        β'_j ← B_{j+1}^d − α_{j+1}
      else
        α'_j ← α_{j+1}
        β'_j ← β_{j+1}
    for each s' ∈ σ(s)
      v' ← αβ*_1p(s', d−1, P^n, ⟨α'_n, ..., α'_0⟩, ⟨β'_n, ..., β'_0⟩)
      for each active model j
        if v'_j > v_j then
          v_j ← v'_j
          if j < n then v_{j+1} ← v'_{j+1}
        α'_j ← max(α'_j, v_j)
      if for every active model n > j ≥ d, α'_j ≥ β'_j then
        return ⟨v_n, ..., v_d⟩
    return ⟨v_n, ..., v_d⟩

Figure 6.5: The αβ*_1p algorithm

6.4.1 The αβ*_1p algorithm

M*_1p works similarly to M*, but executes the searches of all the models in parallel. αβ*_1p works like M*_1p, but in addition, it updates cutoff values and makes pruning decisions based on the agreement of all the models. Any active model has the right to veto a pruning decision. The αβ*_1p algorithm is listed in Figure 6.5.

1. At first, the algorithm computes new vectors of cutoff values, α', β', from the cutoff values inherited from its parent (see the sketch after this list). If model j+1 has decided to prune at the parent node, i.e., α_{j+1} ≥ β_{j+1}, it propagates these cutoff values to model j. Otherwise, the algorithm uses the same mechanism as in αβ*: α_j = −B_{j+1}^d − β_{j+1}, and β_j = B_{j+1}^d − α_{j+1}.

2. The vector of values returned by the recursive call is used for updating the lower cutoff vector α'. Every active model j updates its lower bound if v'_j is greater than its current α'_j.

3. At this point the algorithm makes its pruning decision. Every active model j that has a modeler (i.e., j < n) decides to prune by testing whether its value cannot change the modeler's selection in the parent node, i.e., α_j ≥ β_j. If all the active models agree to prune, there is no reason to continue searching other successors of s and the algorithm can return.
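The step that differs most from αβ* is the first one, the derivation of the child cutoff vectors. The following short Python sketch illustrates that step in isolation; it is one reading of step 1 of Figure 6.5 rather than a complete implementation, and the vectors alpha and beta and the bound table B[j][d] (the sum-bound for models j and j-1 at depth d) are assumed inputs indexed by model number.

def child_cutoffs(alpha, beta, B, n, d):
    # Derive the cutoff vectors passed down to a successor, following step 1
    # of the alpha-beta*-1p listing (a sketch under the assumptions above).
    a_child = [None] * (n + 1)
    b_child = [None] * (n + 1)
    a_child[n] = alpha[n]
    b_child[n] = B[n][d] - alpha[n - 1]
    for j in range(n - 1, -1, -1):
        if alpha[j + 1] < beta[j + 1]:
            # Model j+1 is still undecided: translate its window via the sum-bound.
            a_child[j] = -B[j + 1][d] - beta[j + 1]
            b_child[j] = B[j + 1][d] - alpha[j + 1]
        else:
            # Model j+1 has already agreed to prune: propagate its closed window.
            a_child[j] = alpha[j + 1]
            b_child[j] = beta[j + 1]
    return a_child, b_child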



Figure 6.6: An example of shallow pruning (left) and deep pruning (right) performed by αβ*_1p. (In the example, |f_1 + f_0| ≤ 2 and, in the right tree, |f_3 + f_2| ≤ 3.)

In Figure 6.6 (left), the player already has a value of 8 for node a. Therefore, it sets an upper limit on the value of node c for the opponent: β_0 = B_1^1 − α_1 = 2 − 8 = −6. The opponent obtains a value of −5, which is greater than its upper bound. Therefore, node g is pruned.

Figure 6.6 (right) shows pruning performed at a deeper level of the tree. The player P^0 is ready to prune node j as in the left part of the figure. However, to stop the search, the player P^2 must also agree to prune. P^2 indeed decides to prune, using its α_2 which was inherited from its great-grandparent.

6.4.2 Correctness of αβ*_1p

The following theorem shows that αβ*_1p returns the M^n value of the tree.

Theorem 15 Let P^n be an n-level player. Let B_n, ..., B_1 be non-negative numbers such that for every game position s and n ≥ j ≥ 1, |f_j(s) + f_{j-1}(s)| ≤ B_j. Then,

αβ*_1p(s, d, P^n, ⟨−∞, ..., −∞⟩, ⟨+∞, ..., +∞⟩) = M*_1p(s, d, P^n) = M^n_{(⟨f_n,...,f_0⟩, d)}(s).

Proof. Since the only difference between αβ*_1p and M*_1p is that the first does not examine certain nodes, it is sufficient to show that any node pruned by αβ*_1p can have no effect on the M^n values of its ancestors. Therefore, it cannot change the M^n value of the tree.

The algorithm handles lower and upper bounds for every model. These cutoff bounds are inherited from the upper and lower bounds at the parent node. If model j+1 has decided to prune at the parent node, i.e., α_{j+1} ≥ β_{j+1}, it propagates these cutoff values to model j. Otherwise, the algorithm uses the same mechanism as in αβ*: α_j = −B_{j+1}^d − β_{j+1}, and β_j = B_{j+1}^d − α_{j+1}. In addition, the lower bounds for all active models are maximized whenever the search returns the values of one of the successors.

Assume that s is at depth d and all active models agree to prune. Looking at active model j, there are two possibilities, shown in Figure 6.7:



Figure 6.7: The two possibilities when model j agrees to prune. (Left) At the beginning of the search at node s, α_j < β_j, and after evaluating one of the successors, α_j is modified to become bigger than β_j. (Right) At the beginning of the search at node s, α_j ≥ β_j.

1. At the beginning of the search at node s, α_j < β_j, and after evaluating one of the successors, α_j is modified to become bigger than β_j. The left part of Figure 6.7 demonstrates the relevant cutoffs at this point of the search. We will show that node s cannot affect the M^{j+1} value of its parent p. If v_j(s) ≥ α_j ≥ β_j = B_{j+1}^d − α^p_{j+1}, then v_{j+1}(p) ≥ α^p_{j+1} ≥ B_{j+1}^d − v_j(s) ≥ v_{j+1}(s) and node s cannot affect the M^{j+1} value of p.

2. At the beginning of the search at node s, α_j ≥ β_j. Hence α^p_{j+1} ≥ β^p_{j+1} at the parent node p. We can continue climbing through the path from node s to the root until we reach the first node k in which α_{j+l} ≥ β_{j+l} and for its parent node α_{j+l+1} < β_{j+l+1} (the existence of such a node is guaranteed since otherwise the search would not have reached node s). The right part of Figure 6.7 shows the tree and the relevant cutoff values. Note that model j+l decides to prune at node k after evaluating one of its successors, but the search has continued and reached node s since other models have not made their decision yet. From the point of view of model j+l, all searches of all lower models are redundant and node s cannot affect the M^{j+l} value of node k. Therefore node s can be pruned from the point of view of model j.

Since all active models agree to prune, node s cannot affect the root value and can be pruned.

We have already mentioned that αβ* reduces to αβ when all the sum-bounds are zero. The same phenomenon occurs for αβ*_1p. When all the sum-bounds are zero, the n-level player is reduced to P^n = (f_n, (−f_n, ..., ⊥) ...). The algorithm performs in parallel n identical αβ searches, since all lower bounds are equal and so are all the upper bounds. αβ*_1p handles the cutoff values exactly as αβ handles its own in the NEG-MAX version. Therefore, αβ*_1p evaluates exactly the same positions as αβ does.

Corollary 2 For zero sum-bounds, αβ*_1p reduces to αβ.



6.4.3 Optimality of αβ*_1p

Can we do better than that? The next theorem shows the optimality of αβ*_1p in the sense that any other directional algorithm that searches for the M^n value of a game-tree must examine any leaf examined by αβ*_1p.

Theorem 16 Every directional algorithm that computes the M^n value of a game tree with positive sum-bounds must evaluate every terminal node evaluated by αβ*_1p under the same ordering of the search.

Proof. To prove this theorem we use a method similar to the one used by Korf [60] for showing the optimality of αβ pruning in a multi-player search. Assume that A is a directional algorithm for computing M^n, and k is a leaf node in the search-tree T, such that k is examined by αβ*_1p and not examined by A when both algorithms search for the M^n value of T under the same order of search. Consider the state of αβ*_1p just before evaluating node k. It consists of lower and upper bounds, α_j, β_j, for each model j, and since node k is not pruned, there is an active model i at the parent of node k such that α_i < β_i. Let us construct a new game-tree, T', by removing all the branches found to the right of the path from the root to k, making node k the last frontier right leaf of T'. Figure 6.8 illustrates the search-tree and the relevant bounds. The core of the proof is based on the claim that whenever an active model j vetoes a pruning decision at a given node, the search under this node must be continued. We will show that by a careful assignment of values to node k it can determine, or affect, the M^n value of T' and therefore cannot be pruned by any directional algorithm.

Figure 6.8: An illustration for the proof of the optimality of αβ*_1p. Every directional algorithm must examine node k. Each of the two alternative assignments affects the value of g differently.

Let d be the depth of node k. To make the proof more comprehensible, we first prove the theorem for shallow trees. For d = 2, if node k is not pruned by αβ*_1p, then α_{n-1} < β_{n-1} at node s. We shall assign α_{n-1} < v_{n-1}(k) < β_{n-1} and v_n(k) = B_n^{d-2} − v_{n-1}(k) to node k. This assignment propagates up in the tree, v_{n-1}(s) ← v_{n-1}(k) and v_n(s) ← v_n(k). It is easy to show that v_n(k) > α_n at the root p. Since α_n at node p is the maximal M^n value of all the subtrees searched so far, node k determines the M^n value of the root and must be evaluated by any search algorithm.



For d = 3, if node k is not pruned, then α_{n-2} < β_{n-2} at node s. In this case we will show two different assignments to node k that affect the selection of model n−1 at node p and therefore affect the M^n value of the root. First, assign α_{n-2} < v_{n-2}(k) < β_{n-2} and v_{n-1}(k) = B_{n-1}^{d-3} − v_{n-2}(k) to node k. A similar analysis to the previous case shows that node k determines the M^{n-1} value of node p, v_{n-1}(p) ← v_{n-1}(k). Second, assign v_{n-2}(k) > β_{n-2} and v_{n-1}(k) = B_{n-1}^{d-3} − v_{n-2}(k). In this case node s_i determines the M^{n-1} value of node p. It follows that the selection of model n−1 at node p cannot be determined without evaluating node k. The M^n value of node p is determined according to the selection of model n−1, and since α_{n-1} < β_{n-1} at node p, it has a direct effect on the M^n value of the root and node k cannot be pruned.

Similar arguments hold for deeper nodes in the tree. Let j be the highest model with α_j < β_j at node s. If j = n−1, we have already shown that node k can determine the M^n value of its grandparent. Since this value is bigger than α_n, node k determines the M^n value of the tree and cannot be pruned. If j = n−2, node k can determine the selection of model n−1 at its grandparent and therefore the M^n value of its great-grandparent. This value cannot be determined without evaluating node k. The influence of node k climbs through the path up to the root and therefore node k cannot be pruned.

If j < n−2, we will use the same two different assignments for node k to affect the decision of model j+2 at node g and therefore to affect the M^{j+3} value of node g. For the first assignment, let α_j < v_j(k) < β_j and v_{j+1}(k) = B_{j+1}^d − v_j(k). In this case, α_{j+1} < v_{j+1}(k) at node p, and therefore node s determines the M^{j+2} value of node p. For the second assignment, let v_j(k) > β_j and v_{j+1}(k) = B_{j+1}^d − v_j(k). In this case node s_i determines the M^{j+2} value of node p. The decision of model j+2 at node g is determined according to v_{j+2}(p), which depends on node k and therefore affects v_{j+3}(g). The chain of influence of node k on the decisions of higher models continues climbing in the tree up to the root. Therefore, node k must be evaluated by any search algorithm.

Since algorithm A searches the game-trees T and T' exactly the same, A must also prune node k in T', in contradiction to the fact that node k can determine or affect the M^n value of T'. Therefore, node k must be examined by any directional algorithm.

6.5 Average Case Performance - Experimental Study

We conducted a set of experiments to measure the average performance of the pruning algorithms on random uniform trees. To build a random uniform tree for a given branching factor b, depth d, modeling level n, and a sum-bound B, we assign a vector of n+1 random values to every leaf s such that for any model j, |v_j(s) + v_{j-1}(s)| ≤ B. (We use the same bound for all models.) The values are randomly drawn from a uniform distribution over a given interval [−P, P] ([−10000, 10000] for the following experiments). The dependent variable used for the experiments is the effective branching factor (EBF), the d-th root of L, where L is the number of leaves evaluated. We look at four independent variables: the sum-bound B (to make the numbers more meaningful we show it as a percentage of P), the depth d, the branching factor b and the modeling level n.

In each of the following experiments, we measured the EBF of αβ* and αβ*_1p as a function of the independent variables by running each algorithm on 100 random trees and averaging the results.
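The leaf-generation rule and the EBF measure can be summarized by the short Python sketch below. It is an illustration only: the thesis states the constraint |v_j + v_{j-1}| ≤ B and the interval [−P, P], while the particular sampling order (draw v_0 first, then constrain each successive value) is an assumption.

import random

def random_leaf(n, B, P=10000):
    # Draw a leaf vector <v_n, ..., v_0> with every |v_j + v_{j-1}| <= B
    # and every value inside [-P, P] (the sampling scheme is an assumption).
    values = [random.uniform(-P, P)]          # v_0
    for _ in range(n):
        prev = values[-1]
        lo, hi = max(-P, -B - prev), min(P, B - prev)
        values.append(random.uniform(lo, hi))
    return list(reversed(values))             # <v_n, ..., v_0>

def effective_branching_factor(leaves_evaluated, depth):
    # EBF = d-th root of the number of evaluated leaves.
    return leaves_evaluated ** (1.0 / depth)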

Figure 6.9 shows the average EBF of both algorithms as a function of the sum-bound for various modeling levels. The branching factor is 4 and the depth is 10. For ML = 0, both algorithms traverse the trees exactly like αβ and the EBF does not depend on the sum-bound. For non-zero levels, the EBF increases with the increase of the sum-bound and the increase of the modeling level.



Figure 6.9: The average EBF of αβ* (left) and αβ*_1p (right) as a function of the sum-bound.

Note that due to simple pruning, the average EBF of both algorithms converges to numbers smaller than the branching factors of the non-pruning versions (by Lemma 9, b* = 4.83 for M* and b* = 4 for M*_1p). For infinite sum-bounds (100%), αβ* performs only simple pruning and the EBF converges to 3.25 for ML = 1 and 3.6 for ML = 2, which is much less than the theoretical upper bounds according to Lemma 12, ((4/3) · 4^10)^{1/10} = 4.1 and ((4/3)^2 · 4^10)^{1/10} = 4.24, respectively.
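The two closed-form bounds are easy to check with plain Python:

print(((4/3) * 4**10) ** 0.1)       # about 4.117 (quoted as 4.1)
print(((4/3)**2 * 4**10) ** 0.1)    # about 4.237 (quoted as 4.24)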

Note also that αβ* prunes much more than αβ*_1p. We hypothesize that this phenomenon is a result of αβ* being a non-directional algorithm. αβ*_1p simulates the searches of all the models in parallel. Therefore, pruning decisions must be agreed by consensus. αβ* performs the searches serially, allowing more elaborate pruning due to the information gathered in previous searches.

Figure 6.10: The average EBF of αβ* (left) and αβ*_1p (right) as a function of the search depth.

Figure 6.10 shows the average EBF of both algorithms as a function of the search depth for various sum-bounds. The modeling level is 1 and the branching factor is 2. For a zero sum-bound, both algorithms reduce to αβ. For a non-zero sum-bound, αβ has an advantage over αβ*. The graphs indicate that this gap increases with depth for αβ*_1p, but slightly decreases for αβ*. Note that the slopes of the graphs emphasize the better performance of αβ* over αβ*_1p for deeper trees.

Figure 6.11 shows the average EBF of both algorithms as a function of the branching factor for various modeling levels. The sum-bound is 5% and the depth is 6. The results emphasize the better performance of αβ* over αβ*_1p. The graphs show that the gap between αβ and αβ* slowly increases with the branching factor.



Figure 6.11: The average EBF of αβ* (left) and αβ*_1p (right) as a function of the branching factor.

Simulations with artificial game trees can only be regarded as approximations to the performance of the algorithms in real-life game trees [77]. The next chapter considers practical issues regarding multi-model search in real game-playing programs.

6.6 Summary

Opponent modeling search requires pruning methods to become effective. Korf [58] raised doubts about whether pruning is possible in opponent modeling search. Iida [48] describes the β-pruning algorithm for one-level search. This method is restricted to the pruning performed by the αβ simulation of the opponent. The simple pruning algorithm described in Section 6.1 is a generalization of this technique to multi-model search.

This chapter deals with the possibility of pruning in multi-model search. We show that pruning is impossible without further restrictions on the player's and the opponent's evaluation functions. We prove a sufficient condition for pruning and present two algorithms that utilize this condition for pruning. We prove correctness and optimality and study the complexity of the algorithms.

The pruning power of the algorithms depends on the sum-bounds. We conducted a set of experiments in artificial random trees. The results indicate that for small sum-bounds, the pruning achieved by αβ* is significant and close to that achieved by αβ.

The multi-model framework allows the use of arbitrary opponent models. This work points out that pruning is significantly reduced for a model which is radically different from the player's strategy. However, for two-player zero-sum board games, we do not expect the players to evaluate boards in a radically different way. In such cases, the pruning power of αβ* is significant.



Chapter 7

Practical Issues Regarding Multi-model Search

There are several issues that should be considered when trying to apply the theoretical multi-model framework to practical game-playing programs. The most important issue is the acquisition of the opponent model. If the playing style of the opponent is known in advance, an appropriate evaluation function that models this style can be constructed. In chess, for example, if it is known that the opponent plays defensively, the user will increase the weight of the "king defense" feature in the model's evaluation function. If past games of the opponent are available, such a process can be automated using book-learning techniques [25, 103, 86]. Section 7.2 deals with this issue in more detail.

Another issue that should be considered is the acquisition of the bounds on the absolute sum of the player's functions. We assume that the player's and the model's evaluation functions are given and we look for a bound on their absolute sum for any game position. Many practical applications use evaluation functions that are linear combinations of a given set of board terms, (a_1, ..., a_k), such as material advantage, mobility, center control, etc.:

f(b) = Σ_{j=1}^{k} w_j a_j.

When the player's and the model's evaluation functions are weighted sums over the same set of terms, it is possible to analytically compute a non-trivial bound on their sum. The sum of two such functions, f_1(b) and f_0(b), is:

f_1(b) + f_0(b) = Σ_{j=1}^{k} (w_1^j + w_0^j) a_j.

The difference between the two functions is reflected by different weights for some of the board terms. Terms that are weighted symmetrically by the two functions do not affect the sum, since w_1^j + w_0^j = 0. For two non-symmetric functions (that do not sum to zero), at least one term is not weighted symmetrically, i.e., w_1^j + w_0^j ≠ 0.

For example, let us look at two simple evaluation functions for chess, consisting only of material advantage. The two functions use the common weights for most pieces: 9 for a queen, 5 for a rook and 1 for a pawn. They only differ in the weights for the bishop and the knight. f_1 weights bishops by 3.1 and knights by 2.9. f_0 does the opposite: 2.9 for bishops and 3.1 for knights. The absolute



sum of these functions is

|f_1(b) + f_0(b)| = |(3.1 − 2.9) bishop-advantage + (2.9 − 3.1) knight-advantage| = 0.2 |bishop-advantage − knight-advantage|.

Since the maximum material advantage for both terms is 2, a sum-bound for these functions can be obtained by 0.2(2 + 2) = 0.8.
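For two linear functions over the same terms this computation is mechanical: |f_1 + f_0| ≤ Σ_j |w_1^j + w_0^j| · max_b |a_j(b)|. The short Python sketch below reproduces the 0.8 figure for the bishop/knight example; the weight dictionaries and per-term maxima are illustrative values taken from the example, with f_0's weights written in f_1's sign convention.

def linear_sum_bound(w1, w0, max_abs_term):
    # Analytic bound on |f1(b) + f0(b)| for two linear evaluation functions
    # over the same board terms.
    return sum(abs(w1[t] + w0[t]) * max_abs_term[t] for t in w1)

# Only the two asymmetric terms contribute; symmetric terms cancel.
w1 = {"bishop-advantage": 3.1, "knight-advantage": 2.9}
w0 = {"bishop-advantage": -2.9, "knight-advantage": -3.1}   # f0, in f1's sign convention
max_abs_term = {"bishop-advantage": 2, "knight-advantage": 2}
print(linear_sum_bound(w1, w0, max_abs_term))   # 0.8 (up to floating-point rounding)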

Figure 7.1: An example of a practical application of αβ*. The player's function prefers bishops over knights and weights them by 3.1 and 2.9 respectively. The model's function prefers knights over bishops and weights them by 3.1 and 2.9 respectively. The sum of the two functions is bounded by 0.8. In the left tree, αβ* prunes node g just as αβ would, since node b is worth at most −0.2 for the player. In the right tree, αβ with f_1 would prune node g. αβ* would not prune this node, since node b has an upper bound of 0.8. For example, if the move leading to node g is a knight capture, the model will prefer node g (with a value of 0.2). In such a case node g, and therefore node c, is worth 0.2 also for the player and determines the M^n value of the tree.

Figure 7.1 shows two examples of αβ* searches using the above functions, f_1 and f_0. In the left tree, the move leading to node f is a pawn capture. αβ* prunes node g just as αβ does, since node c is worth at least +1 for the model and therefore at most −0.2 for the player. This example demonstrates that if the two functions are not radically different, αβ* can prune in spite of the non-zero-sum property of its functions. Note that if the move leading to node f had not been a capture move, c would be worth at least 0 to the model and at most 0.8 for the player, hence αβ* would not prune node g while αβ would.

In the right tree, the move leading to node c involves a black bishop capture and the move leading to node f is a white bishop capture. αβ with f_1 prunes node g. αβ* does not prune this node, since node c has an upper bound of 0.8. Node c could be worth more than 0 for the player if, for example, the move leading to node g is a knight capture. In this case, the value of node g for the model would be −2.9(+1) − 3.1(−1) = 0.2. Therefore, the model would prefer node g (with a value of 0.2) over node f (with a value of 0). Surprisingly, the value of node g for the player would also be 3.1(+1) + 2.9(−1) = 0.2. Therefore, node g determines the M^n value of the tree and cannot be pruned. Note that position g is an example of a non-terminal position where the zero-sum assumption does not hold: both players prefer exchanging a white knight for a black bishop over a bishop exchange.

The above discussion is appropriate for cases where the two functions use the same terms with different weights. In other cases, if we do not have an analytical way of determining a bound on the sum of the functions, we can use sampling methods. Given two functions, we can perform a large number of simulated games and collect statistics about the sum of the two functions for boards searched during these games. These statistics can be used for estimating a bound on the sum of the functions at any desired confidence level.



7.1 Multi-model Search in the Checkers Domain

To test the applicability of the multi-model framework to real domains, we conducted some experiments with αβ* in the checkers domain. The first experiment tests the amount of pruning of αβ* while searching checkers game trees. The functions f_1 and f_0, used by the player and the model, are both weighted sums of six terms taken from Samuel's function [85] (material advantage, mobility, center control, etc.), and a seventh term, Total, which counts the number of pieces on the board. The two functions weight the six terms symmetrically (w_1^j = −w_0^j, j = 1, ..., 6). The seventh term, Total, is given the same negative weight by both functions. That means that both functions prefer piece exchange whenever possible.

The bound on the sum of the player's and the model's functions is:

|f_1(b) + f_0(b)| = |w_1^7 Total(b) + w_0^7 Total(b)| ≤ |w_1^7 + w_0^7| Total(b).

The board at the root of the game tree, r, can be used for computing the sum-bound, since Total(r) is an upper bound on the number of pieces at the leaves of the tree. Moreover, since we use a one-level player in this experiment, the sum-bound does not increase with depth.

Figure 7.2 shows the average EBF of αβ* and αβ*_1p, compared to αβ, as a function of the search depth in the domain of checkers. The results were obtained by counting the number of evaluations per move performed by the algorithms in a tournament of 100 games between them and an αβ player. αβ* and αβ*_1p use the player P^1 = (f_1, (f_0, ⊥)); αβ uses the function f_0.

Figure 7.2: The average EBF of αβ* and αβ*_1p as a function of the search depth in the checkers domain.

Note that the performances of both algorithms are similar to their performances for random trees. αβ* and αβ*_1p prune less than αβ, but the average EBF decreases significantly with the search depth.

The last results raise an interesting question. Assume that we allocate a modeling player and a non-modeling opponent the same search resources. Is the benefit achieved by modeling enough to overcome the extra depth that the non-modeling player can search due to better pruning?

To answer this question we tested an iterative-deepening version of αβ* against a real checkers program based on iterative-deepening αβ. Both programs were allocated the same amount of search resources per move. We measured the search resources by the number of leaf evaluations available for a move. Table 7.1 shows the results obtained by αβ*, with the player P^1 = (f_1, (f_0, ⊥)), against αβ, with the function f_0, for various values of the resource limit (leaf evaluations per move). Each row represents the results of a tournament of 1,000 games.



Evaluations-per-move   Wins   Draws   Losses   Points   αβ* depth   αβ depth
100                    192    646     162      1.030    2.84        3.21
200                    287    535     178      1.109    3.24        3.90
300                    263    567     170      1.093    3.64        4.24
400                    261    566     173      1.088    3.92        4.43
500                    210    603     187      1.023    4.11        4.53

Table 7.1: The results obtained by an iterative-deepening version of αβ* when played against an iterative-deepening version of αβ. Both algorithms were allocated the same search resources (leaf evaluations per move). Each row represents a tournament of 1,000 games. The last two columns show the average search depth of the two algorithms.
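The Points column is consistent with the usual 2/1/0 scoring averaged over the 1,000 games of each tournament (an assumption about the scoring convention, but one that every row reproduces), e.g. (2 · 192 + 646)/1000 = 1.030 for the first row. A one-line Python check:

rows = [(192, 646, 162), (287, 535, 178), (263, 567, 170), (261, 566, 173), (210, 603, 187)]
print([(2 * wins + draws) / 1000 for wins, draws, losses in rows])
# [1.03, 1.109, 1.093, 1.088, 1.023]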

We can see that in all cases αβ* searched shallower than αβ due to its reduced pruning power. However, the deeper search of αβ was outweighed by the benefit of modeling. When playing against an αβ player using f_0, αβ*, using the player P^1 = (f_1, (f_0, ⊥)), uses the exact opponent function as a model, in contrast to αβ, which wrongly uses −f_0 as an opponent model. As a result, while the player prefers exchange positions, αβ wrongly assumes that it prefers to avoid them.

The reader should note that these results are for a specific game with specific evaluation functions. The tradeoff between the benefit of modeling and the cost of reduced pruning remains open for further research.

7.2 Learning an Opponent Model in Zero-sum Board Games

In this section we describe an algorithm for learning an opponent model and demonstrate how M*, combined with the learning algorithm, can be incorporated into a real game-playing program. The learning algorithm assumes that the opponent is a minimax player, and therefore it is only suitable for a one-level player. Portions of this section have appeared earlier in [15].

7.2.1 The learning algorithm

In this section we assume that the learner is exposed to a source of examples of the opponent's behavior. Such a source can be realized on-line while playing against the opponent, by observing the opponent's games with others, or by analyzing past games of the opponent that have been published in the literature. (In chess, for example, it is common to make all the games played in important tournaments available to the public.)

An example is a pair ⟨b, φ(b)⟩, where b is a game board and φ(b) is the move chosen by the opponent. The algorithm gets a set of examples E, a vector of board features h and a depth d, and returns a coefficient vector w. The goal of the algorithm is to find an evaluation function f_0(b) = w · h(b) = Σ_i w_i h_i(b), such that the moves selected by a minimax player searching to depth d and using f_0 will be consistent (as much as possible) with the opponent's example moves.

Assume that the alternative moves for board b are b_1, ..., b_n and assume that the opponent selected, w.l.o.g., b_1. This indicates that the opponent prefers b_1 over the other alternatives. Assuming that the opponent decision procedure φ can be approximated by a polynomial w · h, the above set of preferences induces a set of constraints over the coefficient vector w: {w · (h(b_1) − h(b_i)) ≥ 0 | i = 2, ..., n}. The set of constraints for E could then be solved to induce a coefficient vector w. This method was used by van der Meulen [103] for book learning.



However, assuming that the opponent is a minimax player, its decisions are based on searching to deeper levels. Therefore, instead of using b_1, ..., b_n, we use the dominant boards, the boards that determine the minimax values of b_1, ..., b_n. Hsu et al. [45] and Schaeffer et al. [89] used dominant boards with linear regression for book learning, as developed by Christensen and Korf [25].

The algorithm works iteratively by increasing the depth j up to d. For j = 1 the algorithm infers a vector w_1 by solving constraints as described above. For j > 1, the algorithm starts with w_j = w_{j-1} and performs a hill-climbing search, improving w_j until no further significant improvement can be achieved. In each step of the search, the algorithm generates a set of constraints over the set of examples E. For each example ⟨b, φ(b)⟩ ∈ E, the algorithm performs a minimax search to depth j−1, starting from each of the successors of the example board, using f_0 = w_j · h. Each successor is now associated with the dominant board that determines its minimax value. Let b̄_i be the dominant board associated with b_i. The algorithm generates the following n−1 constraints:

{w · (h(b̄_1) − h(b̄_i)) ≥ 0 | i = 2, ..., n}.

Let C be the union of the sets of constraints of all the examples. The next stage consists of solving the system of inequalities C, i.e., finding w that satisfies C. The method we used is a variation of the linear programming method used by Duda and Hart [29] for pattern recognition. Before the algorithm starts its iterations, it sets aside a portion of the examples for progress monitoring. This set is not available to the procedure that builds the constraints. After solving C, the algorithm tests the solution vector by measuring its accuracy in predicting the opponent's moves for the test examples. The performance of the new vector is compared with that of the current vector. If there is no significant improvement, we assume that the current vector is the best that can be found for the current depth, and the algorithm repeats the process for the next depth using the current vector for initialization. The learning procedure is listed in Figure 7.3.

We tested the learning algorithm in the checkers domain by the following experiment. Twoplayers were used as opponents, using two variations of Samuel's evaluation function [85], searchingto depth 8. The players played against each other until 1600 examples were generated and givento the learning algorithm. The algorithm was also given a set of ten features, including the sixfeatures actually used by the functions.

The algorithm was run with a depth limit of 8. The examples were divided by the algorithm into a training set and a testing set of size 800. For each of the 8 depth values, the program performed 2-3 iterations before moving to the next depth. In each iteration we solved a set of several thousand constraints by using linear programming.¹

The results of the experiment for the two functions are shown in Figure 7.4. The graph shows the success rate of the learned models in predicting the opponent's moves over the test set of examples. The algorithm succeeded in both cases, achieving an accuracy of 92% and 100%.

7.2.2 OLSY: An Opponent Learning System

In order to test how well the M* algorithm and the learning algorithm can be integrated, we have built a game-playing program, OLSY, that is able to acquire and maintain a model of its opponent and use it to its advantage. The system consists of two main components: a game-playing program based on M*_1p and a learning program based on the LearnModel procedure. The learning program accumulates the opponent's moves and, after every k moves (10 in our experiments), updates the opponent model to agree with the new examples.

¹We used the very efficient lp_solve program, written by M. R. C. M. Berkelaar, for solving the constraint system.



Procedure LearnModel(E, h, d)
  T ← a subset of E
  E ← E − T
  w_0 ← 1
  for j from 1 to d
    w_j ← w_{j-1}
    Repeat
      w_old ← w_j
      w_j ← FindSolution(E, w_j, h, j)
    Until |Score(w_j, T) − Score(w_old, T)| < ε
  return w_d

FindSolution(E, w_j, h, j)
  C ← ∅
  for each ⟨b, φ(b)⟩ ∈ E
    for each b_i ∈ σ(b)
      b̄_i ← minimax(b_i, j−1, w_j · h)
      C ← C ∪ {w · (h(b̄_{φ(b)}) − h(b̄_i)) ≥ 0}
  return w that satisfies C

Figure 7.3: LearnModel: An algorithm for learning an opponent model from examples.
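The last line of FindSolution, finding a w that satisfies C, is a plain linear-programming feasibility problem. The sketch below illustrates it with scipy.optimize.linprog instead of the lp_solve program used in the thesis; the matrix D, whose rows are the feature differences h(b̄_1) − h(b̄_i), is assumed to be given, and the small margin eps and the box bounds on w are assumptions that make the weak inequalities solvable in practice.

import numpy as np
from scipy.optimize import linprog

def find_weight_vector(D, eps=1e-3, w_max=100.0):
    # Find w with D @ w >= eps for every constraint row (a sketch, not the
    # thesis's lp_solve-based implementation).
    m, k = D.shape
    # linprog minimizes c @ w subject to A_ub @ w <= b_ub; we only need
    # feasibility, so the objective is zero and the constraints are negated.
    result = linprog(c=np.zeros(k),
                     A_ub=-D, b_ub=-eps * np.ones(m),
                     bounds=[(-w_max, w_max)] * k,
                     method="highs")
    return result.x if result.success else None

In practice the monitoring split and the per-depth iteration of LearnModel would wrap around this call.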

The OLSY system was tested by letting it play a sequence of checkers games against minimax players (all programs used pruning, but this did not affect the results since all algorithms search to the same depth and no time limit was given). OLSY and its two opponents each used a different function but with a roughly equivalent playing ability. They all searched to depth 6 and they all used variations of Samuel's static evaluation function. We stopped the games several times to test how well OLSY performs against its opponents. The test was conducted by turning off the learning mechanism and performing a tournament of 100 games between the two.

Figure 7.5 shows the results of this experiment. The players start with almost equivalent ability. However, after only a few examples, the learning program becomes significantly stronger than its non-learning opponents. The performance of OLSY kept increasing until about 50 examples were processed. After that, the learning curve leveled off. We also tested the accuracy of the opponent models as determined by the learning procedure. After 100 examples, OLSY succeeded in acquiring models for its two opponents with a quality of 95% for the first and 98% for the second. Learning more examples did not cause any further modification of the models.

The learning algorithm described in this chapter is only suitable for one-level models. It is based on the assumption that the opponent's decision procedure is a minimax search. The problem of learning deeper-level models is still open and deserves further research.



Figure 7.4: Learning opponent models from examples (prediction success as a function of search depth, for the two opponent functions f1 and f2).

Figure 7.5: The performance of the learning system as a function of the number of examples, measured by average points per game.



Chapter 8

Conclusions

This chapter summarizes the research by briefly reviewing the main issues, discusses some advantages and limitations of the suggested approach, and mentions some directions for future research.

8.1 Summary

This work presents a model-based approach for learning interaction strategies in multi-agent systems. In this approach, the model-based agent learns models of the other agents in order to predict their future behavior from their past behavior. The model-based strategy lets the agent adapt to the others during interaction, and reduces the number of interaction examples needed for adaptation.

Interactions among agents are modeled by the game-theoretic concept of a repeated game. We present an architecture for a model-based on-line learner. The learner accumulates the history of the repeated game. The history is given to a learning procedure which generates a consistent model. The model is then given to a procedure that infers a best-response strategy, which is used to decide on the action that should be taken. This process is repeated throughout the game.

Inferring best-response strategies is computationally hard. Therefore, a computationally bounded learner should limit the class of strategies available for representing its opponent model. This work considers regular opponents, i.e., opponents whose strategies can be represented by a finite-state machine, for which the best-response strategy can be efficiently computed. Learning a minimal DFA model without a teacher was proved to be hard. We present an unsupervised algorithm, US-L*, based on Angluin's L* algorithm. The algorithm efficiently maintains a model consistent with its past examples. When a new counterexample arrives, it tries to extend the model in a minimal fashion.

An online learner that uses its current best-response strategy to select an action may leave some aspects of the opponent's strategy unexplored. This can lead to a sub-optimal interaction strategy. There are many exploration strategies, developed in the reinforcement-learning paradigm, for exploring an unknown domain to be learned. In Chapter 3 we showed how to incorporate these methods into model-based learning.

One problem with the existing exploration methods is that they do not take into consideration the risk involved in exploration. An exploratory action taken by the agent tests unfamiliar aspects of the other agent's behavior, which can yield a better model of the other agent. However, such an action also carries the risk of putting the agent into a much worse position. We have presented the lookahead-based exploration strategy that evaluates actions according to their expected utility, to their expected contribution to the acquired knowledge, and to the risk they carry. Instead of holding one model, the agent maintains a mixed opponent model that reflects its uncertainty about the opponent's strategy. Every action is evaluated according to its long-run contribution to the expected utility, and to the knowledge regarding the opponent's strategy, expressed by the posterior probabilities over the set of models. Risky actions are detected by considering their expected outcome according to the alternative models of the opponent's behavior.

We conducted a set of experiments where the adaptive model-based agent played against randomly generated opponents. In these experiments, adaptive agents performed significantly better than non-adaptive agents, and exploring adaptive agents performed better than non-exploring adaptive agents. We also compared a model-based adaptive agent with a Q-agent. Given the same number of examples, the MB-agent performed significantly better than the Q-agent. This result supports our claim that it is more effective to learn a model and use it for designing a best-response strategy than to try to directly learn such a strategy. We also showed that the MB adaptive learner, without any background knowledge, outperforms hand-crafted opponents in the IPD domain.

The second part of the work describes opponent modeling in Alternating Markov games. We present multi-model adversary search, a generalized version of minimax that can incorporate opponent models into the search. An opponent model is a recursive structure consisting of an evaluation function and the player model held by the opponent. The M* algorithm simulates the opponent's search to determine its expected decision for the next move and then evaluates the resulting state by recursively searching its associated subtree using the player's own strategy.

One of the reasons for the tremendous success of minimax is the αβ pruning technique. Multi-model search requires similar pruning methods to become competitive. Chapter 6 presents a general framework for pruning in multi-model search. We prove a sufficient condition for pruning based on a bound on the sum of the player's and the model's evaluation functions. This condition is similar to the one used by Korf [60] for multi-player search. We then present two algorithms that utilize this restriction for pruning and study the complexity of the algorithms. The pruning power of the algorithms depends on the sum-bounds. We conducted a set of experiments in artificial random trees. The results indicate that for small sum-bounds, the pruning achieved by αβ* is significant and close to that achieved by αβ.

There are several other issues that should be considered when trying to apply our theoretical framework to practical game-playing programs. The most important issue is the acquisition of the opponent model. In Chapter 7 we present a learning algorithm that infers an opponent model based on examples of its past decisions. The algorithm generates a large set of inequalities, expressing the opponent's preference of its selected moves over their alternatives, and uses linear programming to solve this set of inequalities. The current version of the learning algorithm is only useful for a one-level player. It is an interesting research problem to generalize it for learning recursive models. However, we expect that a one-level model will be sufficient for most practical purposes.

Another issue is the effect of modeling error on the performance of M*. In previous work [17] we demonstrated how an error in the model can significantly deteriorate performance. Regression-based book-learning algorithms can supply, in addition to the function approximation, an error estimation as well. When such an error estimation is supplied by the learning program or by the user, the M*_ε algorithm can be used. The algorithm converges to M* when the error bound approaches zero, and converges to minimax when the bound goes to infinity.

Experiments performed in the domain of checkers demonstrated the advantage of M* over minimax. αβ*, the pruning version of M*, is more restricted than αβ in its pruning decisions. However, even when the two algorithms were allocated equivalent search resources, αβ* was able to overcome its reduced depth of search by exploiting the opponent model. Is this result specific to our experimental domain, or is it a general phenomenon? Under what conditions does the benefit of modeling outweigh the cost of reduced pruning? We do not have a conclusive answer to these questions. We hypothesize that the answer depends on the particular domain and the particular evaluation functions used.

8.2 Discussion

Planning an effective strategy for handling interactions with other agents is extremely difficult due to ignorance about the other agents' behavior. A common approach to dealing with this ignorance is to assume the worst case: the maximin decision rule [65], for example, or the analogous minimax algorithm used by game-playing programs.

The worst-case assumption about the opponent's strategy is a secure playing style. It assumes that the opponent always chooses the worst alternative, from the agent's point of view, and insures the agent against unexpected risks: the agent can only be surprised for the better. However, the worst-case assumption misses opportunities, especially in situations where the expected outcome of a given interaction is preferred by both agents. Non-zero-sum games provide a simple example of this kind of interaction.

The maximin decision rule, or the minimax algorithm, is surely not optimal when information about the opponent's strategy is given in advance. A rational behavior is to exploit the given knowledge to maximize the expected profit. This work provides two examples of how this kind of knowledge can be exploited for better decision-making.

Finding the optimal response against a given opponent model is intractable in the general case. However, even if the best-response problem is too difficult, the opponent model can still be used for simulation. The agent can simulate different responses against the given model in order to choose a proper one. This method is similar to the one used by the Dyna architecture for reinforcement learning [97].

The main role of the opponent model is to predict the opponent's behavior in the future. Choosing a proper class of opponent models is essential for the success of the model-based strategy. If the model class is too restricted, it will probably fail in prediction. On the other hand, too general a class will make the best-response problem and the learning problem intractable. Often, there are many ways to model a given behavior. Therefore, general background information about the opponent, or its type, is important for discriminating among different possible model classes.

This work describes two extensions of simple opponent models. In the repeated-game framework we extended the opponent models to be represented by mixed strategies. This extension lets the agent better deal with its uncertainty. In the multi-model search framework, the opponent model is represented by a recursive structure that includes an evaluation function and the player model held by the opponent. In previous work [16] we described an extended opponent model that, in addition to the opponent's function, also models the opponent's depth of search. This extension is especially useful against weaker opponents, allowing the player to set traps, for example. The opponent model can be further extended to include other components related to the opponent's playing strategy, such as opening-game and end-game strategies, a time-allocation strategy, etc.

Another question concerning the model-based strategy is the acquisition of the opponent model. Learning an opponent model is similar to learning from a teacher. By using past interaction examples, an agent can adapt its opponent model, treating the opponent's past actions as classified examples of its strategy. Khardon [55] describes a framework in which the agent learns to act by inferring a strategy consistent with a set of examples supplied by a teacher. In the 'Learning to act' framework, strategies are represented by production rules. Production rules, as well as other representation schemes, can be incorporated into our model-based framework (replacing the DFA), provided that an efficient learning algorithm and a best-response procedure are given for the representation.

One of the basic assumptions of our framework is that the opponent is stationary, i.e., it does not modify its strategy through the course of the interaction. Learning a non-stationary opponent is a manifestation of adaptation to a changing environment. In general, the question of learning a non-stationary opponent is difficult and remains open for future research. However, when the opponent does not change its strategy too often, learning algorithms can still be applied by using a simple window over the input rather than the complete history, and by increasing the exploration rate of the algorithm.

The experimental results of the model-based strategy demonstrate average-case performance. In the worst case, Fortnow and Whang [34] show that for any learning algorithm there is an adversary DFA for which the learning process will converge to the best response only after at least an exponential number of stages. One way of dealing with this complexity problem is to limit the space of automata available to the opponent [35, 80, 70]. Another alternative is to limit the kind of interaction [34]. For simple games such as Penny-matching, there is no need for a complete opponent model. The best-response strategy can be inferred based on the DFA learning method of Rivest and Schapire [79]. Such learning methods embark on long exploration sequences during the course of the game. The high cost of such exploration sequences diminishes in infinite games for the limit-of-the-mean utility function, which measures only asymptotic performance. However, these methods may fail for the discounted-sum utility, which also takes immediate rewards into account. The exploration methods described in Chapter 3 take the cost of exploration into account and are therefore suitable for such utility functions as well.

Finally, the research described in this work has uncovered a number of interesting directions for further research. The next section mentions some of them:

8.3 Future Work

Dealing with non-stationary opponents: One of the implicit assumptions of the model-based strategy is that the opponent is stationary, i.e., it does not change its strategy during the course of the game. An interesting class of non-stationary opponents is the class of adaptive opponents. For example, in the repeated-game domain, we might assume that the opponent is itself a model-based agent (i.e., the opponent assumes that the player is a stationary DFA). In such a case, the player can infer that the opponent's strategy is a best response to an automaton consistent with the current history. This can be further extended to a general framework of recursive modeling, similar to the one described in Chapter 5, where a player holds a model of the opponent which is also a modeling player. Thus, an n-level player holds a strategy which is the best response against an (n-1)-level opponent model.

Recursive models raise the questions of how to acquire them and how to find the best response against them. They provide a promising and interesting direction for extending the model-based framework described in this work; a minimal sketch of the recursion appears below.
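In the sketch, an n-level player best-responds to the strategy of an (n-1)-level opponent; the learn_model and best_response hooks are hypothetical, and a 0-level model is simply a strategy consistent with the observed history:

    def n_level_strategy(n, history, learn_model, best_response):
        """Strategy of an n-level modeling player in a two-player repeated game."""
        if n == 0:
            return learn_model(history)          # a stationary model, e.g., a DFA
        # The opponent is assumed to be an (n-1)-level modeling player that
        # sees the same interaction from its own side.
        opponent_model = n_level_strategy(n - 1, swap_roles(history),
                                          learn_model, best_response)
        return best_response(opponent_model)     # best reply against that model

    def swap_roles(history):
        return [(opp, me) for (me, opp) in history]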

Dealing with opponents that try to learn your strategy: Assume that your opponent tries to learn your strategy. How should you behave? Should you ignore it and continue playing according to your strategy, or should you try to hide your behavior by taking misleading actions?

In some cases, hiding your strategy is the key to success. In the bargaining game, for example, if the seller determines its sequence of decreasing prices according to an "easy to learn" strategy, it will sell its products at the lowest price to buyers who succeed in identifying its bargaining strategy. On the other hand, there are many cases where exposing your strategy is the right thing to do. A policy of punishing defection is only effective when the other side is aware of it.

The questions raised in this framework are how much and when to deviate from your true strategy in order to fool your opponents, and how to do so optimally. A hiding strategy has goals opposite to those of an exploration strategy: it should evaluate the agent's moves according to their contribution to concealing its strategy and to making its plans harder to recognize.

Incorporating exploration strategies into multi-model search: The exploration strategies described in Chapter 3 enable the agent to explore the opponent's strategy by forcing the opponent to respond to exploration actions. The experiments performed in the IPD domain demonstrated that exploratory behavior is essential for the success of the model-based agent.

An interesting direction for research is incorporating exploration strategies into multi-model search. The M* algorithm is a best-response strategy that chooses the optimal action according to the current knowledge about the opponent's strategy, but it may suffer from the same weaknesses as the model-based strategy without exploration: it can get stuck in a local minimum and never expose the opponent's actual strategy. A natural extension is to let the player use a mixed model of its opponent's strategy, i.e., a set of models consistent with the opponent's past behavior together with a probability distribution over this set. By using mixed opponent models, the agent can evaluate its alternatives both according to their contribution to the game (their M^n value) and according to their contribution to the acquired knowledge about the opponent's strategy, reflected by the expected posterior belief distribution over the support set; a minimal sketch of such a combined criterion is given after this item.

By exploring the opponent's strategy, the agent may suffer from sub-optimal behavior at early stages of the game, but the knowledge acquired about the opponent's strategy is expected to provide a better outcome in the future.
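One way such a combined criterion could look is sketched below; the weight w, the value_of function standing in for the move's M^n value, and the information_gain function estimating the expected sharpening of the posterior beliefs are all illustrative assumptions:

    def score_move(move, models, beliefs, value_of, information_gain, w=0.1):
        """models: opponent models consistent with past behavior;
        beliefs: probability distribution over those models."""
        # Expected game value of the move under the mixed opponent model.
        expected_value = sum(p * value_of(move, m)
                             for m, p in zip(models, beliefs))
        # Expected contribution of the move to discriminating among the models,
        # e.g., the expected reduction in entropy of the posterior beliefs.
        bonus = information_gain(move, models, beliefs)
        return expected_value + w * bonus

The move with the highest score trades off immediate payoff against the knowledge it is expected to reveal about the opponent's strategy.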

Implementing model-based strategies in a real-world application: The Internet and other computer networks provide solid examples of real-world multi-agent systems. It seems that the demand for interaction strategies for agents operating in such environments will increase with the growth of these networks. Implementing model-based agents that operate in such networks can provide an excellent testbed for the ideas discussed in this thesis. Based on the thesis results, there is reason to believe that model-based strategies will contribute significantly to such agents' performance.

Finally, an agent that interacts with other agents in a multi-agent system can benefit significantly from adapting to the others. The model-based framework described in this work can serve as a general basis for future research on adaptive agents in multi-agent systems.


Bibliography

[1] D. Abreu and A. Rubinstein. The structure of Nash equilibrium in repeated games with finite automata. Econometrica, 56(6):1259-1281, 1988.

[2] D. Angluin. On the complexity of minimum inference of regular sets. Information and Control, 39:337-350, 1978.

[3] D. Angluin. Learning regular sets from queries and counterexamples. Information and Computation, 75:87-106, 1987.

[4] Robert J. Aumann and Michael B. Maschler. Repeated Games with Incomplete Information. The MIT Press, Cambridge, Massachusetts, 1995.

[5] R. Axelrod. Evolution of strategies in the iterated prisoner's dilemma. In L. Davis, editor, Genetic Algorithms and Simulated Annealing, pages 32-41. Morgan Kaufmann Publishers, 1987.

[6] Robert Axelrod. The Evolution of Cooperation. Basic Books, New York, 1984.

[7] Robert Axelrod. The Evolution of Strategies in the Iterated Prisoner's Dilemma. Cambridge University Press, 1996.

[8] E. Ben-Porath. The complexity of computing a best response automaton in repeated games with mixed strategies. Games and Economic Behavior, 2:1-12, 1990.

[9] Hans Berliner. The B* tree search algorithm: A best-first proof procedure. Artificial Intelligence, 12:23-40, 1979.

[10] Donald A. Berry and Bert Fristedt. Bandit Problems: Sequential Allocation of Experiments. Chapman and Hall, 1985.

[11] D. P. Bertsekas. Dynamic Programming: Deterministic and Stochastic Models. Prentice-Hall, 1987.

[12] K. G. Binmore. Fun and Games. The MIT Press, 1990.

[13] Alan H. Bond and Les Gasser, editors. Readings in Distributed Artificial Intelligence. Morgan Kaufmann, 1988.

[14] H. H. Bui, D. Kieronska, and S. Venkatesh. Learning other agents' preferences in multi-agent negotiation. In Proceedings of the Thirteenth National Conference on Artificial Intelligence (AAAI-96), pages 114-119, Portland, Oregon, August 1996.


[15] David Carmel and Shaul Markovitch. Learning models of opponent's strategies in game playing. In Proceedings of the AAAI Fall Symposium on Games: Planning and Learning, pages 140-147, Raleigh, NC, October 1993.

[16] David Carmel and Shaul Markovitch. Incorporating opponent models into adversary search. In Proceedings of the Thirteenth National Conference on Artificial Intelligence (AAAI-96), pages 120-125, Portland, Oregon, August 1996.

[17] David Carmel and Shaul Markovitch. Incorporating opponent models into adversary search. Technical Report CIS9607, Technion, March 1996.

[18] David Carmel and Shaul Markovitch. Learning models of intelligent agents. In Proceedings of the Thirteenth National Conference on Artificial Intelligence (AAAI-96), pages 62-67, Portland, Oregon, August 1996.

[19] David Carmel and Shaul Markovitch. Opponent modeling in multi-agent systems. In G. Weiss and S. Sen, editors, Adaptation and Learning in Multi-agent Systems, Lecture Notes in AI. Springer-Verlag, 1996.

[20] David Carmel and Shaul Markovitch. Exploration and adaptation in multi-agent systems: A model-based approach. In Proceedings of the Fifteenth International Joint Conference on Artificial Intelligence (IJCAI-97), pages 606-611, Nagoya, Japan, August 1997.

[21] David Carmel and Shaul Markovitch. Model-based learning of interaction strategies in multi-agent systems. Journal of Experimental and Theoretical Artificial Intelligence (JETAI), to appear, 1997.

[22] David Carmel and Shaul Markovitch. Exploration strategies for model-based learning in multi-agent systems. Autonomous Agents Journal, to appear, 1998.

[23] David Carmel and Shaul Markovitch. How to explore your opponent's strategy (almost) optimally. In Proceedings of the Third International Conference on Multi-agent Systems (ICMAS-98), Paris, France, July 1998.

[24] David Carmel and Shaul Markovitch. Pruning algorithms for multi-model adversary search. Artificial Intelligence, 99(2):325-355, 1998.

[25] J. Christensen and R. E. Korf. A unified theory of heuristic evaluation functions and its application to learning. In Proceedings of the Fifth National Conference on Artificial Intelligence, 1986.

[26] D. A. Cohn. Neural network exploration using optimal experimental design. In J. D. Cowan, G. Tesauro, and J. Alspector, editors, Advances in Neural Information Processing Systems 6, pages 679-686. Morgan Kaufmann, 1994.

[27] T. H. Cormen, C. E. Leiserson, and R. L. Rivest. Introduction to Algorithms. MIT Press, Cambridge, Mass., 1990.

[28] Peter Dayan and Terrence J. Sejnowski. Exploration bonuses and dual control. Machine Learning, 25(1):5-22, 1996.

[29] R. O. Duda and P. E. Hart. Pattern Classification and Scene Analysis. Wiley and Sons, New York, 1973.


[30] Edmund H. Durfee and Jeffrey S. Rosenschein. Distributed problem solving and multi-agent systems: Comparisons and examples. In Proceedings of the 13th International Workshop on Distributed Artificial Intelligence, pages 94-104, Seattle, WA, July 1994.

[31] V. Fedorov. Theory of Optimal Experiments. Academic Press, New York, 1972.

[32] Tim Finin, Rich Fritzson, Don McKay, and Robin McEntire. KQML - a language and protocol for knowledge and information exchange. In Proceedings of the 13th International Workshop on Distributed Artificial Intelligence, pages 126-136, Seattle, WA, July 1994.

[33] D. B. Fogel. Evolving behaviors in the iterated prisoner's dilemma. Evolutionary Computation, 1993.

[34] L. Fortnow and D. Whang. Optimality and domination in repeated games with bounded players. In Proceedings of the 25th Annual ACM Symposium on Theory of Computing, pages 741-749, 1994.

[35] Y. Freund, M. Kearns, D. Ron, R. Rubinfeld, R. E. Schapire, and Linda Sellie. Efficient learning of typical finite automata from random walks. In Proceedings of the 25th Annual ACM Symposium on Theory of Computing, pages 315-324, 1993.

[36] Yoav Freund, Michael Kearns, Yishay Mansour, Dana Ron, Ronitt Rubinfeld, and Robert E. Schapire. Efficient algorithms for learning to play repeated games against computationally bounded adversaries. In Proceedings of the Annual Symposium on the Foundations of Computer Science, pages 332-341, 1995.

[37] D. Fudenberg and D. Levine. Theory of Learning in Games. MIT Press, 1997.

[38] I. Gilboa. The complexity of computing best response automata in repeated games. Journal of Economic Theory, 45:342-352, 1988.

[39] I. Gilboa and D. Samet. Bounded versus unbounded rationality: The tyranny of the weak. Games and Economic Behavior, 1:213-221, 1989.

[40] P. J. Gmytrasiewicz, E. H. Durfee, and D. K. Wehe. A decision theoretic approach to coordinating multiagent interactions. In Proceedings of the Twelfth International Joint Conference on Artificial Intelligence (IJCAI-91), pages 62-68, Sydney, Australia, August 1991.

[41] Piotr J. Gmytrasiewicz and Edmund H. Durfee. A rigorous, operational formalization of recursive modeling. In Victor Lesser, editor, Proceedings of the First International Conference on Multi-Agent Systems, pages 125-132, San Francisco, CA, 1995. MIT Press.

[42] E. M. Gold. Complexity of automaton identification from given data. Information and Control, 37:301-320, 1978.

[43] Claudia V. Goldman and Jeffrey S. Rosenschein. Mutually supervised learning in multiagent systems. In Gerhard Weiss and Sandip Sen, editors, Adaptation and Learning in Multi-Agent Systems, Lecture Notes in Artificial Intelligence, pages 85-96. Springer-Verlag, Berlin, 1996.

[44] J. E. Hopcroft and J. D. Ullman. Introduction to Automata Theory, Languages and Computation. Addison-Wesley, Mass., 1979.

[45] F. H. Hsu, T. S. Anantharaman, M. S. Campbell, and A. Nowatzyk. Deep Thought. In T. A. Marsland and J. Schaeffer, editors, Computers, Chess and Cognition, pages 55-78. Springer-Verlag, New York, 1990.


[46] H. Iida, Jos W. H. M. Uiterwijk, H. J. van den Herik, and I. S. Herschberg. Potential applications of opponent-model search, part I: The domain of applicability. ICCA Journal, 16(4):201-208, 1993.

[47] H. Iida, Jos W. H. M. Uiterwijk, H. J. van den Herik, and I. S. Herschberg. Potential applications of opponent-model search, part II: Risks and strategies. ICCA Journal, 17(1):10-14, March 1994.

[48] Hiroyuki Iida. Heuristic Theories on Game-Tree Search. PhD thesis, Tokyo University, 1994.

[49] Peter J. Jansen. Problematic positions and speculative play. In T. A. Marsland and J. Schaeffer, editors, Computers, Chess and Cognition, pages 169-182. Springer-Verlag, New York, 1990.

[50] Peter J. Jansen. Using Knowledge about the Opponent in Game-tree Search. PhD thesis, Carnegie Mellon University, 1992.

[51] Leslie P. Kaelbling. Learning in Embedded Systems. MIT Press, Cambridge, Mass., 1993.

[52] Ehud Kalai. Bounded rationality and strategic complexity in repeated games. In T. Ichiishi, A. Neyman, and Y. Tauman, editors, Game Theory and Applications, pages 131-157. Academic Press, San Diego, 1990.

[53] Ehud Kalai and Ehud Lehrer. Rational learning leads to Nash equilibrium. Econometrica, 61(5):1019-1045, September 1993.

[54] Grigoris I. Karakoulas. Probabilistic exploration in planning while learning. In Proceedings of the 11th Conference on Uncertainty in Artificial Intelligence (UAI-95), pages 352-361, July 1995.

[55] Roni Khardon. Learning to take actions. In Proceedings of the Thirteenth National Conference on Artificial Intelligence (AAAI-96), pages 787-792, Portland, Oregon, 1996.

[56] V. Knoblauch. Computable strategies for repeated Prisoner's Dilemma. Games and Economic Behavior, 7:381-389, 1994.

[57] D. E. Knuth and R. W. Moore. An analysis of alpha-beta pruning. Artificial Intelligence, 6(4):293-326, 1975.

[58] R. E. Korf. Depth-first iterative deepening: An optimal admissible tree search. Artificial Intelligence, 27:97-109, 1985.

[59] Richard E. Korf. Generalized game trees. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI-89), pages 328-333, Detroit, MI, August 1989.

[60] Richard E. Korf. Multi-player alpha-beta pruning. Artificial Intelligence, 48:99-111, 1991.

[61] Kevin J. Lang. Random DFA's can be approximately learned from sparse uniform examples. In Proceedings of the Fourth Annual ACM Conference on Computational Learning Theory, pages 45-52, 1992.

[62] Michael L. Littman. Markov games as a framework for multi-agent reinforcement learning. In Proceedings of the Eleventh International Conference on Machine Learning, pages 157-163, July 1994.


[63] Michael L. Littman. Algorithms for Sequential Decision Making. PhD thesis, Department of Computer Science, Brown University, 1996.

[64] Michael L. Littman and J. A. Boyan. A distributed reinforcement learning scheme for network routing. Technical Report CMU-CS-93-165, Carnegie Mellon University, School of Computer Science, 1993.

[65] R. D. Luce and H. Raiffa. Games and Decisions: Introduction and Critical Survey. John Wiley & Sons, 1957.

[66] C. A. Luckhardt and K. B. Irani. An algorithmic solution of n-person games. In Proceedings of the Fifth National Conference on Artificial Intelligence (AAAI-86), pages 158-162, Philadelphia, PA, August 1986.

[67] Christophe Meyer, J. G. Ganascia, and J. D. Zucker. Learning strategies in games by anticipation. In Proceedings of the Fifteenth International Joint Conference on Artificial Intelligence (IJCAI-97), pages 698-703, Nagoya, Japan, August 1997.

[68] M. S. Miller and K. E. Drexler. Markets and computation. In B. A. Huberman, editor, The Ecology of Computation, pages 133-176. Elsevier Science Publishers (North Holland), 1988.

[69] A. W. Moore and C. G. Atkeson. Prioritized sweeping: Reinforcement learning with less data and less time. Machine Learning, 13(1), 1993.

[70] Yishay Mor, Claudia V. Goldman, and Jeffrey S. Rosenschein. Learn your opponent's strategy (in polynomial time). In G. Weiss and S. Sen, editors, Adaptation and Learning in Multi-agent Systems, Lecture Notes in AI. Springer-Verlag, 1996.

[71] M. Nowak and K. Sigmund. Tit-for-tat in heterogeneous populations. Nature, 355:250-253, January 1992.

[72] Christos H. Papadimitriou. On players with a bounded number of states. Games and Economic Behavior, 4:122-131, 1992.

[73] Christos H. Papadimitriou and J. N. Tsitsiklis. The complexity of Markov decision processes. Mathematics of Operations Research, 12(3):441-450, 1987.

[74] D. Parkes and L. H. Ungar. Learning and adaptation in multi-agent systems. In AAAI-97 Workshop on Multi-agent Learning, 1997.

[75] Judea Pearl. Heuristics: Intelligent Search Strategies for Computer Problem Solving. Addison-Wesley, 1984.

[76] L. Pitt. Inductive inference, DFAs and computational complexity. In K. P. Jantke, editor, Analogical and Inductive Inference, Lecture Notes in AI 397, pages 18-44. Springer-Verlag, 1989.

[77] Aske Plaat. Research RE:search & RE-search. PhD thesis, Erasmus University, 1996.

[78] Anand S. Rao and Graeme Murray. Multi-agent mental-state recognition and its application to air-combat modeling. In Proceedings of the 13th International Workshop on Distributed Artificial Intelligence, pages 283-304, Seattle, WA, July 1994.

[79] R. L. Rivest and R. E. Schapire. Inference of finite automata using homing sequences. Information and Computation, 103(2):299-347, 1993.


[80] Dana Ron and Ronitt Rubinfeld. Exactly learning automata with small cover time. In Proceedings of the Seventh Annual ACM Conference on Computational Learning Theory, pages 427-436, 1995.

[81] Ronen I. Brafman and Moshe Tennenholtz. Modeling agents as qualitative decision makers. Artificial Intelligence, 94(1):217-268, 1997.

[82] J. S. Rosenschein and J. S. Breese. Communication-free interactions among rational agents: A probabilistic approach. In L. Gasser and M. N. Huhns, editors, Distributed Artificial Intelligence Vol. II, pages 99-118. Morgan Kaufmann Publishers, 1989.

[83] A. Rubinstein. Finite automata play the repeated Prisoner's Dilemma. Journal of Economic Theory, 39:83-96, 1986.

[84] Stuart Russell and Peter Norvig. Artificial Intelligence: A Modern Approach. Prentice Hall, 1995.

[85] Arthur L. Samuel. Some studies in machine learning using the game of checkers. IBM Journal, 3:211-229, 1959.

[86] Arthur L. Samuel. Some studies in machine learning using the game of checkers II - recent progress. IBM Journal, 11:601-617, 1967.

[87] T. W. Sandholm and R. H. Crites. Multiagent reinforcement learning and the iterated Prisoner's Dilemma. Biosystems Journal, 37:147-166, 1995.

[88] Mitsuo Sato, Kenichi Abe, and Hiroshi Takeda. Learning control of finite Markov chains with an explicit trade-off between estimation and control. IEEE Transactions on Systems, Man and Cybernetics, 18(5), September 1991.

[89] J. Schaeffer, J. Culberson, N. Treloar, B. Knight, P. Lu, and D. Szafron. A world championship caliber checkers program. Artificial Intelligence, 53:273-289, 1992.

[90] Sandip Sen, Mahendra Sekaran, and John Hale. Learning to coordinate without sharing information. In Proceedings of the Twelfth National Conference on Artificial Intelligence, pages 426-431, Seattle, WA, 1994.

[91] C. E. Shannon. Programming a computer for playing chess. Philosophical Magazine, 41:256-275, 1950.

[92] L. S. Shapley. Stochastic games. In Proceedings of the National Academy of Sciences of the USA, volume 39, pages 1080-1087, 1953.

[93] M. J. Shaw and A. B. Whinston. Learning and adaptation in distributed artificial intelligence systems. In L. Gasser and M. N. Huhns, editors, Distributed Artificial Intelligence Vol. II, pages 413-430. Morgan Kaufmann Publishers, 1989.

[94] Yoav Shoham and Moshe Tennenholtz. Emergent conventions in multi-agent systems: Initial experimental results and observations (preliminary report). In Knowledge Representation (KR-92), 1992.

[95] Yoav Shoham and Moshe Tennenholtz. Co-learning and the evolution of social activity. Technical Report STAN-CS-TR-94-1511, Stanford University, Department of Computer Science, 1994.


[96] Herbert A. Simon. Models of Bounded Rationality. The MIT Press, Cambridge, Massachusetts, 1983.

[97] Richard S. Sutton. Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In Proceedings of the 7th International Conference on Machine Learning, pages 216-224, San Mateo, CA, 1990. Morgan Kaufmann.

[98] Milind Tambe. Agent and agent-group tracking in a real-time dynamic environment. In Victor Lesser, editor, Proceedings of the First International Conference on Multi-Agent Systems, pages 368-375, San Francisco, CA, 1995. MIT Press.

[99] Milind Tambe and Ernesto Brodersohn-Ostrovich. Some challenges in tracking agent teams. In Milind Tambe and Piotr Gmytrasiewicz, editors, Working Notes of the AAAI-96 Workshop on Agent Modeling, pages 83-89, Portland, OR, August 1996.

[100] M. Tan. Multi-agent reinforcement learning: Independent vs. cooperative agents. In Proceedings of the Tenth International Conference on Machine Learning, pages 330-337, 1993.

[101] Sebastian B. Thrun. The role of exploration in learning control. In David A. White and Donald Sofge, editors, Handbook of Intelligent Control. Multiscience Press Inc., 1992.

[102] B. Trakhtenbrot and Y. Barzdin. Finite Automata: Behavior and Synthesis. North Holland Publishing Company, Amsterdam, Holland, 1973.

[103] M. van der Meulen. Weight assessment in evaluation functions. In D. F. Beal, editor, Advances in Computer Chess 5, pages 81-89. Elsevier Science Publishers, Amsterdam, 1989.

[104] C. J. C. H. Watkins and P. Dayan. Technical note: Q-learning. Machine Learning, 8:279-292, 1992.

[105] G. Weiss. Learning to coordinate actions in multi-agent systems. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI-93), pages 311-316, August 1993.

[106] G. Weiss and S. Sen, editors. Adaptation and Learning in Multi-agent Systems, Lecture Notes in AI 1042. Springer-Verlag, 1996.

[107] M. P. Wellman. A general-equilibrium approach to distributed transportation planning. In Proceedings of the Tenth National Conference on Artificial Intelligence (AAAI-92), pages 282-289, 1992.

[108] Michael J. Wooldridge and Nicholas R. Jennings. Agent theories, architectures, and languages: A survey. In Intelligent Agents, Lecture Notes in AI 890. Springer-Verlag, 1995.
