A Markov Decision Process (MDP) framework for active localization
Thesis submitted in partial fulfillment
of the requirements for the degree of
M.S. by Research
by
Jyotika Bahuguna
Robotics Research Lab
International Institute of Information Technology
Hyderabad - 500 032, INDIA
May 2010
Copyright © Jyotika Bahuguna, 2010
All Rights Reserved
International Institute of Information Technology
Hyderabad, India
CERTIFICATE
It is certified that the work contained in this thesis, titled “A Markov Decision Process (MDP) framework
for active localization” by Jyotika Bahuguna, has been carried out under my supervision and is not
submitted elsewhere for a degree.
Date Adviser: Dr. Madhava Krishna
Dedicated to my family and friends, who refused to give up on me.
Acknowledgments
I thank my guide Dr. Madhava Krishna for his constant support during my research, from showing flexibility in letting me choose my research area to ensuring that I was making progress.
I thank my co-advisor Dr. Balaraman Ravindran, who has been an inspiration to me. His constant pursuit of excellence forced me multiple times to raise my standards, and I ended up learning a lot each time. I will especially cherish the brainstorming sessions with him.
Last but not least, I thank my family and friends, who were always around whenever I needed them.
Abstract
Robot localization is the problem of estimating a robot’s coordinates relative to an external reference
frame, say a map. Given a map of its environment, the robot has to localize itself relative to this map by
consulting its sensor data. The task of the robot is to find out where it is, through sensing and motion.
Localization has proved to be the most critical aspect of mobile robot navigation. There are different forms of the localization problem: position tracking, which seeks to compensate for small dead reckoning errors (odometry errors) under the assumption that the initial robot pose is known, and global self-localization, which addresses the problem of localization with no a priori information, i.e., a robot is not told its initial pose but instead has to determine it from scratch.
In environments which possess relatively few features that enable a robot to unambiguously determine its location, global localization algorithms can result in multiple hypotheses about the location of a robot. This is inevitable, as the local environment seen by a robot repeats at several parts of the map. Such a situation occurs commonly in indoor navigation, such as in corridors and office rooms. For effective localization, the robot has to be actively guided to those locations where there is a maximum chance of eliminating most of the ambiguous states. This task is often referred to as 'active localization'. Active localization is based on the decisions of 'where to move' and 'where to look' so as to best localize the robot. These actions mainly comprise moving towards unique features in the map (e.g., distinct obstacles in a static environment), such that the sensor readings become unique and hence the robot localizes.
This thesis presents a learning framework for implementing active localization. In a given map, the framework should be able to determine what actions to execute in a state, such that localization is accelerated. Since global localization poses the problem of perceptual aliasing, a belief state instead of a state is used to describe a robot's position in the map. A belief-state MDP (Markov Decision Process) is used to capture the mapping between the robot's belief state(s) and action(s) which will maximize the expected discounted rewards. The novelty of this work lies in the fact that it proposes a hierarchical belief-state framework for the MDP to perform the active localization task. Markov localization is typically used to estimate a belief state from a given set of the robot's sensor readings. This belief-state space, being extensive, is represented by yet another hierarchical belief-state space, which makes the MDP more tractable. Hence the framework enables localization with the accuracy of fine grid granularity (as in Markov localization) without having to handle large-state-space MDPs. A detailed mathematical formulation of the framework has been provided.
The feasibility of this framework is shown by testing it on different aspects of the active localization task. It is used to perform multi-robot active localization, where the framework not only learns to seek out unique features in the map, but also learns to detect other robots, thereby facilitating localization even in the absence of map features. It is observed that the multi-robot localization algorithm proves to be an intuitive, semi-distributed algorithm which is able to attain autonomy with respect to the number of robots. That is to say, unlike a centralized algorithm that requires re-calculation from scratch with a change in the number of agents involved, this algorithm can be re-used without re-training for a range of numbers of robots. The localization performance is analyzed by comparing it with a learning framework which uses a random walk.
The framework is also used to perform transfer of the learned localization policy from a robot with higher sensor capabilities to one with lower sensor capabilities, thereby enabling the possibility of accelerating localization in a setup with heterogeneous robots. The change in sensor range is treated as a change in the observation function in the MDP context, and a transfer policy is designed to transfer the knowledge between the two belief-state MDPs. The localization performance with transfer is analyzed by comparing it with the performance when learning is done from scratch without any transfer. It is observed that transfer leads to a better learning rate, as the performance reaches an optimum in a lower number of training iterations. One of the observations also shows the effect of map type on transfer and suggests that the advantages of transfer are more predominant in the case of sparse maps than dense maps. The contribution here lies in attempting transfer across belief-state MDPs, since transfer has been limited to just MDPs till now. Also, the transfer aspect is applied and analyzed for active localization for the first time.
Contents

1 Introduction
  1.1 Localization
    1.1.1 Passive Localization
    1.1.2 Active Localization
  1.2 Markov Decision Process (MDP)
    1.2.1 Belief-state MDP
  1.3 Problem statement and Contributions
  1.4 Organization of thesis
2 Framework Design
  2.1 Markov Localization
  2.2 Function approximator (CMAC)
  2.3 Solving MDP(s)
  2.4 Action space
  2.5 Rewards
  2.6 Observations
  2.7 Framework formulation
3 Multi-robot localization
  3.1 Problem statement
  3.2 Related Work
  3.3 Problem description
  3.4 Algorithm
  3.5 Simulation Snapshots
  3.6 Experiments and Results
    3.6.1 On different kinds of maps
    3.6.2 With different numbers of robots
    3.6.3 With different cell sizes
    3.6.4 With different kinds of clustering
    3.6.5 Comparison with centralized method
  3.7 Observations
    3.7.1 Self-balancing property of the framework
4 Transfer Learning
  4.1 Problem statement
  4.2 Related Work
  4.3 Problem description
  4.4 Background on Transfer learning
  4.5 Transfer scenario in our case
  4.6 Formalism for transfer
  4.7 Algorithm
  4.8 An example
  4.9 Experiments and Results
    4.9.1 Transfer Metrics
    4.9.2 Task specific trends
5 Conclusions
  5.1 Limitations and Scope of future work
Bibliography
List of Figures

1.1 An example of global localization using sonar [1]
1.2 Example situation that shows a typical belief state during global localization in a locally symmetric environment [1]
1.3 Probability density function of robot position in continuous and discrete maps. (a) Continuous map with single hypothesis, (b) Continuous map with multiple hypotheses, (c) Discretized map with probability distribution and (d) Discretized topological map with probability distribution [2]
2.1 Map discretized into grids by Markov Localization
2.2 Multimodal distribution of belief returned by the passive Markov localization. W is a wall at the top, disconnected from the rest of the environment. The actual robot position is shown by a thick black dot (third cluster from left on top, labeled R)
2.3 Cell built over the fine grid structure in a map
2.4 CMAC with one-dimensional input and output. The first set of layers receiving the input are tiles
2.5 Reinforcement learning
2.6 Grid world example. Agent should learn to reach the goal state G
2.7 Reward distribution on transitions among the grids
2.8 The value function solution for the grid world example. The number in each grid depicts the goodness value of that grid, i.e., how beneficial it is to be in that grid/state
2.9 The Q-value function solution for the grid world example. The number in each grid depicts the goodness value of being in that state and the action executed, as shown by the arrow
2.10 Action space for the robot. (Axes denote the actions)
3.1 Cubicles in an office
3.2 Trees in an open space
3.3 Scene of a warehouse
3.4 Initial hypotheses of the robot
3.5 Sensor scan is obtained from each virtual position
3.6 Execute the decided action
3.7 Robot localized (only one hypothesis left)
3.8 Map A (arrows show the direction preferred by the robots)
3.9 Map B (arrows show the direction preferred by the robots)
3.10 Trend showing the average number of localization iterations vs. number of robots
3.11 The position of the robot referred to in table 3.1
3.12 Hypotheses and frontiers of the robot referred to in table 3.1
3.13 Localization performance with respect to cell sizes
3.14 Trend of localization performance vs. cell size (Map B)
3.15 Localization performance vs. clustering methods for various cell sizes (Map A)
3.16 Number of iterations to localize per robot, compared for semi-distributed and centralized methods
3.17 State transitions (red numbers above the arrows indicate the action taken)
3.18 Variation in the Q(a) behavior with respect to the state transition
4.1 Difference in sensing ranges of Sonar and Laser Range Finders
4.2 Robot RA with a high sensor range
4.3 The belief state for the robot RA with a high sensor range
4.4 Robot RB with a low sensor range
4.5 The belief state for the robot RB with a low sensor range
4.6 Performance comparison (with transfer learning vs. without transfer learning)
4.7 Performance vs. ratio of source and target sensor ranges of the robot
4.8 Maps: relatively cluttered map (top), sparse map (bottom)
4.9 Learning curves of Map 1 (relatively cluttered, top) vs. Map 2 (sparse map)
List of Tables

3.1 For robot position (225,150) as shown in 3.12, rewards and V(b′) with respect to the cell size
3.2 For robot position (225,150) as shown in 3.11, Q-values and action selected with respect to the cell size
3.3 Localization performance with respect to change in cell sizes for Map A as shown in figure 3.11
3.4 Localization performance with respect to change in cell sizes for Map B
3.5 Localization performance with respect to change in cell sizes and clustering methods (Map A)
4.1 Key for Allowed Task Differences, with reference to table 4.6
4.2 Key for Transferred Knowledge, with reference to table 4.6
4.3 Key for Source Task Selection, with reference to table 4.6
4.4 Key for Allowed Learners, with reference to table 4.6
4.5 Key for TL Metrics, with reference to table 4.6
4.6 TL methods for transfer across same state variables and actions. N/A implies the absence of any task mapping
4.7 Q-values for RB with transfer
4.8 Q-values for RB without transfer
Chapter 1
Introduction
Robotics is the science of perceiving and manipulating the physical world through computer-controlled devices. Examples of successful robotic systems include mobile platforms for planetary exploration, industrial robotic arms in assembly lines, cars that travel by themselves, and manipulators that assist surgeons. Robotic systems are situated in the physical world, perceive information about their environments through sensors, and manipulate it through physical forces [1].
In order to operate autonomously, robots should be able to accommodate large amounts of uncertainty in the physical world. The uncertainty might lie in the surrounding environment, the robot's sensors, its actuation, or its software and algorithmic approximations. For more generalized applications, where these uncertainties cannot be engineered away, robots will be forced to take the right decisions in spite of the uncertainty in their sensors and internal models. This problem is becoming more relevant with the need for more and more robotic applications in external environments.
Probabilistic robotics is a relatively new approach to robotics that pays tribute to the uncertainty in robot perception and action. The key idea in probabilistic robotics is to represent uncertainty explicitly using the calculus of probability theory. In other words, instead of relying on a single "best guess" as to what might be the case, probabilistic algorithms represent information by probability distributions over a whole space of guesses. By doing so, they can represent ambiguity and degree of belief in a mathematically sound way [1].
1.1 Localization
In applications which require reliable navigation and mapping, determining the robot's pose becomes crucial. Mobile robot localization is the problem of determining the pose of a robot relative to a given map of the environment. It is often called position estimation, which is done by consulting the sensor readings of the robot. Unfortunately, the robot's pose cannot be estimated from a single sensor measurement, either because of noisy sensor data and/or the nature of the map (it has similar sub-parts). Not every localization problem is equally hard. It depends on the nature of the environment and the initial knowledge that a robot may possess with respect to the localization problem. The types of localization problems, depending on the type of knowledge available a priori and at run-time, are listed below.
Position tracking – which seeks to compensate for small dead reckoning errors (odometry errors) under the assumption that the initial robot pose is known. Localizing the robot is done by accommodating the noise in robot motion. This is based on the assumption that the pose error is small. The pose uncertainty is often approximated by a unimodal distribution (e.g., a Gaussian). The position tracking problem is a local problem – often called local localization – since the uncertainty is local and confined to a region near the robot's true pose.

Global self-localization – which addresses the problem of localization with no a priori information, i.e., a robot is not told its initial pose but instead has to determine it from scratch. An example of global localization using a robot with sonar sensors is shown in figure 1.1. This problem is more difficult, since the error in the robot's estimate cannot be assumed to be small. Consequently, a robot should be able to handle multiple, distinct hypotheses – making unimodal probability distributions usually inappropriate.
Figure 1.1 An example of global localization using sonar [1]
1.1.1 Passive Localization
In passive localization, the localization module only observes the robot operating. The robot is controlled through some other means, and its motion is not aimed at facilitating localization. For example, the robot might move randomly or perform its everyday tasks.
1.1.2 Active Localization
This set of localization problems is based on the fact that the localization algorithm controls the motion of the robot. Active localization algorithms control the robot so as to minimize the localization error and/or the costs arising from moving a poorly localized robot into a hazardous place.

Active approaches to localization usually yield better localization results than passive ones. Active localization is based on the decisions of 'where to move' and 'where to look' so as to best localize the robot.
In environments which possess relatively few features that enable a robot to unambiguously determine its location, global localization algorithms can result in multiple hypotheses about the location of a robot. This is inevitable with global localization algorithms, as the local environment seen by a robot repeats at other parts of the map. Such a situation occurs commonly in indoor navigation, such as in corridors and office rooms. An example of the same is shown in figure 1.2 – a robot is located in a symmetric corridor, and its belief is centered at two (symmetric) poses. Thus, for effective localization, the robot has to be actively guided to those locations where there is a maximum chance of eliminating most of the ambiguous states – a task often referred to as 'active localization'.
Figure 1.2 Example situation that shows a typical belief state during global localization in a locally symmetric environment [1]
Such a scenario involving multiple hypotheses can be represented and worked upon with the help of the probabilistic framework of Markov localization. In Markov localization, the robot maintains a probability density (belief) over the entire configuration space of the robot. The probabilistic paradigm represents the robot's momentary belief by a probability density function over the space of all locations. This is illustrated in figure 1.3. It shows belief representations of the robot position (1D) in continuous and discrete maps: (a) continuous map with a single hypothesis, (b) continuous map with multiple hypotheses, (c) discretized map with probability distribution, and (d) discretized topological map with probability distribution.
1.2 Markov Decision Process (MDP)
An MDP is defined as a 4-tuple (S, A, P, R), in which S is the finite set of states, A is the finite set of actions, P is the transition probability function, and R is the reward function. If the state and action spaces are finite, then it is called a finite Markov decision process (finite MDP).

Figure 1.3 Probability density function of robot position in continuous and discrete maps. (a) Continuous map with single hypothesis, (b) Continuous map with multiple hypotheses, (c) Discretized map with probability distribution and (d) Discretized topological map with probability distribution [2]

The dynamics of the environment are defined by the transition probability function

P : S × A × S → [0, 1]    (1.1)

The reward function is defined as a real-valued bounded function

R : S × A × S → ℝ    (1.2)
An MDP satisfies the Markov property, which states that the one-step dynamics suffice to predict the next state and the expected next reward given the current state and action. The objective is to find a value function V(s), a function of states that estimates how good it is for the agent to be in a given state. The notion of "how good" here is defined in terms of the future rewards that can be expected or, to be precise, in terms of expected return. Of course, the rewards that the agent can expect depend on what actions it will take. Accordingly, value functions are defined with respect to particular policies. A policy, π, is a mapping from each state s ∈ S and action a ∈ A to the probability π(s, a) of taking action a in state s [3]. A value function V^π(s) hence determines the value of state s under the policy π, i.e., the expected return when starting in s and following the policy π thereafter. The solution of the value function is given by Bellman's optimality equation. For all s ∈ S:

V(s) = max_{a ∈ A} Σ_{s′} P^a_{ss′} [ R^a_{ss′} + γ V(s′) ]    (1.3)
The Bellman equations can be solved using iterative algorithms, where we start with an initial guess V_0 and iterate for every state s:

V_{k+1}(s) = max_{a ∈ A} Σ_{s′ ∈ S} P^a_{ss′} [ R^a_{ss′} + γ V_k(s′) ]    (1.4)

As k goes to infinity, V_k converges to the optimal value function V*.
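The iteration in equation (1.4) can be sketched as a short value-iteration routine. The transition and reward arrays below are hypothetical toy values chosen for illustration, not from the thesis; the sketch assumes P and R are indexed as [s, a, s′].

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """Iterate V_{k+1}(s) = max_a sum_s' P[s,a,s'] (R[s,a,s'] + gamma V_k(s'))."""
    n_states = P.shape[0]
    V = np.zeros(n_states)
    while True:
        # Q[s, a] = sum over s' of P[s,a,s'] * (R[s,a,s'] + gamma * V[s'])
        Q = np.einsum('ijk,ijk->ij', P, R + gamma * V[None, None, :])
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)   # optimal values, greedy policy
        V = V_new

# Toy 2-state, 2-action MDP: landing in state 1 yields reward 1.
P = np.array([[[0.8, 0.2],    # from s=0, action 0
               [0.1, 0.9]],   # from s=0, action 1
              [[1.0, 0.0],    # from s=1, action 0
               [0.0, 1.0]]])  # from s=1, action 1 (stay)
R = np.zeros((2, 2, 2))
R[:, :, 1] = 1.0
V, policy = value_iteration(P, R)
print(policy)  # both states prefer the action heading toward state 1
```

The stopping test bounds the distance to the fixed point because the Bellman backup is a γ-contraction, which is exactly the convergence claim made above.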
1.2.1 Belief-state MDP
A POMDP (Partially Observable Markov Decision Process) model decisionproblems in which an
agent tries to maximize its reward in the face of limited and/or noisy sensor feedback, i.e state is only
partially observable. A POMDP is a tuple( S,A,T,R,O,ρ ) where S is a set of states, A a set of actions
andρ a set of observations. The functions T and R define a MDP with which the agent reacts without
direct information as to the current state. The agent’s decisions are made based on information from its
sensors (observations) formalized by:
O : S × A × S → Π(ρ) (1.5)
The algorithms to solve a POMDP are computationally intensive and hence a POMDP is generally
converted to a belief-state MDP. A belief state is a probability distribution over allthe states of the
underlying environment. The notationb(s) is used to indicate the agent’s belief that it is in stateswhen
current belief state isb ∈ Π (S). Using the model, belief states can be updated based on the action’s
observations in a way that makes the beliefs correspond exactly to state occupation probabilities.
All future references to MDP should be assumed as a belief-state finite MDP, unless specified.
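The belief update mentioned above is a Bayes filter over the underlying states. A minimal sketch, assuming (purely for illustration) array conventions T[s, a, s′] for the transition model and O[s′, a, o] for the observation model:

```python
import numpy as np

def belief_update(b, a, o, T, O):
    """Update belief b after taking action a and observing o.

    T[s, a, s2] -- transition model of the underlying MDP
    O[s2, a, o] -- probability of observing o in state s2 after action a
    """
    b_pred = T[:, a, :].T @ b        # prediction step: sum_s T(s,a,s') b(s)
    b_new = O[:, a, o] * b_pred      # correction by the observation likelihood
    return b_new / b_new.sum()       # renormalize to a probability distribution

# Two aliased states, one action, two observations that disambiguate them.
T = np.eye(2).reshape(2, 1, 2)               # the robot stays where it is
O = np.array([[[0.9, 0.1]],                  # O[s2, a, o]
              [[0.1, 0.9]]])
b = np.array([0.5, 0.5])                     # perceptually aliased belief
b = belief_update(b, a=0, o=0, T=T, O=O)
print(b)   # belief shifts toward state 0
```

The normalization makes b a proper distribution again, so the belief entries correspond to state-occupation probabilities as stated above.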
1.3 Problem statement and Contributions
The objective of this thesis is to design a learning framework for the problemof active localization.
The frameworkshould be able to handle some important aspects of active localization. The advantage
of a such a framework is finding applications of research in the learning domain in active localization
context. The framework covers ground on two aspects of active localization. Initiating from localization
of a single robot, it is able to take advantage of multiple robots in the environmentand accelerate the
localization in a distributed algorithm. Secondly, the concept of transfer in MDP(s) is successfully used
to transfer the knowledge of ”how to localize” among robots with different sensory capabilities for a
given map. Hence the thesis deals with two sub-problems:
1. Given a workspace consisting of ’n’ robots, devise a learning policy thatsuch localization is
accelerated by taking the advantage of unique map features and robot-robot detections, given a
map.
2. Given two robots with different sensor ranges, transfer the policy to localizefrom one robot to
another , given a map.
Since the temporal relations to be modelled in the above cases are relative, an MDP is an ideal choice. Also, since no information is available a priori, a condition of perceptual aliasing (many different places can be perceived to be the same) is imposed. Each of these perceived places can be thought of as a hypothesis. A belief state, hence, is a probability distribution over these hypotheses. A belief-state MDP can be used to incorporate this uncertainty. The Markov localization technique is used for belief-state estimation.
In general, work on active localization has been limited compared to passive localization. The only known work on active localization of multiple robots used a centralized algorithm that used robot-robot detections for an optimal way to localize in a minimum number of steps [4]. Being a centralized algorithm, it suffers the disadvantage of an increase in the number of computations with an increase in the number of robots. A learning framework introduces the option to use a distributed algorithm to perform the same task. Using a learning framework to perform active localization for multiple robots is a novel approach, as far as the knowledge of the author goes. A detailed analysis of the implications of using such a framework for the task is an added contribution.
Also, the idea of re-using the localization knowledge across robots with different sensor ranges presented in this thesis is novel. This bit of research is also significant from the machine learning point of view, since till date transfer was limited to MDPs and not belief MDPs.
1.4 Organization of thesis
The present chapter serves as an introduction. Section 1.1 gives background on the problem of localization and the motivation behind the need to solve it. Section 1.2 introduces the generic Markov Decision Process (MDP) and the kinds of problems that can be solved using it. Section 1.3 puts forth the problem statement and discusses the contributions of this thesis. Section 1.4 describes the organization of this thesis.

Chapter 2 describes the framework design. Section 2.1 describes the method of Markov localization, which is central to the framework. Section 2.2 describes the function approximator, the Cerebellar Model Articulation Controller (CMAC), and its role in the framework. Section 2.3 describes the working of a typical MDP in detail. Sections 2.4, 2.5 and 2.6 introduce the concepts of MDP action space, rewards and observations in the active localization context. Section 2.7 formulates the framework from the consolidated knowledge of all the above-mentioned sections.
Chapter 3 describes how the problem of active localization involving multiple robots can be solved
using the MDP framework. Section 3.1 states the problem statement. Section 3.2 discusses the previous
work done in the area. Section 3.3 discusses the problem statement in detail. Section 3.4 explains the
algorithm for multi-robot active localization. Section 3.5 explains the various stages of the algorithm with the help of snapshots. Section 3.6 explains the various experiments done to analyze the method. Section
3.7 describes some interesting observations made during the experiments.
Chapter 4 describes the concept of transferring the knowledge of "how to localize" across robots with different sensor capabilities. Section 4.1 states the problem statement. Section 4.2 describes the related work in the transfer learning and active localization context. Section 4.3 describes the problem statement. Section 4.4 gives background on transfer learning methods, metrics and parameters. Section 4.5 explains the scenario of transfer handled in the thesis. Sections 4.6 and 4.7 explain the algorithm to perform the transfer with the existing framework. Section 4.8 explains the above with the help of an example. Section 4.9 describes the experiments performed to verify transfer and the results obtained.
Finally, chapter 5 gives the concluding remarks and provides pointers for further research.
Chapter 2
Framework Design
This chapter explains the framework details. A belief-state MDP requires the belief state to be estimated from a set of sensor readings. This is done by means of passive Markov localization.
2.1 Markov Localization
Markov localization discretizes the map into grids of adequately small size, so that localization
within a grid is unimportant. Each grid is represented by an (x, y) value, which can be the centre of the
grid. The discussed structure is shown in figures 2.1 and 3.4.

Figure 2.1 Map discretized into grids by Markov localization

The sensor readings obtained by the robot are compared with the sensor readings that would be obtained
at each of these virtual grid positions, and a probability distribution is generated on the basis of the
similarity between the actual and virtual sensor readings. Only the probabilities above a certain threshold
are considered, which leads to cluster-like formations, each of which indicates a virtual position
(hypothesis) of the robot. This is shown in figure 2.2.
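As a rough illustration of this step (the helper name and numbers below are hypothetical, not part of the thesis implementation), a normalized belief over grids can be obtained from per-grid similarity scores between the actual and virtual readings, with the thresholding that produces the cluster-like formations:

```python
def belief_from_similarity(similarities, threshold=0.1):
    """Turn per-grid similarity scores between the actual and the virtual
    sensor readings into a thresholded, renormalized belief distribution."""
    total = sum(similarities)
    belief = [s / total for s in similarities]
    # Keep only probabilities above the threshold (cluster-like formations).
    belief = [p if p >= threshold else 0.0 for p in belief]
    kept = sum(belief)
    return [p / kept for p in belief]

# Four virtual grid positions; two resemble the actual readings strongly,
# so the belief is bimodal (two hypotheses survive).
b = belief_from_similarity([0.9, 0.05, 0.9, 0.05])
```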
Active localization involves moving in directions such that this distribution becomes unimodal, i.e.,
the robot is no longer confused about its position.

Figure 2.2 Multimodal distribution of belief returned by passive Markov localization. W is a wall at the top, disconnected from the rest of the environment. The actual robot position is shown by a thick black dot (third cluster from the left on top), labeled R.

The direction always points to one of the surrounding frontiers from a hypothesis location. These
frontiers at each hypothesis location are computed as in [5]. The reward on reaching the frontier is
calculated by re-estimating the belief state taking the new information into account and then subtracting
the entropies of the former and latter belief states. A positive reward indicates a reduction in entropy
(a strengthening of the belief about the robot's position).
The aim of the framework is to estimate a mapping between the belief states and the direction to
move, such that the expected reward is maximized in the long run. Ideally, the entire grid framework
should compose the input belief state for the MDP. But due to the large number of grids, the belief state
space would be extensive, and update operations with respect to the readings from each grid render the
MDP intractable. Markov localization applies clustering over the probability distributions and makes
the enumeration of hypotheses more manageable. The clusters could be used as input for the MDP, but the
number of clusters that can form might vary depending on the position in the map. Hence the requirement
for a coarser grid-like framework over the clusters.
A hierarchy of grids is built over this underlying fine grid structure; each element of this coarser level
is called a cell. The concept of a cell is shown in figure 2.3. The cell structure is much coarser in
granularity than the grid structure and hence makes an MDP much more feasible. The state space of the
MDP now comprises probability distributions across the cells.
2.2 Function approximator (CMAC)
As stated in the previous section, the aim is to find a mapping between the belief states and the
direction to move, which in turn decides which action to execute. Since the state space is not discrete
(i.e., there can be infinitely many combinations of probability distributions among the cells), it is
impossible to enumerate all the possible belief states. Hence, a function approximator is used to
approximate this mapping (the value function V(b)).

Figure 2.3 Cell built over the fine grid structure in a map
A CMAC (Cerebellar Model Articulation Controller) is a sparse, coarse-coded associative memory
algorithm that mimics the functionality of the human cerebellum. In a CMAC, the input space is quantised
using a set of overlapping tiles, as shown in figure 2.4. For input spaces of high dimensionality, the
tiles form hypercubes. A query is performed by first activating all the tiles that contain the query point.
The activated tiles in turn activate memory cells which contain stored values: the weights of the system.
Summing these values produces the overall output. A change in the input vector results in a change in the
set of activated tiles, and therefore a change in the set of memory cells participating in the
CMAC output. More details about the CMAC implementation can be found in [6].
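A minimal sketch of the tile-coding idea behind a CMAC (a toy one-dimensional version with made-up tiling counts and learning rate, not the implementation of [6]): several overlapping tilings each map the input to one active tile, and the output is the sum of the weights of the active tiles.

```python
class TinyCMAC:
    """Toy 1-D CMAC over the input range [0, 1)."""
    def __init__(self, n_tilings=4, n_tiles=10, lr=0.1):
        self.n_tilings, self.n_tiles, self.lr = n_tilings, n_tiles, lr
        # One weight table per tiling; +1 slot absorbs the offset overflow.
        self.weights = [[0.0] * (n_tiles + 1) for _ in range(n_tilings)]

    def _active(self, x):
        # Each tiling is shifted by a fraction of a tile width, so nearby
        # inputs share most of their active tiles (generalization).
        for t in range(self.n_tilings):
            offset = t / (self.n_tilings * self.n_tiles)
            yield t, int((x + offset) * self.n_tiles)

    def predict(self, x):
        # Output = sum of the weights of all activated tiles.
        return sum(self.weights[t][i] for t, i in self._active(x))

    def update(self, x, target):
        # Spread the error equally over the activated tiles.
        err = target - self.predict(x)
        for t, i in self._active(x):
            self.weights[t][i] += self.lr * err / self.n_tilings

cmac = TinyCMAC()
for _ in range(50):                 # train one point towards the value 1.0
    cmac.update(0.3, 1.0)
```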
2.3 Solving MDP(s)
An MDP is defined as a 4-tuple (S, A, P, R), in which S is the finite set of states, A is the finite set of
actions, P is the transition probability function and R is the reward function. If the transition
probabilities are known, solving an MDP becomes a straightforward computational problem; if they are
unknown, it is a problem for reinforcement learning.
Consider an example: an agent (or robot) exists in a certain environment. The environment consists of
states, and the agent moves between states in this environment. Based on the current state information,
the agent decides which state to move to next. Depending on the agent's action, the environment returns
new state information and some reward. The goal in this problem is to maximize the agent's reward.

Figure 2.4 CMAC with one-dimensional input and output. The first set of layers receiving the input are tiles.

Figure 2.5 Reinforcement learning

The above interaction between the agent and the environment can be described as shown in figure 2.5.
Here, the agent starts at state s_0 and initially executes action a_0. This gets the agent reward r_1 and
takes it to state s_1. At state s_1, it executes action a_1, which gets it reward r_2 and takes it to
state s_2. And so on.
Define a policy π : S → A to be a mapping that determines what has to be done at the current state,
where S is the set of states and A the set of possible actions. The goal is to find a policy that
maximizes the reward. We define V^π(s) to be the value of policy π when starting at state s. We can write:

V^π(s) = E_π[ r_{t+1} + r_{t+2} + r_{t+3} + … | s_t = s ]   (2.1)

a_t = π(s_t)   (2.2)
The sum in equation 2.1 is typically infinite. In a more realistic setting the rewards in the future get
discounted, and equation 2.1 becomes:

V^π(s) = E_π[ r_{t+1} + λ r_{t+2} + λ² r_{t+3} + … | s_t = s ]   (2.3)

where λ ∈ [0, 1) is the decay factor.
Consider an agent that moves in the grid shown in figure 2.6, let us say deterministically. If it moves
to the goal state G, it gets reward 100. If it moves anywhere else it gets reward 0.

Figure 2.6 Grid world example. The agent should learn to reach the goal state G

Let us also assume that the decay factor is 0.9. The reward distribution is shown in figure 2.7.

Figure 2.7 Reward distribution on transitions among the grids

Before we address the issue of finding a policy, we look at how to evaluate a known policy π at a given
state s_t. We know from equation 2.3 that:
V^π(s) = E_π[ r_{t+1} + λ r_{t+2} + λ² r_{t+3} + … | s_t = s ]   (2.4)
       = E_π[ r_{t+1} | s_t ] + λ E_π[ r_{t+2} + λ r_{t+3} + … | s_t ]   (2.5)
       = E_π[ r_{t+1} + λ V^π(s_{t+1}) | s_t ]   (2.6)
The next-state function is:

P(s′ | s, a) = Pr[ s_{t+1} = s′ | s_t = s, a_t = a ]   (2.7)

which is the probability of landing in state s′ at time t+1, when the state at time t is s and the action
executed at time t is a.
The expected reward is:

r(s, a) = E[ r_{t+1} | s_t = s, a_t = a ]   (2.8)

which is the expected reward at time t+1, when the state at time t was s and the action executed was a.
Hence equation 2.6 becomes:
V^π(s) = r(s, a) + λ Σ_{s′} P(s′ | s, a) V^π(s′)   (2.9)
This, as mentioned in the earlier chapter, is the Bellman equation. It is a system of linear equations in
n unknowns, where n is the number of states, and it has a unique solution. Another way is to solve it
iteratively, as shown in equation 1.4. Unfortunately, neither solution is satisfying. For a simpler
approach, we can experiment with the environment to find an approximation to the solution. This is done
during the training phase of the MDP. We start with some initial value of the value function for all
states. We choose to transit to some next state, chosen either at random or by some look-ahead of the gain
that the transition will yield (exploration or exploitation, respectively). The reward for the executed
action is used to adjust the goodness value of the state. This procedure is repeated for different
state-to-state transitions. This constitutes the training phase.
The value function V^π for the above grid problem is found as shown in figure 2.8. An agent in the upper
middle grid knows that it should move right to reach the goal state.
Let us describe the Q value function. Q*(s, a) is the value of starting at state s, executing action a and
from there on following the optimal policy π*. That is:

Q*(s, a) = E[ r_{t+1} + λ V*(s_{t+1}) | s_t = s, a_t = a ]   (2.10)
The iterative version of equation 2.10 is:

Q*(s, a) = r(s, a) + λ Σ_{s′} P(s′ | s, a) V*(s′)   (2.11)
Figure 2.8 The value function solution for the grid world example. The number in each grid depicts the goodness value of that grid, i.e., how beneficial it is to be in that grid/state.
Figure 2.9 The Q value function solution for the grid world example. The number in each grid depicts the goodness value of being in that state and executing the action shown by the arrow.
Intuitively, if Q* is known, the optimal policy π* is known. Q* can be calculated from V*:

π*(s) = argmax_a Q*(s, a)   (2.12)

Q* for the grid example is shown in figure 2.9.
The example explained above describes how to solve a typical MDP. The Bellman equation (2.9) is used when
the model of the environment is known. For environments whose dynamics are unknown, a Temporal Difference
(TD) learning algorithm can be used. In cases like ours, where the reward and the next-state probability
distributions are not known a priori, the TD version of the value function update is used, as shown in
equation 2.13:

V(s) = V(s) + α ( R + γ V(s′) − V(s) )   (2.13)
The obvious advantage of TD methods is that they are naturally implemented in an online and fully
incremental fashion. It has been proved that, for any fixed policy π, a TD algorithm will converge to
V^π in the mean for a constant step-size parameter, if that parameter is sufficiently small.
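The TD(0) update of equation 2.13 can be sketched in tabular form as follows (the state names, reward and step sizes here are made up for illustration; in this framework the table is replaced by the CMAC):

```python
def td0_update(V, s, s_next, reward, alpha=0.5, gamma=0.9):
    """TD(0): V(s) <- V(s) + alpha * (R + gamma * V(s') - V(s))."""
    V[s] += alpha * (reward + gamma * V[s_next] - V[s])

V = {"s0": 0.0, "s1": 0.0}
# Repeatedly observe the transition s0 -> s1 with reward 1; V(s0) should
# approach the fixed point 1 + gamma * V(s1) = 1.0.
for _ in range(100):
    td0_update(V, "s0", "s1", reward=1.0)
```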
2.4 Action space
The action space for the belief-state MDP consists of 4 actions, namely Up, Down, Left and Right. The
next position to move to on executing an action depends on the frontier found in that direction. Imagine
the robot's scanned area being divided into 4 quadrants, as shown in figure 2.10.

Figure 2.10 Action space for the robot. (Axes denote the actions)

The frontier that lies in the UP-RIGHT quadrant is closer to the UP axis and is assigned to the UP action.
Similarly, the frontier that lies in the RIGHT-DOWN quadrant, being closer to the DOWN axis, is assigned
to the DOWN action. On executing an action, the corresponding frontier is reached. The convention followed
in the thesis is: 0 = UP, 1 = DOWN, 2 = LEFT and 3 = RIGHT.
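The assignment of a frontier to an action can be sketched as below (a hypothetical helper based on the frontier's offset from the robot; the axis convention is the one stated above, with UP taken as +y and RIGHT as +x):

```python
UP, DOWN, LEFT, RIGHT = 0, 1, 2, 3

def frontier_action(dx, dy):
    """Assign a frontier at offset (dx, dy) from the robot to the action
    whose axis it lies closest to (UP = +y, RIGHT = +x)."""
    if abs(dy) >= abs(dx):          # closer to the vertical axis
        return UP if dy > 0 else DOWN
    return RIGHT if dx > 0 else LEFT
```

For example, a frontier in the UP-RIGHT quadrant but nearer the vertical axis maps to UP, matching the rule in the text.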
2.5 Rewards
The belief states indicate the entropy of (uncertainty about) a robot's position. Hence, the reward is a
measure of the reduction in entropy (gain in information) on taking an action. The entropy H is calculated
by equation 2.14:

H = − Σ_{i=1}^{n} p(x_i) log_b p(x_i)   (2.14)

Let B be a belief state,

B = [ b(s_1), b(s_2), …, b(s_n) ]   (2.15)

With respect to the belief states, the entropy can be calculated as:

H_B = − Σ_{i=1}^{n} b(s_i) log_b b(s_i)   (2.16)

where H_B is the entropy indicated by the belief state B, and b is the base (e, 10 or 2).

GainInfo = H_B − H_{B′}   (2.17)

where B′ is the next belief state reached after the transition from B, so that a positive gain corresponds
to a reduction in entropy.
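The entropy and information-gain computation can be sketched directly from the definitions (function names are illustrative; natural log is used as the base):

```python
import math

def entropy(belief, base=math.e):
    """Entropy of a belief vector (as in equation 2.16); zero entries
    contribute nothing and are skipped."""
    return -sum(p * math.log(p, base) for p in belief if p > 0)

def info_gain(b, b_next):
    """Reduction in entropy from belief b to the next belief b_next:
    positive when the belief has become more peaked (less uncertain)."""
    return entropy(b) - entropy(b_next)

# A two-hypothesis belief collapsing to a single hypothesis yields
# a positive gain of ln 2.
g = info_gain([0.5, 0.5], [1.0, 0.0])
```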
2.6 Observations
The observations in a belief-state MDP play the role of limiting the possible next belief states after
executing an action, by directly determining the next belief state. The observations in our case are the
sensor readings obtained by the robot.
2.7 Framework formulation
The framework uses a hierarchical belief-state model, as described in section 2.1. A belief state is
estimated at the grid as well as the cell level. The interaction between the levels (cell and grid) is
such that the belief state update occurs at the grid level, but the rewards and the transition function
are computed at the cell level. An MDP is assumed at the grid level, with the following parameters:
MDP_Grid = ⟨ B_Grid, A, P, R, O, T ⟩   (2.18)

where s in B_Grid(s) represents a grid, and B_Grid(s) represents the probability of being in grid s.
B_Grid is the belief state representing the probability of being in each grid, i.e.,

B_Grid = [ b_Grid(s_1), b_Grid(s_2), …, b_Grid(s_n) ]   (2.19)

where b_Grid(s_1) is the probability that the robot is in grid s_1, and b_Grid(s_n) is the probability
that the robot is in grid s_n.

A = action space, consisting of the 4 basic actions ⟨ Up, Down, Left, Right ⟩
T = transition probability, i.e., Pr( B′_Grid | B_Grid, a )
Ω = set of observations obtained
R = reward
O = observation function
The transition from belief state B_Grid to B′_Grid under an action a is deterministic given an
observation o, and defines T, the transition function. It is also denoted as:

B′_Grid = T( B_Grid, a, o )   (2.20)

Experimentally, the computation of the belief state over the grids is done as shown:
Pr( Robot in grid | Readings ) = Pr( Readings | Robot in grid ) · Pr( Robot in grid ) / Σ Pr( Readings | Robot in grid ) · Pr( Robot in grid )   (2.21)

where Pr( Readings | Robot in grid ) is the probability of obtaining the readings Readings when the robot
is in a grid, say s_1, and Pr( Robot in grid | Readings ) is the probability of the robot being in grid
s_1 when the readings obtained are Readings.
If there is no prior belief about the robot's position, the prior Pr( Robot in grid ) is uniform and
cancels out. Hence equation 2.21 becomes:

Pr( Robot in grid | Readings ) = Pr( Readings | Robot in grid ) / Σ Pr( Readings | Robot in grid )   (2.22)

which is basically the probability of obtaining Readings in grid s_1, normalized over the probability of
observing Readings over all grids (Markov localization).
Hence,

B_Grid(s_1) = Pr( Readings | Robot in grid s_1 ) / Σ_{grids} Pr( Readings | Robot in grid )   (2.23)

and the belief state B_Grid is a vector of beliefs for all grids, as shown in equation 2.19.
The belief state at the grid level makes an MDP intractable due to the extensiveness of the belief state
space. As described in section 2.1 (Markov localization), a hierarchical cell level is constructed over
the fine grid structure to make the belief states manageable.
A belief-state MDP over the cells is described as:

MDP_Cell = ⟨ B_Cell, A, P, R, O, T ⟩   (2.24)

where s in B_Cell(s) represents a cell and B_Cell(s) represents the probability of being in cell s. Markov
localization results in the formation of clusters of grids that represent the robot's various virtual
positions, or hypotheses. A cluster's mean and probability form the position and probability of the
corresponding hypothesis. Since the virtual position denoted by the cluster centroid is already
calculated, and so is the probability of that virtual position, the belief state at the cell level is a
mere renormalization of the probabilities at the grid level:

B_Cell(s_i) = ( probability assigned to the centroids lying in cell s_i ) / ( Σ_{s_i} probability assigned to the centroids lying in cell s_i )   (2.25)

B_Cell = [ B_Cell(s_1), B_Cell(s_2), …, B_Cell(s_n) ]   (2.26)
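The renormalization of equation 2.25 can be sketched as follows (the data structure is hypothetical: each hypothesis is a centroid probability tagged with the cell it falls in):

```python
def cell_belief(hypotheses):
    """hypotheses: list of (cell_id, probability) pairs, one per cluster
    centroid. Returns the cell-level belief: the probability mass falling
    in each cell, renormalized over all cells (equation-2.25 style)."""
    mass = {}
    for cell, p in hypotheses:
        mass[cell] = mass.get(cell, 0.0) + p
    total = sum(mass.values())
    return {cell: m / total for cell, m in mass.items()}

# Two centroids fall in cell c1, one in cell c2.
B_cell = cell_belief([("c1", 0.2), ("c1", 0.1), ("c2", 0.3)])
```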
The next belief state update on an action execution is done at the grid level again:

B′_Grid(s′) = O( a, s′, o ) Σ_s B_Grid(s) Pr( s′ | s, a ) / Pr( o | B_Grid, a )   (2.27)

and

Pr( o | B_Grid, a ) = Σ_{s′} O( a, s′, o ) Σ_{s∈S} B_Grid(s) · Pr( s′ | s, a )   (2.28)
Equation 2.28 is derived as shown below. For the belief update in a grid-like framework, let Bel(s_{t−1})
be the belief state at time t−1, s_{t−1} the state at time t−1, and s_t the state at time t. Then:

Pr( o | Bel(s_{t−1}), a ) = Σ_{s_t} Pr( o | s_t, Bel(s_{t−1}), a ) · Pr( s_t | Bel(s_{t−1}), a )
                          = Σ_{s_t} Pr( o | s_t ) · Pr( s_t | Bel(s_{t−1}), a )   (2.29)

Now, Pr( s_t | Bel(s_{t−1}), a ) is:

Pr( s_t | Bel(s_{t−1}), a ) = Σ_{s_{t−1}} Pr( s_t | s_{t−1}, Bel(s_{t−1}), a ) · Pr( s_{t−1} | Bel(s_{t−1}) )
                            = Σ_{s_{t−1}} Pr( s_t | s_{t−1}, a ) · Pr( s_{t−1} )   (2.30)

From derivations 2.29 and 2.30,

Pr( o | Bel(s_{t−1}), a ) = Σ_{s_t} Pr( o | s_t ) · Σ_{s_{t−1}} Pr( s_t | s_{t−1}, a ) · Pr( s_{t−1} )   (2.31)

which corresponds to equation 2.28 above.
After the next belief state update, the belief states at the cell level are computed using equation 2.25.
To express the transition probability in terms of the observation function, we use the conditional
probability axiom:

Pr( A | B, C ) = Σ_D Pr( A | B, C, D ) · Pr( D | B, C )   (2.32)
The transition probability at the cell level is given as:

Pr( B′_Cell | B_Cell, a ) = Σ_{o∈Ω} Pr( B′_Cell | B_Cell, a, o ) · Pr( o | B_Cell, a )   (2.33)

where o ranges over the observations obtainable when in belief state B′_Cell. The transition function is
formulated as:

Pr( B′_Cell | B_Cell, a, o ) = 1   if B′_Cell = f( B_Cell, a, o )   (2.34)
                             = 0   otherwise   (2.35)
Hence equation 2.33 reduces to:

Pr( B′_Cell | B_Cell, a ) = Σ_{o : B′_Cell = f(B_Cell, a, o)} Pr( o | B_Cell, a )   (2.36)

where the function f( B_Cell, a, o ) is the next-state estimation function. The state estimation is done
on two levels. At the grid level, Markov localization is used for state estimation: it matches the
acquired readings with the readings that would be obtained from each grid. Hence the next state is
estimated from equations 2.27 and 2.28. The next belief state B′_Grid is a vector of all such b′(s′),
which is basically what is done in equation 2.22. Once the grid-level beliefs are updated, the cell-level
belief state is obtained using equation 2.25.
The rewards are calculated at the cell level, using the cell-level belief state of equation 2.25, as:

ρ( B_Cell, a ) = Σ_{s∈S} b(s) · R( B_Cell(s), a )   (2.37)

where B_Cell(s) is the probability of being in cell s, and R( B_Cell(s), a ) is the reward obtained by
starting in cell s and executing the action a. R( B_Cell(s), a ) can be computed by finding the
information gain (equations 2.16 and 2.17), where the initial belief state is B_Cell(s) and the final
belief state is the one reached from cell s by executing the action a. An example for the same is shown
in section 2.
The transition function Pr( B′_Cell | B_Cell, a ) is also calculated at the cell level, as a product of
the probability of being in B_Cell(s) and executing an action a.
Given the above equations, the reward (entropy gain) calculated in equation 2.17, and the transition
probability, the Q-function equation is:

Q( B_Cell, a ) = ρ( B_Cell, a ) + λ Σ_{s′} Pr( B_Cell(s′) | B_Cell(s), a ) · V( B_Cell(s′) )   (2.38)
The action selected is the one that maximizes equation 2.38:

a = argmax_a Q( B_Cell, a )   (2.39)

The final value function update using a TD learning algorithm is given as:

V( B_Cell ) = V( B_Cell ) + α [ R( B_Cell, a ) + λ V( B′_Cell ) − V( B_Cell ) ]   (2.40)
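The greedy action selection and the TD value update can be sketched together as below (scalar stand-ins for the Q values and value estimates; in the framework the value estimate comes from the CMAC rather than plain numbers):

```python
def select_action(q_values):
    """Greedy action selection (equation-2.39 style): pick the index of
    the highest Q value among UP(0), DOWN(1), LEFT(2), RIGHT(3)."""
    return max(range(len(q_values)), key=lambda a: q_values[a])

def value_update(v_b, v_b_next, reward, alpha=0.1, lam=0.9):
    """TD update of the cell-level value estimate (equation-2.40 style)."""
    return v_b + alpha * (reward + lam * v_b_next - v_b)

a = select_action([0.2, 1.5, -0.3, 0.9])   # DOWN has the highest Q value
v = value_update(0.0, 1.0, reward=0.5)
```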
Chapter 3
Multi-robot localization
3.1 Problem statement
Given a map and a workspace consisting of n robots, each of which has multiple hypotheses about its
position, devise a learning policy such that the localization of these robots is accelerated by taking
advantage of unique map features and robot-robot detections.
3.2 Related Work
In general, work on active localization has been limited when compared to passive localization. Pioneering
work was done in [8] and [9]. In [8], a method of active localization based on maximum information gain
was presented. Dudek et al. [9] presented a method for minimum-distance traversal for localization in
polygonal environments without holes, a problem they showed to be NP-hard. In [10], the environment's
ambiguity in terms of observations and the robot's ambiguity in pose are modelled as a POMDP, and an
optimal strategy to move to the target location is devised. In [11], this method is extended by estimating
actions which allow the robot to improve its position estimate.
All the above methods tackle the problem for a single robot. [12] devises a Bayesian approach to enable
the robots to calculate the relative poses of every other robot. Since it only estimates relative poses,
it is unable to perform global localization. [13] uses Partially Observable Stochastic Games to solve a
multi-agent co-operative problem in a fully distributed fashion. Since it is decentralized, the cross
product of the action and observation spaces is considered, which increases the complexity exponentially
with the number of robots. In [14] the problem of large joint action and state spaces is mitigated by
using a multi-agent MDP that reduces the search space for an optimal action set by considering just the
upper and lower bounds of the single-agent MDPs. But this method requires isolated training of each agent
separately before undertaking the task as a team. The method presented here enables all the agents to be
trained simultaneously, leading to faster convergence. The only known optimal solution to the above
problem has been provided by [4] and [15]. This work is a learning extension of them, the improvement
being its semi-distributed nature as opposed to the completely centralized nature of the former. Since the
agents are loosely coupled, they take advantage of robot-robot detections if applicable, and continue
trying to localize with the help of the environment if not. This also brings in a certain level of
autonomy with respect to the number of agents.
3.3 Problem description
In highly symmetric environments (both indoors and outdoors), like cubicles in an office (figure 3.1),
trees in open spaces (figure 3.2) or warehouses (figure 3.3), the sensor readings for one position are
replicated at various positions in the map. This results in multiple hypotheses for a robot's pose.

Figure 3.1 Cubicles in an office

Figure 3.2 Trees in an open space

As explained in chapter 2, the belief state is estimated and updated at the grid level and aggregated at
the cell level. This chapter concentrates on the algorithm at the cell-level MDP. The belief states at the
cell level constitute the state space of the MDP.
Consider a workspace populated by robots, each of them having multiple location hypotheses. A probability
distribution over all location hypotheses of a robot constitutes a belief state for that robot. The
problem is to move these robots to locations such that each robot localizes in an optimal fashion.
Figure 3.3 Scene of a warehouse

Frontiers [5] are boundaries separating the seen and the unseen. The frontiers are considered good places
to move to, since they are easy to compute and provide a sufficient set of places to move to in order to
converge to a single hypothesis. The orientation is considered to be known through a compass installed on
the robot. The robots can detect each other with the help of an IR transceiver, which provides the angle
and distance to the detected robot. This experimental setup has been tested and can safely be assumed to
work quite robustly.
The action space consists of actions that lead to the different frontiers visible from a position. The
reward is calculated intuitively on the basis of the information gain on the transition from one belief
state to another. This aims at reducing entropy. To avoid enumerating all possible belief states, a
CMAC [6] is used to implement the value function. It in turn acts as a function approximator for the
belief states which do not get visited during training.
3.4 Algorithm
The algorithm is semi-distributed in nature. Most multi-agent learning algorithms suffer from high
computational complexity because of the joint state and action spaces of the robots, which grow
exponentially with the number of robots. The robots in this framework are loosely coupled to each other.
They update the same value function, but their information about other robots in the environment is
limited to robot-robot detections. Thereby, the robots do not need to know the positions of all other
robots in the map, nor the actions they take. In this way, the method provides an effective solution for
learning from the experiences of the other robots while avoiding the large state and action spaces at the
same time. This feature of the algorithm also captures the robot-robot interaction effectively, hence
driving the robots to move to frontiers where the probability of finding other robots is maximal.
The pseudocode for the algorithm is provided below. The notation used is:

b(sn) → the nth element, sn, of the belief state vector b
Rmap → the reward due to map information
b′ → the next belief state (b-prime)
V(b) → the goodness value of the belief state b
Rmap+(rr)detection → the reward due to map information plus the information gain due to robot-robot detection
λ → what part of history should affect the present action selection
γ → discount factor: how valuable future reward is compared to present reward, where 0 ≤ γ ≤ 1
α → constant step-size parameter
1. Make observations, compute the belief state b.
2. Calculate the projected gain that each action a would yield:

Q(b, a) = Σ_{n∈N} b(sn) · ( Rmap + λ · V(b′) )   (3.1)

3. Choose the action a which has the highest Q value.
4. Execute a.
5. If another robot is found:
(a) Update the belief state.
(b) Calculate the reward Rmap+(rr)detection.
6. Update V(b):

V(b) = V(b) + α · ( Rmap+(rr)detection + λ · V(b′) − V(b) )   (3.2)

7. Repeat until the robot is localized.
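The steps above can be sketched as a single robot's loop. Everything here is schematic: `sense`, `candidate_moves`, `detect_robots`, `localized` and the stub robot are hypothetical stand-ins for the components described in this chapter, and the shared value function `V` is a plain dict keyed on a discretized belief rather than a CMAC.

```python
def key(belief):
    """Hypothetical discretization of a belief vector into a lookup key."""
    return tuple(round(p, 2) for p in belief)

def localize(robot, V, alpha=0.1, lam=0.9, max_steps=100):
    """Schematic single-robot loop for steps 1-7 of the algorithm."""
    b = robot.sense()                                        # step 1
    for _ in range(max_steps):
        moves = robot.candidate_moves(b)                     # frontiers
        # Step 2: projected gain Q(b, a) = R_map + lambda * V(b').
        q = {a: r + lam * V.get(key(nb), 0.0) for a, (nb, r) in moves.items()}
        a = max(q, key=q.get)                                # step 3
        b_next, reward = moves[a]                            # step 4
        b_next, reward = robot.detect_robots(b_next, reward) # step 5
        k = key(b)                                           # step 6: TD update
        V[k] = V.get(k, 0.0) + alpha * (
            reward + lam * V.get(key(b_next), 0.0) - V.get(k, 0.0))
        b = b_next
        if robot.localized(b):                               # step 7
            return True
    return False

class StubRobot:
    """Toy two-hypothesis robot: moving 'south' always localizes it."""
    def sense(self):               return [0.5, 0.5]
    def candidate_moves(self, b):  return {"south": ([1.0, 0.0], 0.7),
                                           "north": ([0.5, 0.5], 0.0)}
    def detect_robots(self, b, r): return b, r
    def localized(self, b):        return max(b) > 0.95

V = {}
done = localize(StubRobot(), V)
```

In the multi-robot setting, each robot runs this loop with its own belief while all of them share the same `V`, which is what step 6 jointly updates.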
A set of robots is set up in the environment. The above algorithm is executed on all the robots
independently, except for step 6. The robots start by sensing their surroundings and calculating the
belief state, as explained in the previous chapter. The robots then perform a one-step look-ahead by
calculating the action which would yield the maximum information gain. Since it is a look-ahead, the only
information gain available is due to the map features, hence the reward Rmap. The robots execute the
action which seems most beneficial (step 3). Now the robot looks for any possibilities of robot-robot
detections. If any robot is found in the vicinity, the angle and distance from that robot are supplied.
With this information, the robot tests all its hypotheses and checks which of them are still feasible
(for example, if one of the hypotheses suggests that the robot is detected beyond a wall, discard it).
The new belief state is updated and the reward calculated. Since this reward is due to both the map
features (from step 2) and the presence of other robots, it is termed Rmap+(rr)detection. This reward is
jointly fed back by all robots to the CMAC. The value function Vπ(b) is estimated using the TD(0)
(temporal difference with one-step look-ahead) method, as shown in equation 3.2. This compares the
predicted goodness value for the next state (one step ahead in time) with the present estimate of the
value function Vπ and adjusts it accordingly. The value function estimator, the CMAC, is shared by the
robots, i.e., all robots update the same copy of the CMAC. This leads to faster convergence, as for the
same iteration the training set is multiplied by the number of robots (the belief states are independent
of the robots they belong to). The above set of steps is repeated until all robots get localized. The
training runs consist of various initial configurations for varied numbers of robots.
To avoid getting stuck in local minima, at step 3 the ε-greedy exploration strategy is adopted, with a
slight modification. One could use strategies like softmax action selection, but ε-greedy is used for
computational convenience. The value of ε is high in the initial training iterations but is decreased as
the training proceeds. This is done to avoid unlearning: since a higher frequency of low rewards can
outweigh a low frequency of high rewards, it is ensured that random actions are not executed too
frequently.
The state value function V(b) is estimated instead of the state-action value function Q(b, a). This has
twofold benefits. First, it reduces the computational complexity, since Q(b, a) would require maintaining
multiple CMACs, one for every kind of action. Second, by estimating V(b), one can assimilate the
experience and capture the interaction of the multiple robots effectively. So, in a way, the method tries
to estimate a goodness value for every belief state, guiding the robots towards belief states with higher
goodness values.
The immediate feedback of a (state, action) pair is incorporated with the help of the one-step look-ahead
in step 2 of the algorithm.
3.5 Simulation Snapshots
Different stages of the algorithm are shown in the snapshots. 3.4 shows the initial set of hypotheses
of the robot obtained after the initial scan. Each green cluster denote a virtual position or hypothesis. 3.5
shows the sensor scan from robot’s each virtual position. The area in green shows theseenarea within
the sensor range. The boundaries in the red are the frontiers, which seperate theseenfrom theunseen.
The next position to move to, is chosen from the frontier. 3.6 shows the actionwith the highest Q value
being executed. All hypotheses are moved by the distance with respect to the frontier corresponding to
the selected action. The yellow point denotes the actual position of the robot.On executing the action,
the robot detects other robots and is able to make his belief state unimodal. The robot is hence localized.
3.6 Experiments and Results
This section describes the different kinds of tests performed and their corresponding results and
observations.
Figure 3.4 Initial hypotheses of the robot
Figure 3.5 Sensor scan obtained from each virtual position
3.6.1 On different kinds of maps
The method was tested on different kinds of maps to verify the proof of concept. These maps were designed
in order to test the algorithm in practical situations. The learning patterns for different kinds of maps
are shown in figures 3.8 and 3.9.
In figure 3.8, if the robots are present in section A, which is in the top left half of the map,
intuitively the best action to take is south, since it leads to the unique features of the map that would
help localization in the absence of the other robots. On the other hand, if all robots are guided south,
the probability of robot-robot detections is also high. The algorithm learns to choose the action south if
the robot is in any of the cubicles in map sub-section A.
In section B of figure 3.8, which has no unique features, the algorithm learns to move the robots in such
a direction (shown in the figure) that there are more robot-robot detections. In cases where there were
not enough robots in the opposite row of cubicles, the robots learn to move towards the sides.
Figure 3.6 Execute the decided action.
Figure 3.7 Robot localized (only one hypothesis left)
In section C of figure 3.8, the robots clearly learn to move towards the obstacles. This is due to the
fact that there is a unique feature outside every cubicle, which surpasses the gain due to robot-robot
detections.
In section D of figure 3.8, the learning pattern is the most intuitive. The robots first move towards the
obstacles, i.e., north for the upper row of cubicles and south for the lower row. This is because of the
huge gain yielded by the action towards the obstacles when the robot was in one of the corner cubicles and
got instantly localized. If the robot still remains unlocalized, it moves towards the space between two
cubicles in the hope of finding other robots. If it still remains unlocalized, it moves sideways to get
localized.
The above pattern was verified by a test case consisting of 7 robots, one in every row of cubicles. The
corresponding pattern for every section stated above was observed.
Figure 3.8 Map A (arrows show the direction preferred by the robots)

As can be seen in the map shown in figure 3.9, localization is challenging since there are almost no
unique map features to aid it. Hence it serves as an ideal platform to test the algorithm. The robots
learn to move in the directions shown by the arrows. The space between the cubicles is where the
probability of robot-robot detections is highest. When there are no other robots, the frontiers on the
sides are selected, which lead the robots to the end of the row, where they get localized with the help of
the wall. A similar pattern has been observed in Map A (figure 3.8). Since the learning pattern is
consistent over different kinds of maps, this suggests that the method would work on general maps.
3.6.2 With different numbers of robots
The second test performed was to show the flexibility of the framework to infer the results for different
numbers of robots without re-training. The framework, if trained for an appropriate number of robots,
can be used for a varied range of numbers of robots without re-training each time. To prove this, Map
B (Figure 3.9) was trained with test cases varying from 20 robots (one in each cubicle) down to 1 robot. The
objective is to train the framework to look out for map features when there are no robots around to aid
localization, and to move to locations of higher robot-robot detections as a default measure. The former is
achieved by training it extensively with a small number (1) of robots and the latter by training with a large
number of robots (20). The testing is done with a range of numbers of robots, and the average number of
iterations required to localize a robot is recorded. Since the number of iterations depends a lot on the
initial position of the robot (e.g., w.r.t. Map B, the best case for 2 robots is when they are in corner
and opposite cubicles), the worst and best cases for a certain number of robots were considered and their
average was calculated.
Figure 3.9 Map B (arrows show the direction preferred by the robots)

The trend is plotted in the graph (Figure 3.10). The graph shows that on average the number of
iterations required to localize a robot decreases with the increase in the number of robots. This shows
that the framework learns to take advantage of the presence of other robots to improve its belief. On
the other hand, it does not get handicapped in the absence of other robots and uses the map features
available to localize. This, as a matter of course, also proves that the framework need not be re-trained
every time the number of robots (that need to be localized) changes. The training sample space should
consist of test cases with appropriate numbers of robots, after which the framework can handle the entire
range of robots in between.
Figure 3.10 Trend showing the average number of localization iterations vs. number of robots
3.6.3 With different cell sizes
The localization performance was analyzed by changing the cell sizes of the framework. Intuitively,
it can be inferred that a lot of parameters depend on the cell size, e.g., the support of the belief state,
the reward computation, the sensitivity of V(b) to the negative feedback, etc. Table 3.1 shows the effect
of cell size on the framework parameters for the robot position shown in figure 3.11, for one of the iterations.
The robot in this position has two hypotheses (H1 and H2), each of which has two frontiers (F11, F12
and F21, F22) as shown in Figure 3.12. The reward is calculated for the transition to every frontier by
calculating the entropy difference between the present belief state and the next belief state (when at the
frontier, i.e., b′). Similarly, the field V(b′) in table 3.1 shows how good this next belief state is. Table 3.2
shows the final Q-values calculated using equation 2.38 for the actions, and the action finally selected.
Figure 3.11 The position of the robot referred to in table 3.1
As seen in table 3.1, for the same position but different cell sizes, the rewards and V(b′) change.
This in turn changes the Q-values corresponding to the actions, as shown in table 3.2. This affects the action
selected by the robot at this iteration. On a collective basis, the actions selected by all robots may or may
not change, and they might result in the robots detecting or not detecting each other. Since robot-robot detections
play a crucial role in active localization, the localization performance is directly affected by the
change in cell sizes.
Having confirmed that localization performance is affected by the change in cell size, the next step is to
analyze the relation between the two. The localization performance (PL) is defined as the reciprocal of the
average number of iterations required to localize one robot. The localization performance with respect
to cell sizes is shown for the two maps in figures 3.13 and 3.14. The localization performance is also
compared with the corresponding random walk performance. Random walk is implemented by taking
random actions at every belief state and ignoring the feedback rewards. The action selection in a random
walk does not depend on the rewards, and hence not on the cell size, but the next belief state estimation due to the
Position: (225,150)

Cell Size (width × length)   Frontier   Reward   V(b′)
87 × 85                      F11        0.99     1.44
                             F21        0.99     1.42
                             F12        0.024    1.47
                             F22        0.048    1.46
116 × 102                    F11        0.99     0.766
                             F21        0.99     0.777
                             F12        0.024    0.825
                             F22        0.048    0.75
139 × 127                    F11        0.99     0.804
                             F21        0.99     0.784
                             F12        0.99     0.804
                             F22        0.048    0.788
174 × 170                    F11        0        0.407
                             F21        0        0.383
                             F12        0        0.417
                             F22        0        0.412
232 × 203                    F11        0        0.242
                             F21        0        0.331
                             F12        0        0.265
                             F22        0        0.253

Table 3.1 For robot position (225,150) as shown in 3.12, rewards and V(b′) with respect to the cell size.
Position: (225,150)

Cell Size (width × length)   Q-values                 Action selected
87 × 85                      (0, 2.27, 1.355, 0)      1
116 × 102                    (0, 1.68, 0.745, 0)      1
139 × 127                    (0, 1.71, 1.235, 0)      1
174 × 170                    (0, 0.355, 0.373, 0)     2
232 × 203                    (0, 0.258, 0.233, 0)     1

Table 3.2 For robot position (225,150) as shown in 3.11, Q-values and action selected with respect to the cell size.
Figure 3.12 Hypotheses and frontiers of the robot referred to in table 3.1
Cell Size (width × length)   CMAC Tilings   PL (Policy learning)   PL (Random Walk)
87 × 85                      128            0.429                  0.24
116 × 102                    64             0.375                  0.273
139 × 127                    64             0.273                  0.188
174 × 170                    32             0.33                   0.316
232 × 203                    16             0.375                  0.353

Table 3.3 Localization performance with respect to change in cell sizes for Map A as shown in figure 3.11
execution of an action depends on the cell size. The corresponding results are tabulated in tables 3.3 and 3.4
respectively.
PL = 1 / Iterations_localization (3.3)
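As a small sketch of how the tabulated values are obtained (the function and argument names are illustrative, not from the thesis), equation 3.3 combined with the best/worst-case averaging described in section 3.6.2 gives:

```python
def localization_performance(best_case_iters, worst_case_iters):
    """Localization performance P_L (equation 3.3): the reciprocal of the
    average number of iterations required to localize one robot, where the
    average is taken over the best- and worst-case starting positions."""
    avg_iters = (best_case_iters + worst_case_iters) / 2.0
    return 1.0 / avg_iters
```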
The general trend for both maps indicates that, for a given map, the localization performance
reaches its optimum at a certain cell size, and the cell size at which the performance peaks is generally
among the smaller ones. Ostensibly, the reason for this is that a finer grid captures the
probability distribution more accurately than a coarser one and hence learns better.
Figure 3.13 Localization performance with respect to cell sizes
Cell Size (width × length)   CMAC Tilings   PL (Policy learning)   PL (Random Walk)
87 × 85                      128            0.333                  0.24
116 × 102                    64             0.375                  0.261
139 × 127                    64             0.35                   0.269
174 × 170                    32             0.259                  0.25
232 × 203                    16             0.241                  0.175

Table 3.4 Localization performance with respect to change in cell sizes for Map B
3.6.4 With different kinds of clustering
Since the framework concentrates on using hierarchical belief states, the method of clustering is
significant. The clustering should be such that the coarser (cell-level) belief state represents the
information in the finer (grid-level) belief state with minimum loss of information. The method of clustering
discussed in chapter 2 involves finding the centroid for a cluster of grids, calculating the weighted
probability for the centroid and assigning it to the cell the centroid lies in (refer to equation 2.25). Let us
call it clustering method A by convention.
This method, though computationally desirable, might lead to erroneous cell assignment, since the centroid
will always be polarised towards the part of the cluster with grids of higher probabilities, and the
information contained in the grids with lower probabilities is lost completely. It might make more sense
to assign the aggregate probabilities of the grids which lie in a certain cell rather than trying to constrain
Figure 3.14 Trend showing localization performance vs. cell size (Map B)
the whole cluster information to just one point (centroid). Hence, for this clustering method, the cell
belief state estimation (i.e., equation 2.25) becomes:

BCell(si) = Σ_{Grids ∈ si} (probability of the grid) / Σ_{∀ Grids} (probability of the grid) (3.4)

BCell = [BCell(s1), BCell(s2), ..., BCell(sn)] (3.5)
Let this be clustering method B.
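A minimal sketch of clustering method B, assuming grid-level probabilities in a NumPy array and a precomputed grid-to-cell assignment (all names here are illustrative, not from the thesis implementation):

```python
import numpy as np

def cluster_method_b(grid_belief, cell_of_grid, n_cells):
    """Clustering method B (equation 3.4): each cell's belief is the sum of
    the probabilities of all grids lying in that cell, normalized by the
    total probability over all grids. No centroid is computed, so even
    low-probability grids still contribute to their cell."""
    cell_belief = np.zeros(n_cells)
    for p, cell in zip(grid_belief, cell_of_grid):
        cell_belief[cell] += p
    return cell_belief / grid_belief.sum()
```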
It can be deduced that this type of clustering will increase the support of the belief states, thereby
improving the navigation policy. The localization performance for a given map (Map A) using clustering
method B was compared with that of clustering method A for different cell sizes.
3.6.5 Comparison with centralized method
This method, being semi-distributed, is compared with [4] and [15] for complexity and
performance. It was found that [4], being a centralized method, gives an optimal performance
but suffers the disadvantages of any centralized method in terms of complexity. Figure 3.16
shows the comparison in performance with the aforementioned centralized method. As can be inferred
from the figure, the centralized method gives an optimal performance, whereas this method, being
semi-distributed, is able to give a sub-optimal performance.
Cell Size (width × length)   CMAC Tilings   PL (Clustering method A)   PL (Clustering method B)
87 × 85                      128            0.43                       0.46
116 × 102                    64             0.38                       0.38
139 × 127                    64             0.273                      0.33
174 × 170                    32             0.33                       0.375
232 × 203                    16             0.375                      0.375

Table 3.5 Localization performance with respect to change in cell sizes and clustering methods (Map A)
Figure 3.15 Localization performance vs. clustering methods for various cell sizes (Map A)
On the other hand, the complexity analysis showed that the semi-distributed method is a more viable
option compared to the centralized one, since the latter is exponential with respect to the number of
robots, whereas the semi-distributed method is polynomial with respect to the number of cells, the
number of robots and the number of robots detected.
3.7 Observations
This section describes some observations made during the experiments, along with some general remarks
about the framework and the problem tackled.
Figure 3.16 Number of iterations to localize per robot, compared for semi-distributed and centralized methods
3.7.1 Self-balancing property of the framework
An interesting observation was made regarding the rewards that were used as feedback for the
CMAC. The reward is calculated as the information gain, which is basically the difference in entropies.
For actions which yield no increase in information gain, the difference in entropies is a small negative
number instead of zero. According to the property of Markov localization, which assumes that there are
no negative information gains, the negative rewards should be set to zero before being fed back to the CMAC.
But if this quantity is fed back in its pristine form, it acts as negative feedback that prevents the robots
from getting stuck in a prolonged and recursive loop of actions not leading to any progress.
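A sketch of the reward computation described above, assuming belief states stored as probability arrays (the helper names are illustrative):

```python
import numpy as np

def entropy(belief):
    """Shannon entropy of a belief state; zero-probability cells are skipped."""
    p = np.asarray(belief, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

def information_gain_reward(b, b_next):
    """Reward as information gain: entropy(b) - entropy(b').
    When an action yields no gain, this is a small negative number; it is
    fed back unclipped, so loops of unproductive actions get penalized."""
    return entropy(b) - entropy(b_next)
```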
One such situation has been shown in figure 3.17. The robot starts from state S4, executes action 0
and transits to state S5. The robot then executes action 1 and ends up in state S4. It remains stuck
in this loop for 4 iterations, until action 3 is selected and the robot transits to S6 and localizes. The
corresponding Q(a) values have been plotted in figure 3.18. With reference to the graph in Figure 3.18,
at time instant 1, the robot is in state S4 and Q(0) is greater than Q(3). Hence the robot selects action
0 and transits to state S5. At time instant 2, the robot is in state S5 and executes action 1 to reach
state S4 again. At time instant 3, the difference in the Q-values of actions 0 and 3 decreases, but Q(0)
is still greater than Q(3). Hence action 0 is selected again. At time instant 5, the robot is in state S4 and
Q(3) finally surpasses Q(0). Action 3 is executed and the robot reaches state S6, where it localizes. This
happens because of the negative feedback that gets added during the transitions between S4 and S5.
Figure 3.17 State transitions (red numbers above the arrows indicate the action taken)
Figure 3.18 Variation in the Q(a) behavior with respect to the state transition
Chapter 4
Transfer Learning
4.1 Problem statement
Given two robots with different sensor ranges, one of the robots learns a policy to perform active
localization in a given map. The policy should be transferred to the other robot in spite of the different
sensor ranges, such that the latter does not have to learn to localize in the map from scratch.
4.2 Related Work
Pioneering work in active localization has been done by [8]. Fox et al. used an entropy minimization
methodology to reduce the uncertainty of a robot's pose in a grid-based map. An optimal method for
co-ordinated active localization for multiple robots was proposed by [4] and [15]. A POMDP-based
framework was suggested by [18] for mobile robot navigation in an indoor environment. A belief-state
MDP was used by [19] to perform co-operative active localization for multiple robots. The idea of
transferring the knowledge for an active localization task can be toyed with. Its immediate advantage would
be that if the target robot is identical to the source robot, re-training from scratch can be avoided for
a given map. The other useful application that can be thought of is transferring such knowledge across
heterogeneous robots, with different sensor ranges, say.
The earliest motivation behind transfer learning was enabling the robot to learn indefinitely. This idea was
discussed in detail in [20] by Thrun and Mitchell. They show the transfer of knowledge across similar
kinds of tasks, assuming that the robot and its sensor capabilities remain the same, while the environment
and goals might slightly alter across tasks. [21] used the same method of EBNN (Explanation-Based
Neural Networks) to recognize new objects in an already familiar environment. A lot of transfer
learning has been used in robot navigation. [22] describe relational policies for transfer in spatial
environments and demonstrate significant learning improvements. [23] introduced skills, which collapse a
sequence of actions into one single operation and hence find reusable structural information in the map
for navigation. Transfer learning across dissimilar state and action spaces (domains) finds application
in robot-soccer-like applications. [24] present an algorithm that summarizes a learned policy
into rules that are leveraged to achieve faster learning in a target task in another domain. Similar work
to transfer learned policies across dissimilar state and action spaces was presented in [25]. This
concentrated on transferring policies from one domain to another via neural-network-based action selectors.
[26] used transfer learning in WILP (WiFi-based Indoor Localization Problem) for regression modelling
using Hidden Markov Models.
The work in this part of the thesis is novel in multiple respects. Firstly, this is the only work
done on studying transfer across different observation functions. The above-cited works have considered
transfer across different combinations of similar and dissimilar action and state spaces, but with the
assumption that the agents involved are identical in nature. Secondly, this is the only work that talks
about using transfer learning in the active localization context. Thirdly, this is the first time a transfer is
attempted across belief-state MDPs. The latter is attested by the survey paper [27].
4.3 Problem description
Consider a scenario where a system of autonomous robots needs to perform a navigation-based task
in a given map. The system is a heterogeneous mixture consisting of some robots with a laser range
finder and others with an array of sonar sensors. The basic difference between an LRF (Laser Range
Finder) and a sonar is the sensing range, as shown in figure 4.1. There are other differences too, but
only differences with respect to the sensing range are considered. A navigation-based task requires a
robot to know its global pose in the map. In such a scenario, instead of localizing each robot separately,
it will be more efficient to localize one of them and transfer the knowledge of how to do it to the
rest. This transfer should be consistent across the change in sensor ranges. As seen in figure 4.1,
the same environment will be perceived differently due to the change in the sensing range. A transfer
should enable the other robots to avoid re-training from scratch for a given map. Hence,
the objective is to transfer the knowledge about the task of active localization (where should one move
to localize faster, as described in chapters 1 and 3) across robots with different sensor ranges. The
following sections present a preliminary analysis and some results for transfer learning across belief-state
MDPs.
4.4 Background on Transfer learning
The core idea of transfer is that experience gained in learning to perform one task can help improve
learning performance in a related, but different, task. The insight behind transfer learning (TL) is that
generalization may occur not only within tasks, but also across tasks. Transfer techniques assume
varying degrees of autonomy and make many different assumptions. To be fully autonomous, an RL
transfer agent would have to perform all of the following steps:

1. Given a target task, select an appropriate source task or set of tasks from which to transfer.

2. Learn how the source task(s) and the target task are related.

3. Effectively transfer knowledge from the source task to the target task.

Currently, no TL methods are capable of robustly accomplishing all three goals.

Figure 4.1 Difference in sensing ranges of Sonar and Laser Range Finders
For every kind of transfer, one requires to know the following parameters or dimensions:

1. Task difference assumption: Which parameters are similar and which differ between the source
and the target tasks. Examples of things that might differ include the system dynamics (transition
function), or different sets of states and actions, etc.

2. Is there one source task or many? Is the source task(s) for transfer already selected or does it need to
be selected? This is mainly a human-intervened task and should be done as accurately as possible
to prevent negative transfer.

3. Task mappings: Knowing how a source and target task are related. Mappings can be manually
provided by a human or found computationally.

4. Transferred knowledge: What kind of knowledge is transferred? Is it the entire policy or a
partial policy? Or the rewards? Or a set of actions? Different kinds of knowledge may transfer
better or worse depending on task similarity.

5. Does the TL method place restrictions on which RL algorithm can be used? Some TL methods
require that the source and target be learned with the same method. Others allow a class of
methods to be used in both tasks.
TL methods are particularly relevant in MDPs that have a large or continuous state, as these are the
problems which are slow to learn tabula rasa and for which transfer may provide substantial benefits.
[27] gives a detailed account of different kinds of dissimilarities between the source and target tasks
(e.g., same state-action space but different transition functions) and transfer techniques that can be
used for the aforementioned cases. The particular subset of problems that we will concentrate on is one
source task to one target task, with the same state-action space, since the transfer scenario in our problem
falls into that category. The terms tasks and MDPs will be used interchangeably in this chapter.

Allowed Task Differences
a    action set may differ
p    problem space may differ (agent space must be identical)
r    reward function may differ
si   the start state may change
sf   goal state may move
t    transition function may differ
v    state variables may differ
#    number of objects in state may differ

Table 4.1 Key for Allowed Task Differences, with reference to table 4.6
In one of the earliest TL works for RL, Selfridge et al. [28] demonstrated that it was faster to learn to
balance a pole on a cart by changing the task's transition function, T, over time. The learner was first
trained on a long and light pole. Once it successfully learned to balance the pole, the task was made
harder: the pole was shortened and made heavier. The total time spent training on a sequence of tasks
and reusing the learned function approximator (Q-values) was less than the time spent training on the hardest task
directly. [29] uses the same technique of incremental learning but achieves it by moving the initial state
further and further away from the goal for a maze-solving-like problem.
Transfer between tasks in which only the reward function can differ is considered by [30], [33] and
[31]. The method of [30] successfully transfers a locally weighted regression model of the transition function,
which is learned in a source task, by directly applying it to a target task. [33] uses a technique called
compositional learning and breaks down a single task into a series of temporally separated smaller tasks.
[31] proposed a model of Hierarchical Bounded Parameter SMDPs, so that a partial policy can be
transferred from the source task to the target task.
Other works performing transfer over tasks with combinations of different reward functions, transition
functions and states are listed in table 4.6, cited from [27]. The details of the cited works can be found
in the Bibliography. Many metrics to measure the benefits of transfer are possible:

1. Jumpstart: The initial performance of an agent in a target task may be improved by transfer
from a source task.

2. Asymptotic performance: The final learned performance of an agent in the target task may be
improved via transfer.

3. Total reward: The total reward accumulated by an agent (i.e., area under the curve) may be
improved.
Transferred Knowledge
A      an action set
fea    task features
I      experience instances
model  task model
π      policies
πp     partial policies (e.g., options)
pri    distribution priors
pvf    proto-value functions
Q      action-value function
R      shaping reward
rule   rules or advice
sub    subtask definitions

Table 4.2 Key for Transferred Knowledge, with reference to table 4.6
Source Task Selection
all    all previously seen tasks are used
h      one source task is used (human selected)
lib    tasks are organized into a library
mod    a human provides a source task that the agent automatically modifies

Table 4.3 Key for Source Task Selection, with reference to table 4.6
4. Transfer ratio: The ratio of the total reward accumulated by the transfer learner to the total
reward accumulated by the non-transfer learner.

5. Time to threshold: The learning time needed by the agent to achieve a pre-specified performance
level may be reduced via knowledge transfer.
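Two of these metrics can be computed directly from per-episode reward curves; a sketch (the curve names are illustrative):

```python
def transfer_metrics(rewards_with_tl, rewards_without_tl):
    """Jumpstart and transfer ratio from two equal-length reward curves:
    jumpstart      -> difference in initial (first-episode) performance
    transfer ratio -> total reward with transfer / total reward without"""
    jumpstart = rewards_with_tl[0] - rewards_without_tl[0]
    transfer_ratio = sum(rewards_with_tl) / sum(rewards_without_tl)
    return jumpstart, transfer_ratio
```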
4.5 Transfer scenario in our case
Depending on what kind of knowledge is transferred across the tasks, transfers are divided
primarily into two kinds: structural transfer and experience transfer. Structural transfer involves
the transfer of task representation characteristics, like the state or the action space. Experience
transfer, on the other hand, is the transfer of experience gathered over the training runs.
This involves the transfer of a value function, a policy, or directly samples.
In our case, the transfer is proposed across robots with different sensor ranges. A change in
sensor range is directly coupled to a change in the observation function. The observation function
Allowed Learners
B      Bayesian learner
Batch  batch learner
CBR    case-based reasoning
H      hierarchical value-function learner
LP     linear programming
MB     model-based learner
PS     policy search learner
RRL    relational reinforcement learning
TD     temporal difference learner

Table 4.4 Key for Allowed Learners, with reference to table 4.6

TL Metrics
ap   asymptotic performance increased
j    jumpstart demonstrated
tr   total reward increased
tt   task learning time reduced

Table 4.5 Key for TL Metrics, with reference to table 4.6
is defined as in equation 4.1:

O(s′, a, o) = P(o_{t+1} = o | s_{t+1} = s′, a_t = a) (4.1)

where it describes the probability of getting observation o, given that the agent executed action
a and landed in state s′.
The source and target robots have different sensor ranges, and hence for the same position they
will perceive their environment differently, i.e., the observation functions for the source and the target
robot differ. Since the POMDP is reduced to an MDP over belief states, the belief-state space is
the same for both robots, because the location of both robots is the same. The situation is
depicted in Figures 4.2 and 4.4. The corresponding belief states are shown in section 4.8.
The change in the observation function is modelled as a change in the transition and reward functions
of the belief-state framework, as shown in equations 4.2 and 2.17:

Pr(B′ | B, a) = Σ_{o : B′ = f(B, a, o)} Pr(o | B, a) (4.2)

This equation has been derived in detail in section (2). The reward, as shown in equation 2.17, also
directly depends on the observation function. The state space remains the same since the robots
Citation                          Allowed Task   Source Task   Task       Transferred   Allowed    TL
                                  Differences    Selection     Mappings   Knowledge     Learners   Metrics
[Selfridge et al. (1985)]         t              h             N/A        Q             TD         tt
[Asada et al. (1994)]             si             h             N/A        Q             TD         tt
[Singh (1992)]                    r              all           N/A        Q             TD         ap, tr
[Atkeson and Santamaria (1997)]   r              all           N/A        model         MB         ap, j, tr
[Asadi and Huber (2007)]          r              h             N/A        πp            H          tt
[Andre and Russell (2002)]        r, s           h             N/A        πp            H          tr
[Ravindran and Barto (2003b)]     s, t           h             N/A        πp            TD         tr
[Ferguson and Mahadevan (2006)]   r, s           h             N/A        pvf           Batch      tt
[Sherstov and Stone (2005)]       sf, t          mod           N/A        A             TD         tr
[Madden and Howley (2004)]        s, t           all           N/A        rule          TD         tt, tr
[Lazaric (2008)]                  s, t           lib           N/A        I             Batch      j, tr

Table 4.6 TL methods for transfer across same state variables and actions. N/A implies the absence of any task mapping.
try to localize in the same set of positions. In spite of being in the same location, the reward
function changes due to the change in the belief states estimated from the observations. The change
in the transition function is due to the distance covered by the robot when an action is executed
(since the robot cannot move beyond what it can see, it is limited by its sensor range). This in
turn changes the transition probabilities, since the probability of landing in the next state given the
present state changes. (The robot might fall short and not move to the next cell, but remain in
the same cell even after executing the action.)
As explained above, this problem can be modelled as transfer across different transition and
reward functions. It has been observed that a high-level transfer (rules, policies) is recommended
when the reward function differs between the source and target tasks, whereas a lower-level transfer
(Q-values) is recommended when the transition function differs between the source and target tasks.
In addition, the source and target belief states have to be mapped before the transfer. It will become
clear in the next sections how the algorithm and framework achieve the above.
4.6 Formalism for transfer
Consider two robots: RA, which has a high sensor range, and RB, which has a lower sensor range.
The robots start from the same initial position but, due to the different sensor ranges, generate
different belief states, as shown in figures 4.2 and 4.4. A successful transfer should bias the action
selection in the lookahead step, so that the actions which lead to the maximum discounted rewards in
the future are selected. Referring to the lookahead step in equation 3.1 in algorithm step 2(i) of section
3, there are two components which decide the action to be executed: first, the immediate
rewards as seen from a hypothesis, weighted by the probability of being in that hypothesis;
and second, the goodness (value function output) of the next belief state on executing that action.
Observing equation 3.1, the second term can be used to bias the Q-value. If the function
approximator used to calculate V(b′) already indicates which next belief state will maximize the
reward in future, the robot can execute the corresponding action.
The transfer also requires the association of the correct source and target belief states. It is quite
plausible that the next belief state b′ for RB might not be the same as that of robot RA. This is
generally the case when robot RB is near a unique feature (e.g., an obstacle) and just short of
it (Figure 4.1). In such a case, the belief state of robot RA would show a peak in the probability
of the cell nearest to the obstacle, whereas that of robot RB would not. Here the property of the value
function being a function approximator comes into play: an approximation of the goodness value of
the belief state is calculated even if the belief state was not encountered in the training. There is a good
chance that b′ of RB corresponds to some belief state b′ of robot RA and the set of actions which
maximize the discounted rewards in future are similar.
The third requirement of the transfer is to incorporate the new belief states of robot RB in the value
function, so that it is able to gauge the goodness of the belief states of both robots. This is done by
the update equation 3.2, in step 5(i) of the algorithm in section 3. This results in adjusting the
regressor of the value function (implemented by the CMAC) locally by the belief states of robot RB.
A general algorithm to perform such a transfer is as follows. Let M1 = (S, A, R1, P1) and M2 =
(S, A, R2, P2) be the two MDPs.
Let KM be the knowledge to be transferred.
Let A(KM) be the learning algorithm:

A(KM) = V/Q (4.3)

or

A(KM) = π (4.4)

Let T(KM) be the transfer function:

T(KM) = K′M (4.5)

Then the transfer process can be formulated as:

(a) Collect KM1 from the source task.

(b) Collect KM2 from the target task.

(c) Transfer T(KM1 | KM2) = K′M2.

(d) Learn A(KM2 ∪ K′M2).

(e) Evaluate the performance.
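The five steps can be sketched as a driver function; the task objects and the `learn`/`transfer` callables are illustrative stand-ins for A(·) and T(·), not part of the thesis implementation:

```python
def transfer_process(source_task, target_task, learn, transfer):
    """Sketch of the transfer process (a)-(e). `learn` plays the role of
    the learning algorithm A(K_M); `transfer` the role of T(K_M1 | K_M2)."""
    k_m1 = source_task.collect_knowledge()   # (a) knowledge from the source task
    k_m2 = target_task.collect_knowledge()   # (b) knowledge from the target task
    k_m2_new = transfer(k_m1, k_m2)          # (c) T(K_M1 | K_M2) = K'_M2
    policy = learn(k_m2 | k_m2_new)          # (d) learn A(K_M2 ∪ K'_M2)
    return target_task.evaluate(policy)      # (e) evaluate the performance
```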
To summarize, the algorithm and framework satisfy all the requirements for transfer and hence
can be used with the minimal change of initializing the target Q-values with the source Q-values and
then learning over them. If RA has learned well (to significantly reduce the number of localization
iterations), the transfer is positive and will lead to an improvement in the performance of robot
RB.
Conceptually, new belief states and their associated actions and rewards are added to the value
function, and the existing ones are adjusted with the new knowledge. These steps are repeated until
robot RB gets localized. The transfer is carried out for many initial positions of the robot in the
map. The performance of the target task is compared with that of the source task. The results are
discussed in detail in the results section.
4.7 Algorithm
The framework used is the same as described in section (2). The framework is used to train a single
robot to move in the map such that it localizes in the least number of steps. The clustering method B
described in section 3 (equation 3.4) is used for the transfer. Since there are no robot-robot detections,
the localization is solely influenced by the map features. The algorithm has two phases:
one for training the source robot, and a second for transferring the policy to the target robot. The
pseudocode for the training phase is as follows:

(a) Make observations. Calculate the belief state b.

(b) For every action:

i. For all hypotheses of the robot:

A. Calculate Q(a):

Q(b, a) = Q(b, a) + Σ_{n ∈ N} b(sn) × (R_lookahead + γ V(b′)) (4.6)

(c) Execute the action with the highest value of Q(b, a).

(d) Update V(b) with the reward:

V(b) = α(R + γ × V(b′) − V(b)) + V(b) (4.7)

(e) Repeat until the robot is localized.
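The pseudocode above can be sketched in Python; the `hypotheses`, `lookahead` and value-function callables are illustrative placeholders for the framework's components, not the thesis code:

```python
import random

def q_value(b, a, hypotheses, lookahead, V, gamma=0.9):
    """Equation 4.6: Q(b, a) accumulated over all hypotheses s_n,
    weighted by the belief b(s_n)."""
    q = 0.0
    for s_n, prob in hypotheses(b):
        r_lookahead, b_next = lookahead(b, s_n, a)
        q += prob * (r_lookahead + gamma * V(b_next))
    return q

def select_action(b, actions, hypotheses, lookahead, V,
                  gamma=0.9, epsilon=0.1):
    """Step (c) with epsilon-greedy exploration: with small probability
    pick a random action, otherwise the action of highest Q(b, a)."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions,
               key=lambda a: q_value(b, a, hypotheses, lookahead, V, gamma))

def td_update(v_b, reward, v_b_next, alpha=0.1, gamma=0.9):
    """Equation 4.7: V(b) <- V(b) + alpha * (R + gamma * V(b') - V(b))."""
    return v_b + alpha * (reward + gamma * v_b_next - v_b)
```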
The algorithm starts with the robot acquiring the sensory readings of its surroundings and determining
its belief state. Every non-zero probability value in the belief state represents a hypothesis
about the robot's pose and how probable that hypothesis is. A one-step lookahead is performed
to find the action which is more likely to lead to the highest average gain over all hypotheses. The
most favourable action is then executed and the reward is calculated. This reward acts as feedback
to the goodness value of that belief state. This algorithm is iterated until the robot gets localized.
The training is carried out for many initial poses of the robot. To avoid getting stuck in
local minima at step (c), an ε-greedy exploration strategy is used with the value of ε being 0.1. One could
use strategies like softmax action selection, but ε-greedy is computationally convenient. As found
empirically, a total of around 120 epochs is sufficient for the MDP to converge. A CMAC with
128 tilings, each with 10 × 10 tiles, is used for approximation of the value function.
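A minimal tile-coding (CMAC) approximator for a 2-D input can be sketched as follows; the tiling and tile counts mirror the setup above, but the implementation details (offset scheme, learning-rate sharing) are illustrative assumptions:

```python
import numpy as np

class CMAC:
    """Minimal CMAC: `n_tilings` offset copies of a `tiles` x `tiles` grid
    over inputs in [0, 1). The value of a point is the sum of one weight
    per tiling; updates spread the step size across the tilings."""

    def __init__(self, n_tilings=128, tiles=10, alpha=0.1):
        self.n_tilings, self.tiles = n_tilings, tiles
        self.alpha = alpha / n_tilings          # share the learning rate
        self.w = np.zeros((n_tilings, tiles, tiles))

    def _active(self, x, y):
        for t in range(self.n_tilings):
            off = t / (self.n_tilings * self.tiles)   # per-tiling offset
            i = min(int((x + off) * self.tiles), self.tiles - 1)
            j = min(int((y + off) * self.tiles), self.tiles - 1)
            yield t, i, j

    def value(self, x, y):
        return sum(self.w[t, i, j] for t, i, j in self._active(x, y))

    def update(self, x, y, target):
        """Move value(x, y) towards `target` by a fraction alpha."""
        err = target - self.value(x, y)
        for t, i, j in self._active(x, y):
            self.w[t, i, j] += self.alpha * err
```

The overlapping offset tilings are what give the CMAC its local generalization: nearby inputs share most of their active tiles, so an update at one belief state also adjusts the value of its neighbours.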
4.8 An example
Consider two robots: RA, which has a high sensor range, and RB, which has a lower sensor range,
as shown in figures 4.2 and 4.4.
Figure 4.2 Robot RA with a high sensor range
Figure 4.3 The belief state for the Robot RA with a high sensor range
The belief states for both robots are shown in figures 4.3 and 4.5.
For a 6 × 10 (6 rows and 10 columns) cell framework, the belief states for the robots are:
B(RA) =
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0.068 0.112 0.194 0.116 0.084 0.214 0.097 0.116 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
(4.8)
Figure 4.4 Robot RB with a low sensor range
Figure 4.5 The belief state for the robot RB with a low sensor range
and
B(RB) =
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0.067 0.105 0.189 0.108 0.0821 0.200 0.094 0.108 0
0 0 0 0 0 0 0 0 0 0
0 0 0.0072 0.0072 0.0079 0.0033 0.011 0 0.0079 0
0 0 0 0 0 0 0 0 0 0
(4.9)
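The two belief states can also be compared quantitatively. The sketch below builds the grids from equations 4.8 and 4.9 with NumPy; the entropy measure is an illustrative way to quantify ambiguity, not part of the algorithm itself.

```python
import numpy as np

# Non-zero cells of B(RA) (equation 4.8): row 3 only.
b_ra = np.zeros((6, 10))
b_ra[2, 1:9] = [0.068, 0.112, 0.194, 0.116, 0.084, 0.214, 0.097, 0.116]

# Non-zero cells of B(RB) (equation 4.9): row 3 plus extra hypotheses in row 5.
b_rb = np.zeros((6, 10))
b_rb[2, 1:9] = [0.067, 0.105, 0.189, 0.108, 0.0821, 0.200, 0.094, 0.108]
b_rb[4, 2:9] = [0.0072, 0.0072, 0.0079, 0.0033, 0.011, 0, 0.0079]

def entropy(belief):
    """Shannon entropy of the non-zero hypothesis probabilities."""
    p = belief[belief > 0]
    return -(p * np.log(p)).sum()
```

RB's lower sensor range leaves additional low-probability hypotheses in row 5, so its belief state has more non-zero cells and higher entropy, i.e. it is more ambiguous than RA's.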
Referring to the algorithm in the section above, transfer is done by initializing the value function
of robot RB with that of robot RA. The Q-values are generated for all actions using
equation 4.6, and the action with the highest Q-value is executed. The next belief state is calculated
Iteration   Q-values [a1, a2, a3, a4]          Action selected
1           [ 0.197,  0,      0.319,  0.431]   a4
2           [ 1.65,   0.144,  0.517,  0.187]   a1
3           [ 0.09,   0.03,   0.144,  0.126]   a3
4           [ 1.55,   0.91,   1.11,   1.54 ]   a1
5           [-0.206,  0.68,  -0.66,   0.416]   a2
6           [-0.59,   0,      0,     -0.578]   a4
7           [ 0.48,  -0.056,  0.63,  -0.899]   a3

Table 4.7 Q-values for RB with transfer
Iteration   Q-values [a1, a2, a3, a4]          Action selected
1           [ 0.188,  0,      0.316,  0.418]   a4
2           [ 1.64,   0.136,  0.513,  0.181]   a1
3           [ 0.041, -0.23,  -0.09,   0.06 ]   a4
4           [ 0.162,  0.071,  0.06,   0.141]   a1
5           [ 0,      0.024,  0.101,  0    ]   a3
6           [-0.42,   0.14,  -0.46,  -0.059]   a2
7           [ 0.627,  0,      0,      0.414]   a1
8           [ 0.209,  1.15,   0.130,  0    ]   a2
9           [ 0.064,  0,      0,     -0.814]   a1
10          [ 0.043,  0.084,  0.052,  0    ]   a2

Table 4.8 Q-values for RB without transfer
after the action execution. The next belief state for robot RB is as shown:
B′(RB) =
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0.097 0.074 0.149 0.141 0.128 0.064 0.222 0.081 0
0 0 0.004 0.008 0.005 0.008 0.004 0.009 0.005 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
(4.10)
The reward is calculated for this transition and the value function is updated. This constitutes a
single iteration.
As seen in tables 4.7 and 4.8, robot RB executes the same action for the first two iterations. In
the third iteration, due to the bias of the transferred V(b) values, the Q-function attains a higher
value for a3, which leads to faster localization (in 7 steps, as compared to 10 steps without
transfer).
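Assuming the value function is stored as a CMAC weight array, the transfer step itself amounts to copying the trained source weights into the target; the sketch below is a hypothetical illustration of that initialization, with made-up weight shapes.

```python
import numpy as np

def transfer_value_function(source_weights):
    """Initialize the target robot's value-function weights with a copy of
    the trained source robot's weights, instead of zeros (learning from
    scratch). Later TD updates then refine rather than rebuild V(b)."""
    return np.array(source_weights, copy=True)

# Hypothetical illustration: RA's trained CMAC weights seed RB's value function.
w_ra = np.random.default_rng(0).normal(size=(128, 10, 10))
w_rb = transfer_value_function(w_ra)
```

The copy (rather than a shared reference) matters: RB's subsequent learning must not disturb RA's already-trained policy.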
4.9 Experiments and Results
This section presents the empirical results obtained for various experiments done with transfer
learning. The results are broadly divided into two subsections: the first shows that
the general trends of transfer metrics hold in an active localization task; the second covers
trends in transfer learning that are specific to the task of active localization.
4.9.1 Transfer Metrics
The localization performance (PL) is calculated as:

PL = 1 / IL        (4.11)

where IL is the average number of iterations required to localize a robot.
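As a quick illustration of equation 4.11, using the 7-step (with transfer) and 10-step (without transfer) localization runs from tables 4.7 and 4.8:

```python
def localization_performance(iterations_per_run):
    """Equation 4.11: PL = 1 / IL, where IL is the average number of
    iterations needed to localize, over a set of runs."""
    i_l = sum(iterations_per_run) / len(iterations_per_run)
    return 1.0 / i_l

# Fewer iterations to localize means higher performance:
pl_with_transfer = localization_performance([7])     # 1/7
pl_without_transfer = localization_performance([10])  # 1/10
```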
Figure 4.6 Performance comparison (with transfer learning vs. without transfer learning)
The graph in figure 4.6 compares localization performance with and without
transfer learning as the number of training epochs increases.
The MDP without transfer learning was trained from scratch for the new observation function
(i.e., for the different sensor range). The MDP with transfer learning was initialized with the CMAC
weights learned for the earlier observation function, and the new learning was bootstrapped over them.
The number of training epochs can be viewed as the system's experience, which keeps increasing as it
encounters different belief states and the corresponding transitions. As expected, with the increase
in experience, the performance of the system improves. This observation is consistent for
both MDPs, irrespective of whether transfer is performed. It was also
observed that, with transfer learning, performance receives a jumpstart near the zero end of the
training epochs. This matches intuition: the older MDP already "knows" the
environment and hence can partially steer the robot without any learning. The MDP trained with
transferred knowledge also displays a higher learning rate, as its learning curve shows a better
bias than that of an MDP which starts learning from scratch. Transfer
learning likewise shows an increase in average reward, due to the improved initial
performance and the faster learning rate; this is evident from the larger area under the
learning curve with transfer than without.
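The transfer benefits described above (jumpstart, faster learning, higher total reward) can be read off a pair of learning curves; the sketch below uses hypothetical curve data shaped like the comparison in figure 4.6, not the thesis' actual measurements.

```python
import numpy as np

def transfer_metrics(curve_transfer, curve_scratch):
    """Compare two learning curves (performance per training epoch):
    jumpstart = gap in initial performance at epoch zero,
    auc_gain  = gap in total area under the curves (total-reward proxy)."""
    t = np.asarray(curve_transfer, dtype=float)
    s = np.asarray(curve_scratch, dtype=float)
    return {"jumpstart": t[0] - s[0],
            "auc_gain": t.sum() - s.sum()}

# Hypothetical curves: transfer starts higher and improves faster.
metrics = transfer_metrics([0.05, 0.09, 0.12, 0.14],
                           [0.01, 0.03, 0.06, 0.10])
```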
4.9.2 Task specific trends
Since the transfer learning in this task is primarily based on the observation function, performance
is compared across different sensor ranges. The average number of iterations
required to localize is measured against the ratio between the source and the target robot's
sensor ranges. The graph in figure 4.7 shows the trend: transfer learning
becomes more effective as the ratio approaches 1. This trend is to be expected, since
transfer learning should become more effective as the source and target tasks become more
similar.
It was also observed that the effects of transfer learning were more significant on sparse maps than
on cluttered maps. The two maps used for comparison are shown in figure 4.8.
The learning curves for the maps are shown in figure 4.9. It can be observed that the difference
in learning rates with and without transfer learning is larger for the sparse map than for the
cluttered map.
Figure 4.7 Performance vs. ratio of source and target sensor ranges of the robot

Figure 4.8 Maps: relatively cluttered map (top), sparse map (bottom)

The trend can be explained by closely examining equation 4.6. The number of iterations
required to localize depends directly on the action selected at every state, which in turn depends on
Q(b, a) (equation 4.6). The Q-value is composed of two parts: the immediate
reward R_lookahead, obtained by observing the environment, and the goodness value of the next
belief state on executing action a, supplied by the value function V(b′). In a cluttered map, the
immediate-reward component dominates the equation and leaves less room for V(b′) to influence the
resulting Q(b, a). Another reason may be that, for a sparse map, there is relatively little
dissimilarity between the belief state obtained with a large sensor range and the one obtained with
a small sensor range. In that case, the association made by V(b′) is better and transfer takes place
more effectively. Hence there was less difference in localization performance between the MDPs with
and without transfer learning on cluttered maps. However, the effect of transfer was evident in a
comparison of Q-values: in MDPs with transferred V(b) values, the Q-value of the "favourable"
action was significantly higher than that of the other actions, increasing its probability of being
selected for the corresponding belief state. In sparse maps, by contrast, the lack of unique
features results in low reward values, enabling the V(b) values obtained by transfer to dominate
the action selection.
Figure 4.9 Learning curves of Map 1 (relatively cluttered, top) vs. Map 2 (sparse, bottom)
Chapter 5
Conclusions
Mobile robot localization is an instance of the general localization problem, which is the most
basic perceptual problem in robotics. Unfortunately, the pose usually cannot be sensed
directly: most robots do not possess a noise-free sensor for measuring pose, so the pose has to be
inferred from data. In environments that possess relatively few features enabling a robot to
unambiguously determine its location, global localization algorithms can therefore produce multiple
hypotheses about the robot's location. The idea is that by making more observations of features in
the environment and matching them against the hypotheses, the hypothesis corresponding to the true
pose of the robot gains the most evidence, making it distinguishable from the other, false,
hypotheses. This methodology is called active localization.
It has been observed that active localization can be performed more efficiently if multiple robots
are involved: the ambiguity in position is reduced with the help of robot-robot
detections along with the map features. A centralized method for this problem maintains
the states/hypotheses of all robots and chooses the action that yields the highest gain
in terms of localization. This method is expensive in the sense that a large number of computations
is required to compare the hypotheses of multiple robots, and these computations increase
exponentially with the number of robots. An alternative is a distributed
algorithm in which the robots are completely independent, but such methods suffer from
large amounts of message passing.
A learning method and framework is proposed such that the algorithm is semi-distributed in na-
ture; the loose coupling between the robots accelerates the convergence of the algorithm, and once
trained, it is independent of the number of robots. The thesis aims at designing a learning frame-
work able to implement active localization for single and multiple robots. The major
problem faced is that of learning over huge grid-based belief-state spaces, so as to avoid localizing
in terms of coarser map units (e.g., in terms of cubicles). To solve this problem, a system of hi-
erarchical belief states is used, such that learning takes place over a coarse belief-state space
whereas localization occurs at a much finer level.
The framework was successfully tested for multi-robot active localization. The test results showed
that the method is an effective way to perform active localization for multiple robots without
dealing with huge joint state and action spaces. Since the algorithm aims at learning the unique
features in the map, be they obstacles or other robots, it performs a kind of feature learning
and hence is scalable to any number of robots, or to their absence. That is to say, the system
does not need to be retrained for every change in the number of robots: once it is trained with a
number of robots sufficient to capture the basic features of the map, the system works for a large
range of robot counts without retraining.
The framework was also applied to transferring the knowledge of localizing in a given map
across robots with dissimilar sensor capabilities. In a multi-robot scenario where the robots have
dissimilar sensor capabilities, the knowledge about global localization in a map cannot be reused
directly, and the robots would have to be retrained from scratch. This thesis explores the
possibility of transferring the knowledge about globally localizing in a map from one trained robot
to all other robots using the same map, irrespective of their sensor ranges, thus avoiding the
computationally expensive training iterations. The results discussed show that transfer learning
accelerates the process of learning localization by a significant amount. They also show that the
transfer depends largely on the similarity of the observation functions of the tasks. It was further
observed that transfer learning is more effective in maps devoid of many features, which is
favourable and supports the need for transfer learning in a global localization task.
The method presented here will find immediate use in various navigation and path-planning re-
lated problems that involve the use of autonomous heterogeneous robots.
5.1 Limitations and Scope of future work
There are some limitations to this work. Firstly, the assumption is made that the orientation is
known. Including orientation in the robot's pose (x, y, θ) would complicate the belief-state
estimation even further. Though inclusion of the orientation in the framework would be a great
asset, it is quite challenging and can be covered in future extensions of this work.
Secondly, the transfer learning across belief states has not been formulated completely. The
results provided for the tests are empirical and form the basis of the analysis, unlike the
formulation of the framework itself. A complete formulation can be attempted as future work,
using the results obtained as a starting point in the aforementioned line of research.
Related Publications
Jyotika Bahuguna, B. Ravindran and K. Madhava Krishna, MDP based Active Localization for
multiple robots, 2009, Fifth Annual IEEE Conference on Automation Science and Engineering
(CASE).
Jyotika Bahuguna, B. Ravindran and K. Madhava Krishna, Transfer Learning based Active local-
ization of robots with different sensor capabilities, 2010, ICARCV (Submitted).