A Markov Decision Process (MDP) framework for active localization
Thesis submitted in partial fulfillment
of the requirements for the degree of
M.S. by Research
by
Jyotika Bahuguna
Robotics Research Lab
International Institute of Information Technology
Hyderabad - 500 032, INDIA
May 2010
Copyright © Jyotika Bahuguna, 2010
All Rights Reserved
International Institute of Information Technology
Hyderabad, India
CERTIFICATE
It is certified that the work contained in this thesis, titled “A Markov Decision Process (MDP) framework
for active localization” by Jyotika Bahuguna, has been carried out under my supervision and is not
submitted elsewhere for a degree.
Date Adviser: Dr. Madhava Krishna
Dedicated to my family and friends, who refused to give up on me.
Acknowledgments
I thank my guide Dr. Madhava Krishna for his constant support during my research, from showing flexibility in letting me choose my research area to ensuring that I was making progress.
I thank my co-advisor Dr. Balaraman Ravindran, who has been an inspiration to me. His constant pursuit of excellence forced me multiple times to raise my standards, and I ended up learning a lot each time. I will especially cherish the brainstorming sessions with him.
Last but not least, I thank my family and friends, who were always around whenever I needed them.
Abstract
Robot localization is the problem of estimating a robot’s coordinates relative to an external reference
frame, say a map. Given a map of its environment, the robot has to localize itself relative to this map by
consulting its sensor data. The task of the robot is to find out where it is, through sensing and motion.
Localization has proved to be the most critical aspect of mobile robot navigation. There are different forms of the localization problem: position tracking, which seeks to compensate for small dead reckoning errors (odometry errors) under the assumption that the initial robot pose is known, and global self-localization, which addresses the problem of localization with no a priori information, i.e., a robot is not told its initial pose but instead has to determine it from scratch.
In environments which possess relatively few features that enable a robot to unambiguously determine its location, global localization algorithms can result in multiple hypotheses about the location of a robot. This is inevitable, as the local environment seen by a robot repeats at several parts of the map. Such a situation occurs commonly in indoor navigation, such as in corridors and office rooms. For effective localization, the robot has to be actively guided to those locations where there is a maximum chance of eliminating most of the ambiguous states. This task is often referred to as 'active localization'. Active localization is based on the decisions of 'where to move' and 'where to look' so as to best localize the robot. These actions mainly comprise moving towards unique features in the map (e.g., distinct obstacles in a static environment), such that the sensor readings become unique and hence the robot localizes.
This thesis presents a learning framework for implementing active localization. In a given map, the framework should be able to determine what actions to execute in a state, such that localization is accelerated. Since global localization poses the problem of perceptual aliasing, a belief state instead of a state is used to describe a robot's position in the map. A belief-state MDP (Markov Decision Process) is used to capture the mapping between the robot's belief state(s) and action(s) which will maximize the expected discounted rewards. The novelty of this work lies in the fact that it proposes a hierarchical belief-state framework for the MDP to perform the active localization task. Markov localization is typically used to estimate a belief state from a given set of the robot's sensor readings. This belief-state space, being extensive, is represented by yet another hierarchical belief-state space, which makes the MDP more tractable. Hence the framework enables localization with the accuracy of fine grid granularity (as in Markov localization) without having to handle large-state-space MDPs. A detailed mathematical formulation of the framework has been provided.
The feasibility of this framework is shown by testing it on different aspects of the active localization task. It is used to perform multi-robot active localization, where the framework not only learns to seek out unique features in the map, but also learns to detect other robots, thereby facilitating localization even in the absence of map features. It is observed that the multi-robot localization algorithm proves to be an intuitive, semi-distributed algorithm which is able to attain autonomy with respect to the number of robots. That is to say, unlike a centralized algorithm that requires re-calculation from scratch with a change in the number of agents involved, this algorithm can be re-used without re-training for a range of numbers of robots. The localization performance is analyzed by comparing it with a learning framework which uses a random walk.
The framework is also used to perform transfer of the learned localization policy from a robot with higher sensor capabilities to one with lower sensor capabilities, thereby enabling the possibility of accelerating localization in a setup with heterogeneous robots. The change in sensor range is treated as a change in the observation function in the MDP context, and a transfer policy is designed to transfer the knowledge between the two belief-state MDPs. The localization performance with transfer is analyzed by comparing it with the performance when learning is done from scratch without any transfer. It is observed that transfer leads to a better learning rate, as the performance reaches an optimum in a lower number of training iterations. One of the observations also shows the effect of map type on transfer and suggests that the advantages of transfer are more predominant in the case of sparse maps than dense maps. The contribution here lies in attempting transfer across belief-state MDPs, since transfer has been limited to just MDPs till now. Also, the transfer aspect is applied and analyzed for active localization for the first time.
Contents

1 Introduction
  1.1 Localization
    1.1.1 Passive Localization
    1.1.2 Active Localization
  1.2 Markov Decision Process (MDP)
    1.2.1 Belief-state MDP
  1.3 Problem statement and Contributions
  1.4 Organization of thesis
2 Framework Design
  2.1 Markov Localization
  2.2 Function approximator (CMAC)
  2.3 Solving MDP(s)
  2.4 Action space
  2.5 Rewards
  2.6 Observations
  2.7 Framework formulation
3 Multi-robot localization
  3.1 Problem statement
  3.2 Related Work
  3.3 Problem description
  3.4 Algorithm
  3.5 Simulation Snapshots
  3.6 Experiments and Results
    3.6.1 On different kinds of maps
    3.6.2 With different numbers of robots
    3.6.3 With different cell sizes
    3.6.4 With different kinds of clustering
    3.6.5 Comparison with centralized method
  3.7 Observations
    3.7.1 Self-balancing property of the framework
4 Transfer Learning
  4.1 Problem statement
  4.2 Related Work
  4.3 Problem description
  4.4 Background on Transfer learning
  4.5 Transfer scenario in our case
  4.6 Formalism for transfer
  4.7 Algorithm
  4.8 An example
  4.9 Experiments and Results
    4.9.1 Transfer Metrics
    4.9.2 Task specific trends
5 Conclusions
  5.1 Limitations and Scope of future work
Bibliography
List of Figures

1.1 An example of global localization using sonar [1]
1.2 Example situation that shows a typical belief state during global localization in a locally symmetric environment [1]
1.3 Probability density function of robot position in continuous and discrete maps. (a) Continuous map with single hypothesis, (b) Continuous map with multiple hypotheses, (c) Discretized map with probability distribution and (d) Discretized topological map with probability distribution [2]
2.1 Map discretized into grids by Markov Localization
2.2 Multimodal distribution of belief returned by the passive Markov localization. W is a wall at the top, disconnected from the rest of the environment. The actual robot position is shown by a thick black dot (third cluster from left on top, labeled R)
2.3 Cell built over the fine grid structure in a map
2.4 CMAC with one-dimensional input and output. The first set of layers receiving the input are tiles
2.5 Reinforcement learning
2.6 Grid world example. Agent should learn to reach the goal state G
2.7 Reward distribution on transitions among the grids
2.8 The value function solution for the grid world example. The number in each grid depicts the goodness value of that grid, i.e., how beneficial it is to be in that grid/state
2.9 The Q-value function solution for the grid world example. The number in each grid depicts the goodness value of being in that state and the action executed, as shown by the arrow
2.10 Action space for the robot. (Axes denote the actions)
3.1 Cubicles in an office
3.2 Trees in an open space
3.3 Scene of a warehouse
3.4 Initial hypotheses of the robot
3.5 Sensor scan is obtained from each virtual position
3.6 Execute the decided action
3.7 Robot localized (only one hypothesis left)
3.8 Map A (arrows show the direction preferred by the robots)
3.9 Map B (arrows show the direction preferred by the robots)
3.10 Trend showing the average number of localization iterations vs. number of robots
3.11 The position of the robot referred to in table 3.1
3.12 Hypotheses and frontiers of the robot referred to in table 3.1
3.13 Localization performance with respect to cell sizes
3.14 Trend of localization performance vs. cell size (Map B)
3.15 Localization performance vs. clustering methods for various cell sizes (Map A)
3.16 Number of iterations to localize per robot, compared for semi-distributed and centralized methods
3.17 State transitions (red numbers above the arrows indicate the action taken)
3.18 Variation in the Q(a) behavior with respect to the state transition
4.1 Difference in sensing ranges of Sonar and Laser Range Finders
4.2 Robot RA with a high sensor range
4.3 The belief state for the robot RA with a high sensor range
4.4 Robot RB with a low sensor range
4.5 The belief state for the robot RB with a low sensor range
4.6 Performance comparison (with transfer learning vs. without transfer learning)
4.7 Performance vs. ratio of source and target sensor ranges of the robot
4.8 Maps: relatively cluttered map (top), sparse map (bottom)
4.9 Learning curves of Map 1 (relatively cluttered, top) vs. Map 2 (sparse map)
List of Tables

3.1 For robot position (225,150) as shown in 3.12, rewards and V(b′) with respect to the cell size
3.2 For robot position (225,150) as shown in 3.11, Q-values and action selected with respect to the cell size
3.3 Localization performance with respect to change in cell sizes for Map A as shown in figure 3.11
3.4 Localization performance with respect to change in cell sizes for Map B
3.5 Localization performance with respect to change in cell sizes and clustering methods (Map A)
4.1 Key for Allowed Task Differences, with reference to table 4.6
4.2 Key for Transferred Knowledge, with reference to table 4.6
4.3 Key for Source Task Selection, with reference to table 4.6
4.4 Key for Allowed Learners, with reference to table 4.6
4.5 Key for TL Metrics, with reference to table 4.6
4.6 TL methods for transfer across same state variables and actions. N/A implies the absence of any task mapping
4.7 Q-values for RB with transfer
4.8 Q-values for RB without transfer
Chapter 1
Introduction
Robotics is the science of perceiving and manipulating the physical world through computer-controlled devices. Examples of successful robotic systems include mobile platforms for planetary exploration, industrial robotic arms in assembly lines, cars that travel by themselves, and manipulators that assist surgeons. Robotic systems are situated in the physical world, perceive information about their environments through sensors, and manipulate it through physical forces [1].
In order to operate autonomously, robots should be able to accommodate large amounts of uncertainty in the physical world. The uncertainty might lie in the surrounding environment, the robot's sensors, its actuation, or its software and algorithmic approximations. For more generalized applications, where these uncertainties cannot be engineered away, robots will be forced to take the right decisions in spite of the uncertainty in their sensors and internal models. This problem is becoming more relevant with the need for more and more robotic applications in external environments.
Probabilistic robotics is a relatively new approach to robotics that pays tribute to the uncertainty in robot perception and action. The key idea in probabilistic robotics is to represent uncertainty explicitly using the calculus of probability theory. In other words, instead of relying on a single "best guess" as to what might be the case, probabilistic algorithms represent information by probability distributions over a whole space of guesses. By doing so, they can represent ambiguity and degree of belief in a mathematically sound way [1].
1.1 Localization
In applications which require reliable navigation and mapping, determining the robot's pose becomes crucial. Mobile robot localization is the problem of determining the pose of a robot relative to a given map of the environment. It is often called position estimation, which is done by consulting the sensor readings of the robot. Unfortunately, the robot's pose cannot be estimated from a single sensor measurement, either because of noisy sensor data and/or the nature of the map (it has similar sub-parts). Not every localization problem is equally hard. It depends on the nature of the environment and the initial knowledge that a robot may possess with respect to the localization problem. The types of localization problems, depending on the type of knowledge available a priori and at run-time, are listed below.
Position tracking – which seeks to compensate for small dead reckoning errors (odometry errors) under the assumption that the initial robot pose is known. Localizing the robot is done by accommodating the noise in robot motion. This is based on the assumption that the pose error is small. The pose uncertainty is often approximated by a unimodal distribution (e.g., a Gaussian). The position tracking problem is a local problem – often called local localization – since the uncertainty is local and confined to a region near the robot's true pose.

Global self-localization – which addresses the problem of localization with no a priori information, i.e., a robot is not told its initial pose but instead has to determine it from scratch. An example of global localization using a robot with sonar sensors is shown in figure 1.1. This problem is more difficult, since the error in the robot's estimate cannot be assumed to be small. Consequently, a robot should be able to handle multiple, distinct hypotheses – making unimodal probability distributions usually inappropriate.
Figure 1.1 An example of global localization using sonar [1]
1.1.1 Passive Localization
In passive localization, the localization module only observes the robot operating. The robot is controlled through some other means, and its motion is not aimed at facilitating localization. For example, the robot might move randomly or perform its everyday tasks.
1.1.2 Active Localization
This set of localization problems is based on the fact that the localization algorithm controls the motion of the robot. Active localization algorithms control the robot so as to minimize the localization error and/or the costs arising from moving a poorly localized robot into a hazardous place.

Active approaches to localization usually yield better localization results than passive ones. Active localization is based on the decisions of 'where to move' and 'where to look' so as to best localize the robot.
In environments which possess relatively few features that enable a robot to unambiguously determine its location, global localization algorithms can result in multiple hypotheses about the location of a robot. This is inevitable with global localization algorithms, as the local environment seen by a robot repeats at other parts of the map. Such a situation occurs commonly in indoor navigation, such as in corridors and office rooms. An example of the same is shown in figure 1.2 – a robot is located in a symmetric corridor, and its belief is centered at two (symmetric) poses. Thus, for effective localization, the robot has to be actively guided to those locations where there is a maximum chance of eliminating most of the ambiguous states – a task often referred to as 'active localization'.
Figure 1.2 Example situation that shows a typical belief state during global localization in a locally symmetric environment [1]
Such a scenario involving multiple hypotheses can be represented and worked upon with the help of the probabilistic framework of Markov localization. In Markov localization, the robot maintains a probability density (belief) over the entire configuration space of the robot. The probabilistic paradigm represents the robot's momentary belief by a probability density function over the space of all locations. This is illustrated in figure 1.3. It shows belief representations of the robot position (1D) in continuous and discrete maps: (a) continuous map with a single hypothesis, (b) continuous map with multiple hypotheses, (c) discretized map with probability distribution, and (d) discretized topological map with probability distribution.
1.2 Markov Decision Process (MDP)
An MDP is defined as a 4-tuple (S, A, P, R), in which S is the finite set of states, A is the finite set of actions, P is the transition probability function, and R is the reward function. If the state and action spaces are finite, then it is called a finite Markov decision process (finite MDP).

Figure 1.3 Probability density function of robot position in continuous and discrete maps. (a) Continuous map with single hypothesis, (b) Continuous map with multiple hypotheses, (c) Discretized map with probability distribution and (d) Discretized topological map with probability distribution [2]

The dynamics of the environment are defined by the transition probability function

P : S × A × S → [0, 1]    (1.1)

The reward function is defined as a real-valued bounded function

R : S × A × S → ℝ    (1.2)
An MDP satisfies the Markov property, which states that the one-step dynamics suffice to predict the next state and the expected next reward given the current state and action. The objective is to find a value function V(s), a function of states that estimates how good it is for the agent to be in a given state. The notion of "how good" here is defined in terms of the future rewards that can be expected or, to be precise, in terms of expected return. Of course, the rewards that the agent can expect depend on what actions it will take. Accordingly, value functions are defined with respect to particular policies. A policy, π, is a mapping from each state s ∈ S and action a ∈ A to the probability π(s, a) of taking action a in state s [3]. A value function V^π(s) hence determines the value of state s under the policy π, i.e., the expected return when starting in s and following the policy π thereafter. The solution of the value function is given by Bellman's optimality equation. For all s ∈ S:

V(s) = max_{a ∈ A} Σ_{s′} P^a_{ss′} [ R^a_{ss′} + γ V(s′) ]    (1.3)
The Bellman equations can be solved using iterative algorithms, where we start with an initial guess V_0 and iterate for every state s:

V_{k+1}(s) = max_{a ∈ A} Σ_{s′ ∈ S} P^a_{ss′} [ R^a_{ss′} + γ V_k(s′) ]    (1.4)

As k goes to infinity, V_k converges to the optimal value function V*.
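The iteration in equation (1.4) can be sketched as a short value-iteration routine. The transition and reward arrays below are hypothetical toy values chosen for illustration, not from the thesis; the sketch assumes P and R are indexed as [s, a, s′].

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """Iterate V_{k+1}(s) = max_a sum_s' P[s,a,s'] (R[s,a,s'] + gamma V_k(s'))."""
    n_states = P.shape[0]
    V = np.zeros(n_states)
    while True:
        # Q[s, a] = sum over s' of P[s,a,s'] * (R[s,a,s'] + gamma * V[s'])
        Q = np.einsum('ijk,ijk->ij', P, R + gamma * V[None, None, :])
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)   # optimal values, greedy policy
        V = V_new

# Toy 2-state, 2-action MDP: landing in state 1 yields reward 1.
P = np.array([[[0.8, 0.2],    # from s=0, action 0
               [0.1, 0.9]],   # from s=0, action 1
              [[1.0, 0.0],    # from s=1, action 0
               [0.0, 1.0]]])  # from s=1, action 1 (stay)
R = np.zeros((2, 2, 2))
R[:, :, 1] = 1.0
V, policy = value_iteration(P, R)
print(policy)  # both states prefer the action heading toward state 1
```

The stopping test bounds the distance to the fixed point because the Bellman backup is a γ-contraction, which is exactly the convergence claim made above.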
1.2.1 Belief-state MDP
A POMDP (Partially Observable Markov Decision Process) model decisionproblems in which an
agent tries to maximize its reward in the face of limited and/or noisy sensor feedback, i.e state is only
partially observable. A POMDP is a tuple( S,A,T,R,O,ρ ) where S is a set of states, A a set of actions
andρ a set of observations. The functions T and R define a MDP with which the agent reacts without
direct information as to the current state. The agent’s decisions are made based on information from its
sensors (observations) formalized by:
O : S × A × S → Π(ρ) (1.5)
The algorithms to solve a POMDP are computationally intensive and hence a POMDP is generally
converted to a belief-state MDP. A belief state is a probability distribution over allthe states of the
underlying environment. The notationb(s) is used to indicate the agent’s belief that it is in stateswhen
current belief state isb ∈ Π (S). Using the model, belief states can be updated based on the action’s
observations in a way that makes the beliefs correspond exactly to state occupation probabilities.
All future references to MDP should be assumed as a belief-state finite MDP, unless specified.
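The belief update mentioned above is a Bayes filter over the underlying states. A minimal sketch, assuming (purely for illustration) array conventions T[s, a, s′] for the transition model and O[s′, a, o] for the observation model:

```python
import numpy as np

def belief_update(b, a, o, T, O):
    """Update belief b after taking action a and observing o.

    T[s, a, s2] -- transition model of the underlying MDP
    O[s2, a, o] -- probability of observing o in state s2 after action a
    """
    b_pred = T[:, a, :].T @ b        # prediction step: sum_s T(s,a,s') b(s)
    b_new = O[:, a, o] * b_pred      # correction by the observation likelihood
    return b_new / b_new.sum()       # renormalize to a probability distribution

# Two aliased states, one action, two observations that disambiguate them.
T = np.eye(2).reshape(2, 1, 2)               # the robot stays where it is
O = np.array([[[0.9, 0.1]],                  # O[s2, a, o]
              [[0.1, 0.9]]])
b = np.array([0.5, 0.5])                     # perceptually aliased belief
b = belief_update(b, a=0, o=0, T=T, O=O)
print(b)   # belief shifts toward state 0
```

The normalization makes b a proper distribution again, so the belief entries correspond to state-occupation probabilities as stated above.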
1.3 Problem statement and Contributions
The objective of this thesis is to design a learning framework for the problemof active localization.
The frameworkshould be able to handle some important aspects of active localization. The advantage
of a such a framework is finding applications of research in the learning domain in active localization
context. The framework covers ground on two aspects of active localization. Initiating from localization
of a single robot, it is able to take advantage of multiple robots in the environmentand accelerate the
localization in a distributed algorithm. Secondly, the concept of transfer in MDP(s) is successfully used
to transfer the knowledge of ”how to localize” among robots with different sensory capabilities for a
given map. Hence the thesis deals with two sub-problems:
1. Given a workspace consisting of ’n’ robots, devise a learning policy thatsuch localization is
accelerated by taking the advantage of unique map features and robot-robot detections, given a
map.
2. Given two robots with different sensor ranges, transfer the policy to localizefrom one robot to
another , given a map.
Since the temporal relations to be modelled in the above cases are relative, an MDP is an ideal choice. Also, since no information is available a priori, a condition of perceptual aliasing (many different places can be perceived to be the same) is imposed. Each of these perceived places can be thought of as a hypothesis. A belief state, hence, is a probability distribution over these hypotheses. A belief-state MDP can be used to incorporate this uncertainty. The Markov localization technique is used for belief-state estimation.
In general, work on active localization has been limited compared to passive localization. The only known work on active localization of multiple robots used a centralized algorithm that used robot-robot detections for an optimal way to localize in a minimum number of steps [4]. Being a centralized algorithm, it suffers the disadvantage of an increase in the number of computations with an increase in the number of robots. A learning framework introduces the option to use a distributed algorithm to perform the same task. Using a learning framework to perform active localization for multiple robots is a novel approach, as far as the knowledge of the author goes. A detailed analysis of the implications of using such a framework for the task is an added contribution.
Also, the idea of re-using the localization knowledge across robots with different sensor ranges presented in this thesis is novel. This bit of research is also significant from the machine learning point of view, since till date transfer was limited to MDPs and not belief MDPs.
1.4 Organization of thesis
The present chapter serves as an introduction. Section 1.1 gives background on the problem of localization and the motivation behind the need to solve it. Section 1.2 introduces the generic Markov Decision Process (MDP) and the kinds of problems that can be solved using it. Section 1.3 puts forth the problem statement and discusses the contributions of this thesis. Section 1.4 describes the organization of this thesis.

Chapter 2 describes the framework design. Section 2.1 describes the method of Markov localization, which is central to the framework. Section 2.2 describes the function approximator, the Cerebellar Model Articulation Controller (CMAC), and its role in the framework. Section 2.3 describes the working of a typical MDP in detail. Sections 2.4, 2.5 and 2.6 introduce the concepts of MDP action space, rewards and observations in the active localization context. Section 2.7 formulates the framework from the consolidated knowledge of all the above-mentioned sections.
Chapter 3 describes how the problem of active localization involving multiple robots can be solved
using the MDP framework. Section 3.1 states the problem statement. Section 3.2 discusses the previous
work done in the area. Section 3.3 discusses the problem statement in detail. Section 3.4 explains the
algorithm for multi-robot active localization. Section 3.5 explains the various stages of the algorithm with the help of snapshots. Section 3.6 explains the various experiments done to analyze the method. Section
3.7 describes some interesting observations made during the experiments.
Chapter 4 describes the concept of transferring the knowledge of "how to localize" across robots with different sensor capabilities. Section 4.1 states the problem statement. Section 4.2 describes the related work in the transfer learning and active localization context. Section 4.3 describes the problem statement. Section 4.4 gives background on transfer learning methods, metrics and parameters. Section 4.5 explains the scenario of transfer handled in the thesis. Sections 4.6 and 4.7 explain the algorithm to perform the transfer with the existing framework. Section 4.8 explains the above with the help of an example. Section 4.9 describes the experiments performed to verify transfer and the results obtained.
Finally, chapter 5 gives the concluding remarks and provides pointers for further research.
Chapter 2
Framework Design
This chapter explains the framework details. A belief-state MDP requires the belief state to be estimated from a set of sensor readings. This is done by means of passive Markov localization.
2.1 Markov Localization
Markov localization discretizes the map into grids of adequately small size, so that localization
within a grid is unimportant. Each grid is represented by an (x, y) value, which can be the centre of the
grid. The discussed structure is shown in figures 2.1 and 3.4.

Figure 2.1 Map discretized into grids by Markov localization

The sensor readings obtained by the robot are compared with the sensor readings that would be obtained
at each of these virtual grid positions, and a probability distribution is generated on the basis of the
similarity between the actual and virtual sensor readings. Only the probabilities above a certain threshold
are considered, which leads to cluster-like formations, each of which indicates a virtual position
(hypothesis) of the robot. This is shown in figure 2.2.
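As a rough illustration of this step (the helper name and numbers below are hypothetical, not part of the thesis implementation), a normalized belief over grids can be obtained from per-grid similarity scores between the actual and virtual readings, with the thresholding that produces the cluster-like formations:

```python
def belief_from_similarity(similarities, threshold=0.1):
    """Turn per-grid similarity scores between the actual and the virtual
    sensor readings into a thresholded, renormalized belief distribution."""
    total = sum(similarities)
    belief = [s / total for s in similarities]
    # Keep only probabilities above the threshold (cluster-like formations).
    belief = [p if p >= threshold else 0.0 for p in belief]
    kept = sum(belief)
    return [p / kept for p in belief]

# Four virtual grid positions; two resemble the actual readings strongly,
# so the belief is bimodal (two hypotheses survive).
b = belief_from_similarity([0.9, 0.05, 0.9, 0.05])
```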
Active localization involves moving in directions such that this distribution becomes unimodal, i.e.,
the robot is no longer confused about its position.

Figure 2.2 Multimodal distribution of belief returned by passive Markov localization. W is a wall at the top, disconnected from the rest of the environment. The actual robot position is shown by a thick black dot (third cluster from the left on top), labeled R.

The direction always points to one of the surrounding frontiers from a hypothesis location. These
frontiers at each hypothesis location are computed as in [5]. The reward on reaching the frontier is
calculated by re-estimating the belief state taking the new information into account and then subtracting
the entropies of the former and latter belief states. A positive reward indicates a reduction in entropy
(a strengthening of the belief about the robot's position).
The aim of the framework is to estimate a mapping between the belief states and the direction to
move, such that the expected reward is maximized in the long run. Ideally, the entire grid framework
should compose the input belief state for the MDP. But due to the large number of grids, the belief state
space would be extensive, and update operations with respect to the readings from each grid render the
MDP intractable. Markov localization applies clustering over the probability distributions and makes
the enumeration of hypotheses more manageable. The clusters could be used as input for the MDP, but the
number of clusters that can form might vary depending on the position in the map. Hence the requirement
for a coarser grid-like framework over the clusters.
A hierarchy of grids is built over this underlying fine grid structure; each element of this coarser level
is called a cell. The concept of a cell is shown in figure 2.3. The cell structure is much coarser in
granularity than the grid structure and hence makes an MDP much more feasible. The state space of the
MDP now comprises probability distributions across the cells.
2.2 Function approximator (CMAC)
As stated in the previous section, the aim is to find a mapping between the belief states and the
direction to move, which in turn decides which action to execute. Since the state space is not discrete
(i.e., there can be infinitely many combinations of probability distributions among the cells), it is
impossible to enumerate all the possible belief states. Hence, a function approximator is used to
approximate this mapping (the value function V(b)).

Figure 2.3 Cell built over the fine grid structure in a map
A CMAC (Cerebellar Model Articulation Controller) is a sparse, coarse-coded associative memory
algorithm that mimics the functionality of the human cerebellum. In a CMAC, the input space is quantised
using a set of overlapping tiles, as shown in figure 2.4. For input spaces of high dimensionality, the
tiles form hypercubes. A query is performed by first activating all the tiles that contain the query point.
The activated tiles in turn activate memory cells which contain stored values: the weights of the system.
Summing these values produces the overall output. A change in the input vector results in a change in the
set of activated tiles, and therefore a change in the set of memory cells participating in the
CMAC output. More details about the CMAC implementation can be found in [6].
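A minimal sketch of the tile-coding idea behind a CMAC (a toy one-dimensional version with made-up tiling counts and learning rate, not the implementation of [6]): several overlapping tilings each map the input to one active tile, and the output is the sum of the weights of the active tiles.

```python
class TinyCMAC:
    """Toy 1-D CMAC over the input range [0, 1)."""
    def __init__(self, n_tilings=4, n_tiles=10, lr=0.1):
        self.n_tilings, self.n_tiles, self.lr = n_tilings, n_tiles, lr
        # One weight table per tiling; +1 slot absorbs the offset overflow.
        self.weights = [[0.0] * (n_tiles + 1) for _ in range(n_tilings)]

    def _active(self, x):
        # Each tiling is shifted by a fraction of a tile width, so nearby
        # inputs share most of their active tiles (generalization).
        for t in range(self.n_tilings):
            offset = t / (self.n_tilings * self.n_tiles)
            yield t, int((x + offset) * self.n_tiles)

    def predict(self, x):
        # Output = sum of the weights of all activated tiles.
        return sum(self.weights[t][i] for t, i in self._active(x))

    def update(self, x, target):
        # Spread the error equally over the activated tiles.
        err = target - self.predict(x)
        for t, i in self._active(x):
            self.weights[t][i] += self.lr * err / self.n_tilings

cmac = TinyCMAC()
for _ in range(50):                 # train one point towards the value 1.0
    cmac.update(0.3, 1.0)
```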
2.3 Solving MDP(s)
An MDP is defined as a 4-tuple (S, A, P, R), in which S is the finite set of states, A is the finite set of
actions, P is the transition probability function and R is the reward function. If the transition
probabilities are known, solving an MDP becomes a straightforward computational problem; if they are
unknown, it is a problem for reinforcement learning.
Consider an example: an agent (or robot) exists in a certain environment. The environment consists of
states, and the agent moves between states in this environment. Based on the current state information,
the agent decides which state to move to next. Depending on the agent's action, the environment returns
new state information and some reward. The goal in this problem is to maximize the agent's reward.

Figure 2.4 CMAC with one-dimensional input and output. The first set of layers receiving the input are tiles.

Figure 2.5 Reinforcement learning

The above interaction between the agent and the environment can be described as shown in figure 2.5.
Here, the agent starts at state s_0 and initially executes action a_0. This gets the agent reward r_1 and
takes it to state s_1. At state s_1, it executes action a_1, which gets it reward r_2 and takes it to
state s_2. And so on.
Define a policy π : S → A to be a mapping that determines what has to be done at the current state,
where S is the set of states and A the set of possible actions. The goal is to find a policy that
maximizes the reward. We define V^π(s) to be the value of policy π when starting at state s. We can write:

V^π(s) = E_π[ r_{t+1} + r_{t+2} + r_{t+3} + … | s_t = s ]   (2.1)

a_t = π(s_t)   (2.2)
The sum in equation 2.1 is typically infinite. In a more realistic setting the rewards in the future get
discounted, and equation 2.1 becomes:

V^π(s) = E_π[ r_{t+1} + λ r_{t+2} + λ² r_{t+3} + … | s_t = s ]   (2.3)

where λ ∈ [0, 1) is the decay factor.
Consider an agent that moves in the grid shown in figure 2.6, let us say deterministically. If it moves
to the goal state G, it gets reward 100. If it moves anywhere else it gets reward 0.

Figure 2.6 Grid world example. The agent should learn to reach the goal state G

Let us also assume that the decay factor is 0.9. The reward distribution is shown in figure 2.7.

Figure 2.7 Reward distribution on transitions among the grids

Before we address the issue of finding a policy, we look at how to evaluate a known policy π at a given
state s_t. We know from equation 2.3 that:
V^π(s) = E_π[ r_{t+1} + λ r_{t+2} + λ² r_{t+3} + … | s_t = s ]   (2.4)
       = E_π[ r_{t+1} | s_t ] + λ E_π[ r_{t+2} + λ r_{t+3} + … | s_t ]   (2.5)
       = E_π[ r_{t+1} + λ V^π(s_{t+1}) | s_t ]   (2.6)
The next-state function is:

P(s′ | s, a) = Pr[ s_{t+1} = s′ | s_t = s, a_t = a ]   (2.7)

which is the probability of landing in state s′ at time t+1, when the state at time t is s and the action
executed at time t is a.
The expected reward is:

r(s, a) = E[ r_{t+1} | s_t = s, a_t = a ]   (2.8)

which is the expected reward at time t+1, when the state at time t was s and the action executed was a.
Hence equation 2.6 becomes:
V^π(s) = r(s, a) + λ Σ_{s′} P(s′ | s, a) V^π(s′)   (2.9)
This, as mentioned in the earlier chapter, is the Bellman equation. It is a system of linear equations in
n unknowns, where n is the number of states, and it has a unique solution. Another way is to solve it
iteratively, as shown in equation 1.4. Unfortunately, neither solution is satisfying. For a simpler
approach, we can experiment with the environment to find an approximation to the solution. This is done
during the training phase of the MDP. We start with some initial value of the value function for all
states. We choose to transit to some next state, chosen either at random or by some look-ahead of the gain
that the transition will yield (exploration or exploitation, respectively). The reward for the executed
action is used to adjust the goodness value of the state. This procedure is repeated for different
state-to-state transitions. This constitutes the training phase.
The value function V^π for the above grid problem is found as shown in figure 2.8. An agent in the upper
middle grid knows that it should move right to reach the goal state.
Let us describe the Q value function. Q*(s, a) is the value of starting at state s, executing action a and
from there on following the optimal policy π*. That is:

Q*(s, a) = E[ r_{t+1} + λ V*(s_{t+1}) | s_t = s, a_t = a ]   (2.10)
The iterative version of equation 2.10 is:

Q*(s, a) = r(s, a) + λ Σ_{s′} P(s′ | s, a) V*(s′)   (2.11)
Figure 2.8 The value function solution for the grid world example. The number in each grid depicts the goodness value of that grid, i.e., how beneficial it is to be in that grid/state.
Figure 2.9 The Q value function solution for the grid world example. The number in each grid depicts the goodness value of being in that state and executing the action shown by the arrow.
Intuitively, if Q* is known, the optimal policy π* is known. Q* can be calculated from V*:

π*(s) = argmax_a Q*(s, a)   (2.12)

Q* for the grid example is shown in figure 2.9.
The example explained above describes how to solve a typical MDP. The Bellman equation (2.9) is used when
the model of the environment is known. For environments whose dynamics are unknown, a Temporal Difference
(TD) learning algorithm can be used. In cases like ours, where the reward and the next-state probability
distributions are not known a priori, the TD version of the value function update is used, as shown in
equation 2.13:

V(s) = V(s) + α ( R + γ V(s′) − V(s) )   (2.13)
The obvious advantage of TD methods is that they are naturally implemented in an online and fully
incremental fashion. It has been proved that, for any fixed policy π, a TD algorithm will converge to
V^π in the mean for a constant step-size parameter, if that parameter is sufficiently small.
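The TD(0) update of equation 2.13 can be sketched in tabular form as follows (the state names, reward and step sizes here are made up for illustration; in this framework the table is replaced by the CMAC):

```python
def td0_update(V, s, s_next, reward, alpha=0.5, gamma=0.9):
    """TD(0): V(s) <- V(s) + alpha * (R + gamma * V(s') - V(s))."""
    V[s] += alpha * (reward + gamma * V[s_next] - V[s])

V = {"s0": 0.0, "s1": 0.0}
# Repeatedly observe the transition s0 -> s1 with reward 1; V(s0) should
# approach the fixed point 1 + gamma * V(s1) = 1.0.
for _ in range(100):
    td0_update(V, "s0", "s1", reward=1.0)
```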
2.4 Action space
The action space for the belief-state MDP consists of 4 actions, namely Up, Down, Left and Right. The
next position to move to on executing an action depends on the frontier found in that direction. Imagine
the robot's scanned area being divided into 4 quadrants, as shown in figure 2.10.

Figure 2.10 Action space for the robot. (Axes denote the actions)

The frontier that lies in the UP-RIGHT quadrant is closer to the UP axis and is assigned to the UP action.
Similarly, the frontier that lies in the RIGHT-DOWN quadrant, being closer to the DOWN axis, is assigned
to the DOWN action. On executing an action, the corresponding frontier is reached. The convention followed
in the thesis is: 0 = UP, 1 = DOWN, 2 = LEFT and 3 = RIGHT.
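The assignment of a frontier to an action can be sketched as below (a hypothetical helper based on the frontier's offset from the robot; the axis convention is the one stated above, with UP taken as +y and RIGHT as +x):

```python
UP, DOWN, LEFT, RIGHT = 0, 1, 2, 3

def frontier_action(dx, dy):
    """Assign a frontier at offset (dx, dy) from the robot to the action
    whose axis it lies closest to (UP = +y, RIGHT = +x)."""
    if abs(dy) >= abs(dx):          # closer to the vertical axis
        return UP if dy > 0 else DOWN
    return RIGHT if dx > 0 else LEFT
```

For example, a frontier in the UP-RIGHT quadrant but nearer the vertical axis maps to UP, matching the rule in the text.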
2.5 Rewards
The belief states indicate the entropy of (uncertainty about) a robot's position. Hence, the reward is a
measure of the reduction in entropy (gain in information) on taking an action. The entropy H is calculated
by equation 2.14:

H = − Σ_{i=1}^{n} p(x_i) log_b p(x_i)   (2.14)

Let B be a belief state,

B = [ b(s_1), b(s_2), …, b(s_n) ]   (2.15)

With respect to the belief states, the entropy can be calculated as:

H_B = − Σ_{i=1}^{n} b(s_i) log_b b(s_i)   (2.16)

where H_B is the entropy indicated by the belief state B, and b is the base (e, 10 or 2).

GainInfo = H_B − H_{B′}   (2.17)

where B′ is the next belief state reached after the transition from B, so that a positive gain corresponds
to a reduction in entropy.
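The entropy and information-gain computation can be sketched directly from the definitions (function names are illustrative; natural log is used as the base):

```python
import math

def entropy(belief, base=math.e):
    """Entropy of a belief vector (as in equation 2.16); zero entries
    contribute nothing and are skipped."""
    return -sum(p * math.log(p, base) for p in belief if p > 0)

def info_gain(b, b_next):
    """Reduction in entropy from belief b to the next belief b_next:
    positive when the belief has become more peaked (less uncertain)."""
    return entropy(b) - entropy(b_next)

# A two-hypothesis belief collapsing to a single hypothesis yields
# a positive gain of ln 2.
g = info_gain([0.5, 0.5], [1.0, 0.0])
```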
2.6 Observations
The observations in a belief-state MDP play the role of limiting the possible next belief states after
executing an action, by directly determining the next belief state. The observations in our case are the
sensor readings obtained by the robot.
2.7 Framework formulation
The framework uses a hierarchical belief-state model, as described in section 2.1. A belief state is
estimated at the grid as well as the cell level. The interaction between the levels (cell and grid) is
such that the belief state update occurs at the grid level, but the rewards and the transition function
are computed at the cell level. An MDP is assumed at the grid level, with the following parameters:
MDP_Grid = ⟨ B_Grid, A, P, R, O, T ⟩   (2.18)

where s in B_Grid(s) represents a grid, and B_Grid(s) represents the probability of being in grid s.
B_Grid is the belief state representing the probability of being in each grid, i.e.,

B_Grid = [ b_Grid(s_1), b_Grid(s_2), …, b_Grid(s_n) ]   (2.19)

where b_Grid(s_1) is the probability that the robot is in grid s_1, and b_Grid(s_n) is the probability
that the robot is in grid s_n.

A = action space, consisting of the 4 basic actions ⟨ Up, Down, Left, Right ⟩
T = transition probability, i.e., Pr( B′_Grid | B_Grid, a )
Ω = set of observations obtained
R = reward
O = observation function
The transition from belief state B_Grid to B′_Grid under an action a is deterministic given an
observation o, and defines T, the transition function. It is also denoted as:

B′_Grid = T( B_Grid, a, o )   (2.20)

Experimentally, the computation of the belief state over the grids is done as shown:
Pr( Robot in grid | Readings ) = Pr( Readings | Robot in grid ) · Pr( Robot in grid ) / Σ Pr( Readings | Robot in grid ) · Pr( Robot in grid )   (2.21)

where Pr( Readings | Robot in grid ) is the probability of obtaining the readings Readings when the robot
is in a grid, say s_1, and Pr( Robot in grid | Readings ) is the probability of the robot being in grid
s_1 when the readings obtained are Readings.
If there is no prior belief about the robot's position, the prior Pr( Robot in grid ) is uniform and
cancels out. Hence equation 2.21 becomes:

Pr( Robot in grid | Readings ) = Pr( Readings | Robot in grid ) / Σ Pr( Readings | Robot in grid )   (2.22)

which is basically the probability of obtaining Readings in grid s_1, normalized over the probability of
observing Readings over all grids (Markov localization).
Hence,

B_Grid(s_1) = Pr( Readings | Robot in grid s_1 ) / Σ_{grids} Pr( Readings | Robot in grid )   (2.23)

and the belief state B_Grid is a vector of beliefs for all grids, as shown in equation 2.19.
The belief state at the grid level makes an MDP intractable due to the extensiveness of the belief state
space. As described in section 2.1 (Markov localization), a hierarchical cell level is constructed over
the fine grid structure to make the belief states manageable.
A belief-state MDP over the cells is described as:

MDP_Cell = ⟨ B_Cell, A, P, R, O, T ⟩   (2.24)

where s in B_Cell(s) represents a cell and B_Cell(s) represents the probability of being in cell s. Markov
localization results in the formation of clusters of grids that represent the robot's various virtual
positions, or hypotheses. A cluster's mean and probability form the position and probability of the
corresponding hypothesis. Since the virtual position denoted by the cluster centroid is already
calculated, and so is the probability of that virtual position, the belief state at the cell level is a
mere renormalization of the probabilities at the grid level:

B_Cell(s_i) = ( probability assigned to the centroids lying in cell s_i ) / ( Σ_{s_i} probability assigned to the centroids lying in cell s_i )   (2.25)

B_Cell = [ B_Cell(s_1), B_Cell(s_2), …, B_Cell(s_n) ]   (2.26)
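The renormalization of equation 2.25 can be sketched as follows (the data structure is hypothetical: each hypothesis is a centroid probability tagged with the cell it falls in):

```python
def cell_belief(hypotheses):
    """hypotheses: list of (cell_id, probability) pairs, one per cluster
    centroid. Returns the cell-level belief: the probability mass falling
    in each cell, renormalized over all cells (equation-2.25 style)."""
    mass = {}
    for cell, p in hypotheses:
        mass[cell] = mass.get(cell, 0.0) + p
    total = sum(mass.values())
    return {cell: m / total for cell, m in mass.items()}

# Two centroids fall in cell c1, one in cell c2.
B_cell = cell_belief([("c1", 0.2), ("c1", 0.1), ("c2", 0.3)])
```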
The next belief state update on an action execution is done at the grid level again:

B′_Grid(s′) = O( a, s′, o ) Σ_s B_Grid(s) Pr( s′ | s, a ) / Pr( o | B_Grid, a )   (2.27)

and

Pr( o | B_Grid, a ) = Σ_{s′} O( a, s′, o ) Σ_{s∈S} B_Grid(s) · Pr( s′ | s, a )   (2.28)
Equation 2.28 is derived as shown below. For the belief update in a grid-like framework, let Bel(s_{t−1})
be the belief state at time t−1, s_{t−1} the state at time t−1, and s_t the state at time t. Then:

Pr( o | Bel(s_{t−1}), a ) = Σ_{s_t} Pr( o | s_t, Bel(s_{t−1}), a ) · Pr( s_t | Bel(s_{t−1}), a )
                          = Σ_{s_t} Pr( o | s_t ) · Pr( s_t | Bel(s_{t−1}), a )   (2.29)

Now, Pr( s_t | Bel(s_{t−1}), a ) is:

Pr( s_t | Bel(s_{t−1}), a ) = Σ_{s_{t−1}} Pr( s_t | s_{t−1}, Bel(s_{t−1}), a ) · Pr( s_{t−1} | Bel(s_{t−1}) )
                            = Σ_{s_{t−1}} Pr( s_t | s_{t−1}, a ) · Pr( s_{t−1} )   (2.30)

From derivations 2.29 and 2.30,

Pr( o | Bel(s_{t−1}), a ) = Σ_{s_t} Pr( o | s_t ) · Σ_{s_{t−1}} Pr( s_t | s_{t−1}, a ) · Pr( s_{t−1} )   (2.31)

which corresponds to equation 2.28 above.
After the next belief state update, the belief states at the cell level are computed using equation 2.25.
To express the transition probability in terms of the observation function, we use the conditional
probability axiom:

Pr( A | B, C ) = Σ_D Pr( A | B, C, D ) · Pr( D | B, C )   (2.32)
The transition probability at the cell level is given as:

Pr( B′_Cell | B_Cell, a ) = Σ_{o∈Ω} Pr( B′_Cell | B_Cell, a, o ) · Pr( o | B_Cell, a )   (2.33)

where o ranges over the observations obtainable when in belief state B′_Cell. The transition function is
formulated as:

Pr( B′_Cell | B_Cell, a, o ) = 1   if B′_Cell = f( B_Cell, a, o )   (2.34)
                             = 0   otherwise   (2.35)
Hence equation 2.33 reduces to:

Pr( B′_Cell | B_Cell, a ) = Σ_{o : B′_Cell = f(B_Cell, a, o)} Pr( o | B_Cell, a )   (2.36)

where the function f( B_Cell, a, o ) is the next-state estimation function. The state estimation is done
on two levels. At the grid level, Markov localization is used for state estimation: it matches the
acquired readings with the readings that would be obtained from each grid. Hence the next state is
estimated from equations 2.27 and 2.28. The next belief state B′_Grid is a vector of all such b′(s′),
which is basically what is done in equation 2.22. Once the grid-level beliefs are updated, the cell-level
belief state is obtained using equation 2.25.
The rewards are calculated at the cell level, using the cell-level belief state of equation 2.25, as:

ρ( B_Cell, a ) = Σ_{s∈S} b(s) · R( B_Cell(s), a )   (2.37)

where B_Cell(s) is the probability of being in cell s, and R( B_Cell(s), a ) is the reward obtained by
starting in cell s and executing the action a. R( B_Cell(s), a ) can be computed by finding the
information gain (equations 2.16 and 2.17), where the initial belief state is B_Cell(s) and the final
belief state is the one reached from cell s by executing the action a. An example for the same is shown
in section 2.
The transition function Pr( B′_Cell | B_Cell, a ) is also calculated at the cell level, as a product of
the probability of being in B_Cell(s) and executing an action a.
Given the above equations, the reward (entropy gain) calculated in equation 2.17, and the transition
probability, the Q-function equation is:

Q( B_Cell, a ) = ρ( B_Cell, a ) + λ Σ_{s′} Pr( B_Cell(s′) | B_Cell(s), a ) · V( B_Cell(s′) )   (2.38)
The action selected is the one that maximizes equation 2.38:

a = argmax_a Q( B_Cell, a )   (2.39)

The final value function update using a TD learning algorithm is given as:

V( B_Cell ) = V( B_Cell ) + α [ R( B_Cell, a ) + λ V( B′_Cell ) − V( B_Cell ) ]   (2.40)
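The greedy action selection and the TD value update can be sketched together as below (scalar stand-ins for the Q values and value estimates; in the framework the value estimate comes from the CMAC rather than plain numbers):

```python
def select_action(q_values):
    """Greedy action selection (equation-2.39 style): pick the index of
    the highest Q value among UP(0), DOWN(1), LEFT(2), RIGHT(3)."""
    return max(range(len(q_values)), key=lambda a: q_values[a])

def value_update(v_b, v_b_next, reward, alpha=0.1, lam=0.9):
    """TD update of the cell-level value estimate (equation-2.40 style)."""
    return v_b + alpha * (reward + lam * v_b_next - v_b)

a = select_action([0.2, 1.5, -0.3, 0.9])   # DOWN has the highest Q value
v = value_update(0.0, 1.0, reward=0.5)
```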
Chapter 3
Multi-robot localization
3.1 Problem statement
Given a map and a workspace consisting of n robots, each of which has multiple hypotheses about its
position, devise a learning policy such that the localization of these robots is accelerated by taking
advantage of unique map features and robot-robot detections.
3.2 Related Work
In general, work on active localization has been limited when compared to passive localization. Pioneering
work was done in [8] and [9]. In [8], a method of active localization based on maximum information gain
was presented. Dudek et al. [9] presented a method for minimum-distance traversal for localization in
polygonal environments without holes, a problem they showed to be NP-hard. In [10], the environment's
ambiguity in terms of observations and the robot's ambiguity in pose are modelled as a POMDP, and an
optimal strategy to move to the target location is devised. In [11], this method is extended by estimating
actions which allow the robot to improve its position estimate.
All the above methods tackle the problem for a single robot. [12] devises a Bayesian approach to enable
the robots to calculate the relative poses of every other robot. Since it only estimates relative poses,
it is unable to perform global localization. [13] uses Partially Observable Stochastic Games to solve a
multi-agent co-operative problem in a fully distributed fashion. Since it is decentralized, the cross
product of the action and observation spaces is considered, which increases the complexity exponentially
with the number of robots. In [14] the problem of large joint action and state spaces is mitigated by
using a multi-agent MDP that reduces the search space for an optimal action set by considering just the
upper and lower bounds of the single-agent MDPs. But this method requires isolated training of each agent
separately before undertaking the task as a team. The method presented here enables all the agents to be
trained simultaneously, leading to faster convergence. The only known optimal solution to the above
problem has been provided by [4] and [15]. This work is a learning extension of them, the improvement
being its semi-distributed nature as opposed to the completely centralized nature of the former. Since the
agents are loosely coupled, they take advantage of robot-robot detections if applicable, and continue
trying to localize with the help of the environment if not. This also brings in a certain level of
autonomy with respect to the number of agents.
3.3 Problem description
In highly symmetric environments (both indoors and outdoors), like cubicles in an office (figure 3.1),
trees in open spaces (figure 3.2) or warehouses (figure 3.3), the sensor readings for one position are
replicated at various positions in the map. This results in multiple hypotheses for a robot's pose.

Figure 3.1 Cubicles in an office

Figure 3.2 Trees in an open space

As explained in chapter 2, the belief state is estimated and updated at the grid level and aggregated at
the cell level. This chapter concentrates on the algorithm at the cell-level MDP. The belief states at the
cell level constitute the state space of the MDP.
Consider a workspace populated by robots, each of them having multiple location hypotheses. A probability
distribution over all location hypotheses of a robot constitutes a belief state for that robot. The
problem is to move these robots to locations such that each robot localizes in an optimal fashion.
Figure 3.3 Scene of a warehouse

Frontiers [5] are boundaries separating the seen and the unseen. The frontiers are considered good places
to move to, since they are easy to compute and provide a sufficient set of places to move to in order to
converge to a single hypothesis. The orientation is considered to be known through a compass installed on
the robot. The robots can detect each other with the help of an IR transceiver, which provides the angle
and distance to the detected robot. This experimental setup has been tested and can safely be assumed to
work quite robustly.
The action space consists of actions that lead to the different frontiers visible from a position. The
reward is calculated intuitively on the basis of the information gain on the transition from one belief
state to another. This aims at reducing entropy. To avoid enumerating all possible belief states, a
CMAC [6] is used to implement the value function. It in turn acts as a function approximator for the
belief states which do not get visited during training.
3.4 Algorithm
The algorithm is semi-distributed in nature. Most multi-agent learning algorithms suffer from high
computational complexity because of the joint state and action spaces of the robots, which grow
exponentially with the number of robots. The robots in this framework are loosely coupled to each other.
They update the same value function, but their information about other robots in the environment is
limited to robot-robot detections. Thereby, the robots do not need to know the positions of all other
robots in the map, nor the actions they take. In this way, the method provides an effective solution for
learning from the experiences of the other robots while avoiding the large state and action spaces at the
same time. This feature of the algorithm also captures the robot-robot interaction effectively, hence
driving the robots to move to frontiers where the probability of finding other robots is maximal.
The pseudocode for the algorithm is provided below. The notation used is:

b(sn) → the nth element, sn, of the belief state vector b
Rmap → the reward due to map information
b′ → the next belief state (b-prime)
V(b) → the goodness value of the belief state b
Rmap+(rr)detection → the reward due to map information plus the information gain due to robot-robot detection
λ → what part of history should affect the present action selection
γ → discount factor: how valuable future reward is compared to present reward, where 0 ≤ γ ≤ 1
α → constant step-size parameter
1. Make observations, compute the belief state b.
2. Calculate the projected gain that each action a would yield:

Q(b, a) = Σ_{n∈N} b(sn) · ( Rmap + λ · V(b′) )   (3.1)

3. Choose the action a which has the highest Q value.
4. Execute a.
5. If another robot is found:
(a) Update the belief state.
(b) Calculate the reward Rmap+(rr)detection.
6. Update V(b):

V(b) = V(b) + α · ( Rmap+(rr)detection + λ · V(b′) − V(b) )   (3.2)

7. Repeat until the robot is localized.
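The steps above can be sketched as a single robot's loop. Everything here is schematic: `sense`, `candidate_moves`, `detect_robots`, `localized` and the stub robot are hypothetical stand-ins for the components described in this chapter, and the shared value function `V` is a plain dict keyed on a discretized belief rather than a CMAC.

```python
def key(belief):
    """Hypothetical discretization of a belief vector into a lookup key."""
    return tuple(round(p, 2) for p in belief)

def localize(robot, V, alpha=0.1, lam=0.9, max_steps=100):
    """Schematic single-robot loop for steps 1-7 of the algorithm."""
    b = robot.sense()                                        # step 1
    for _ in range(max_steps):
        moves = robot.candidate_moves(b)                     # frontiers
        # Step 2: projected gain Q(b, a) = R_map + lambda * V(b').
        q = {a: r + lam * V.get(key(nb), 0.0) for a, (nb, r) in moves.items()}
        a = max(q, key=q.get)                                # step 3
        b_next, reward = moves[a]                            # step 4
        b_next, reward = robot.detect_robots(b_next, reward) # step 5
        k = key(b)                                           # step 6: TD update
        V[k] = V.get(k, 0.0) + alpha * (
            reward + lam * V.get(key(b_next), 0.0) - V.get(k, 0.0))
        b = b_next
        if robot.localized(b):                               # step 7
            return True
    return False

class StubRobot:
    """Toy two-hypothesis robot: moving 'south' always localizes it."""
    def sense(self):               return [0.5, 0.5]
    def candidate_moves(self, b):  return {"south": ([1.0, 0.0], 0.7),
                                           "north": ([0.5, 0.5], 0.0)}
    def detect_robots(self, b, r): return b, r
    def localized(self, b):        return max(b) > 0.95

V = {}
done = localize(StubRobot(), V)
```

In the multi-robot setting, each robot runs this loop with its own belief while all of them share the same `V`, which is what step 6 jointly updates.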
A set of robots is set up in the environment. The above algorithm is executed on all the robots
independently, except for step 6. The robots start by sensing their surroundings and calculating the
belief state, as explained in the previous chapter. The robots then perform a one-step look-ahead by
calculating the action which would yield the maximum information gain. Since it is a look-ahead, the only
information gain available is due to the map features, hence the reward Rmap. The robots execute the
action which seems most beneficial (step 3). Now the robot looks for any possibilities of robot-robot
detections. If any robot is found in the vicinity, the angle and distance from that robot are supplied.
With this information, the robot tests all its hypotheses and checks which of them are still feasible
(for example, if one of the hypotheses suggests that the robot is detected beyond a wall, discard it).
The new belief state is updated and the reward calculated. Since this reward is due to both the map
features (from step 2) and the presence of other robots, it is termed Rmap+(rr)detection. This reward is
jointly fed back by all robots to the CMAC. The value function Vπ(b) is estimated using the TD(0)
(temporal difference with one-step look-ahead) method, as shown in equation 3.2. This compares the
predicted goodness value for the next state (one step ahead in time) with the present estimate of the
value function Vπ and adjusts it accordingly. The value function estimator, the CMAC, is shared by the
robots, i.e., all robots update the same copy of the CMAC. This leads to faster convergence, as for the
same iteration the training set is multiplied by the number of robots (the belief states are independent
of the robots they belong to). The above set of steps is repeated until all robots get localized. The
training runs consist of various initial configurations for varied numbers of robots.
To avoid getting stuck in local minima, at step 3 the ε-greedy exploration strategy is adopted, with a
slight modification. One could use strategies like softmax action selection, but ε-greedy is used for
computational convenience. The value of ε is high in the initial training iterations but is decreased as
the training proceeds. This is done to avoid unlearning: since a higher frequency of low rewards can
outweigh a low frequency of high rewards, it is ensured that random actions are not executed too
frequently.
The state value function V(b) is estimated instead of the state-action value function Q(b, a). This has
twofold benefits. First, it reduces the computational complexity, since Q(b, a) would require maintaining
multiple CMACs, one for every kind of action. Second, by estimating V(b), one can assimilate the
experience and capture the interaction of the multiple robots effectively. So, in a way, the method tries
to estimate a goodness value for every belief state, guiding the robots towards belief states with higher
goodness values.
The immediate feedback of a (state, action) pair is incorporated with the help of the one-step look-ahead
in step 2 of the algorithm.
3.5 Simulation Snapshots
Different stages of the algorithm are shown in the snapshots. 3.4 shows the initial set of hypotheses
of the robot obtained after the initial scan. Each green cluster denote a virtual position or hypothesis. 3.5
shows the sensor scan from robot’s each virtual position. The area in green shows theseenarea within
the sensor range. The boundaries in the red are the frontiers, which seperate theseenfrom theunseen.
The next position to move to, is chosen from the frontier. 3.6 shows the actionwith the highest Q value
being executed. All hypotheses are moved by the distance with respect to the frontier corresponding to
the selected action. The yellow point denotes the actual position of the robot.On executing the action,
the robot detects other robots and is able to make his belief state unimodal. The robot is hence localized.
3.6 Experiments and Results
This section describes the different kinds of tests performed and their corresponding results and
observations.
Figure 3.4 Initial hypotheses of the robot
Figure 3.5 Sensor scan obtained from each virtual position
3.6.1 On different kinds of maps
The method was tested on different kinds of maps to verify the proof of concept. These maps were designed
in order to test the algorithm in practical situations. The learning patterns for different kinds of maps
are shown in figures 3.8 and 3.9.
In figure 3.8, if the robots are present in section A, which is in the top left half of the map,
intuitively the best action to take is south, since it leads to the unique features of the map that would
help localization in the absence of the other robots. On the other hand, if all robots are guided south,
the probability of robot-robot detections is also high. The algorithm learns to choose the action south if
the robot is in any of the cubicles in map sub-section A.
In section B of figure 3.8, which has no unique features, the algorithm learns to move the robots in such
a direction (shown in the figure) that there are more robot-robot detections. In cases where there were
not enough robots in the opposite row of cubicles, the robots learn to move towards the sides.
Figure 3.6 Execute the decided action.
Figure 3.7 Robot localized (only one hypothesis left)
In section C of figure 3.8, the robots clearly learn to move towards the obstacles. This is due to the
fact that there is a unique feature outside every cubicle, which surpasses the gain due to robot-robot
detections.
In section D of figure 3.8, the learning pattern is the most intuitive. The robots first move towards the
obstacles, i.e., north for the upper row of cubicles and south for the lower row. This is because of the
huge gain yielded by the action towards the obstacles when the robot was in one of the corner cubicles and
got instantly localized. If the robot still remains unlocalized, it moves towards the space between two
cubicles in the hope of finding other robots. If it still remains unlocalized, it moves sideways to get
localized.
The above pattern was verified by a test case consisting of 7 robots, one in every row of cubicles. The
corresponding pattern for every section stated above was observed.
Figure 3.8 Map A (arrows show the direction preferred by the robots)

As can be seen in the map shown in figure 3.9, localization is challenging since there are almost no
unique map features to aid it. Hence it serves as an ideal platform to test the algorithm. The robots
learn to move in the directions shown by the arrows. The space between the cubicles is where the
probability of robot-robot detections is highest. When there are no other robots, the frontiers on the
sides are selected, which lead the robots to the end of the row, where they get localized with the help of
the wall. A similar pattern has been observed in Map A (figure 3.8). Since the learning pattern is
consistent over different kinds of maps, this suggests that the method would work on general maps.
3.6.2 With different numbers of robots
The second test performed was to show the flexibility of the framework to infer the results for different
numbers of robots without re-training. The framework, if trained for an appropriate number of robots,
can be used for a varied range of numbers of robots without re-training each time. To prove this, Map
B (Figure 3.9) was trained with test cases varying from 20 robots (one in each cubicle) down to 1 robot. The
objective is to train the framework to look out for map features when there are no robots around to aid
localization, and to move to locations of higher robot-robot detections as a default measure. The former is
achieved by training it extensively with a small number (1) of robots and the latter by training with a large
number of robots (20). The testing is done with a range of numbers of robots, and the average number of
iterations required to localize a robot is recorded. Since the number of iterations depends a lot on the
initial position of the robot (e.g., w.r.t. Map B, the best case for 2 robots is when they are in corner
and opposite cubicles), the worst and best cases for a certain number of robots were considered and their
average was calculated.
Figure 3.9 Map B (arrows show the direction preferred by the robots)

The trend is plotted in the graph (Figure 3.10). The graph shows that on average the number of
iterations required to localize a robot decreases with the increase in the number of robots. This shows
that the framework learns to take advantage of the presence of other robots to improve its belief. On
the other hand, it does not get handicapped in the absence of other robots and uses the map features
available to localize. This, as a matter of course, also proves that the framework need not be re-trained
every time the number of robots (that need to be localized) changes. The training sample space should
consist of test cases with appropriate numbers of robots, after which the framework can handle the entire
range of robots in between.
Figure 3.10 Trend showing the average number of localization iterations vs. number of robots
3.6.3 With different cell sizes
The localization performance was analyzed by changing the cell sizes of the framework. Intuitively,
it can be inferred that a lot of parameters depend on the cell size, e.g., the support of the belief state,
the reward computation, the sensitivity of V(b) to the negative feedback, etc. Table 3.1 shows the effect
of cell size on the framework parameters for the robot position shown in figure 3.11, for one of the iterations.
The robot in this position has two hypotheses (H1 and H2), each of which has two frontiers (F11, F12
and F21, F22) as shown in Figure 3.12. The reward is calculated for the transition to every frontier by
calculating the entropy difference between the present belief state and the next belief state (when at the
frontier, i.e., b′). Similarly, the field V(b′) in table 3.1 shows how good this next belief state is. Table 3.2
shows the final Q-values calculated using equation 2.38 for the actions, and the action finally selected.
Figure 3.11 The position of the robot referred to in table 3.1
As seen in table 3.1, for the same position but different cell sizes, the rewards and V(b′) change.
This in turn changes the Q-values corresponding to the actions, as shown in table 3.2. This affects the action
selected by the robot at this iteration. On a collective basis, the actions selected by all robots may or may
not change, and they might result in the robots detecting or not detecting each other. Since robot-robot detections
play a crucial role in active localization, the localization performance is directly affected by the
change in cell sizes.
Having confirmed that localization performance is affected by the change in cell size, the next step is to
analyze the relation between the two. The localization performance (PL) is defined as the reciprocal of the
average number of iterations required to localize one robot. The localization performance with respect
to cell sizes is shown for the two maps in figures 3.13 and 3.14. The localization performance is also
compared with the corresponding random walk performance. Random walk is implemented by taking
random actions at every belief state and ignoring the feedback rewards. The action selection in a random
walk does not depend on the rewards, and hence not on the cell size, but the next belief state estimation due to the
Position: (225,150)

Cell Size (width × length)   Frontier   Reward   V(b′)
87 × 85                      F11        0.99     1.44
                             F21        0.99     1.42
                             F12        0.024    1.47
                             F22        0.048    1.46
116 × 102                    F11        0.99     0.766
                             F21        0.99     0.777
                             F12        0.024    0.825
                             F22        0.048    0.75
139 × 127                    F11        0.99     0.804
                             F21        0.99     0.784
                             F12        0.99     0.804
                             F22        0.048    0.788
174 × 170                    F11        0        0.407
                             F21        0        0.383
                             F12        0        0.417
                             F22        0        0.412
232 × 203                    F11        0        0.242
                             F21        0        0.331
                             F12        0        0.265
                             F22        0        0.253

Table 3.1 For robot position (225,150) as shown in 3.12, rewards and V(b′) with respect to the cell size.
Position: (225,150)

Cell Size (width × length)   Q-values                 Action selected
87 × 85                      (0, 2.27, 1.355, 0)      1
116 × 102                    (0, 1.68, 0.745, 0)      1
139 × 127                    (0, 1.71, 1.235, 0)      1
174 × 170                    (0, 0.355, 0.373, 0)     2
232 × 203                    (0, 0.258, 0.233, 0)     1

Table 3.2 For robot position (225,150) as shown in 3.11, Q-values and action selected with respect to the cell size.
Figure 3.12 Hypotheses and frontiers of the robot referred to in table 3.1
Cell Size (width × length)   CMAC Tilings   PL (Policy learning)   PL (Random Walk)
87 × 85                      128            0.429                  0.24
116 × 102                    64             0.375                  0.273
139 × 127                    64             0.273                  0.188
174 × 170                    32             0.33                   0.316
232 × 203                    16             0.375                  0.353

Table 3.3 Localization performance with respect to change in cell sizes for Map A as shown in figure 3.11
execution of an action depends on the cell size. The corresponding results are tabulated in tables 3.3 and 3.4
respectively.
PL = 1 / Iterations_localization (3.3)
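As a small sketch of how the tabulated values are obtained (the function and argument names are illustrative, not from the thesis), equation 3.3 combined with the best/worst-case averaging described in section 3.6.2 gives:

```python
def localization_performance(best_case_iters, worst_case_iters):
    """Localization performance P_L (equation 3.3): the reciprocal of the
    average number of iterations required to localize one robot, where the
    average is taken over the best- and worst-case starting positions."""
    avg_iters = (best_case_iters + worst_case_iters) / 2.0
    return 1.0 / avg_iters
```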
The general trend for both maps indicates that, for a given map, the localization performance
reaches its optimum at a certain cell size, and the cell size at which the performance peaks is generally
among the smaller ones. Ostensibly, the reason for this is that a finer grid captures the
probability distribution more accurately than a coarser one and hence learns better.
Figure 3.13 Localization performance with respect to cell sizes
Cell Size (width × length)   CMAC Tilings   PL (Policy learning)   PL (Random Walk)
87 × 85                      128            0.333                  0.24
116 × 102                    64             0.375                  0.261
139 × 127                    64             0.35                   0.269
174 × 170                    32             0.259                  0.25
232 × 203                    16             0.241                  0.175

Table 3.4 Localization performance with respect to change in cell sizes for Map B
3.6.4 With different kinds of clustering
Since the framework concentrates on using hierarchical belief states, the method of clustering is
significant. The clustering should be such that the coarser (cell-level) belief state represents the
information in the finer (grid-level) belief state with minimum loss of information. The method of clustering
discussed in chapter 2 involves finding the centroid for a cluster of grids, calculating the weighted
probability for the centroid and assigning it to the cell the centroid lies in (refer to equation 2.25). Let us
call it clustering method A by convention.
This method, though computationally desirable, might lead to erroneous cell assignment, since the centroid
will always be polarised towards the part of the cluster with grids of higher probabilities, and the
information contained in the grids with lower probabilities is lost completely. It might make more sense
to assign the aggregate probabilities of the grids which lie in a certain cell rather than trying to constrain
Figure 3.14 Trend showing localization performance vs. cell size (Map B)
the whole cluster information to just one point (centroid). Hence, for this clustering method, the cell
belief state estimation (i.e., equation 2.25) becomes:

BCell(si) = Σ_{Grids ∈ si} (probability of the grid) / Σ_{∀ Grids} (probability of the grid) (3.4)

BCell = [BCell(s1), BCell(s2), ..., BCell(sn)] (3.5)
Let this be clustering method B.
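A minimal sketch of clustering method B, assuming grid-level probabilities in a NumPy array and a precomputed grid-to-cell assignment (all names here are illustrative, not from the thesis implementation):

```python
import numpy as np

def cluster_method_b(grid_belief, cell_of_grid, n_cells):
    """Clustering method B (equation 3.4): each cell's belief is the sum of
    the probabilities of all grids lying in that cell, normalized by the
    total probability over all grids. No centroid is computed, so even
    low-probability grids still contribute to their cell."""
    cell_belief = np.zeros(n_cells)
    for p, cell in zip(grid_belief, cell_of_grid):
        cell_belief[cell] += p
    return cell_belief / grid_belief.sum()
```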
It can be deduced that this type of clustering will increase the support of the belief states, thereby
improving the navigation policy. The localization performance for a given map (Map A) using clustering
method B was compared with that of clustering method A for different cell sizes.
3.6.5 Comparison with centralized method
This method, being semi-distributed, is compared with [4] and [15] for complexity and
performance. It was found that [4], being a centralized method, gives an optimal performance
but suffers the disadvantages of any centralized method in terms of complexity. Figure 3.16
shows the comparison in performance with the aforementioned centralized method. As can be inferred
from the figure, the centralized method gives an optimal performance, whereas this method, being
semi-distributed, is able to give a sub-optimal performance.
Cell Size (width × length)   CMAC Tilings   PL (Clustering method A)   PL (Clustering method B)
87 × 85                      128            0.43                       0.46
116 × 102                    64             0.38                       0.38
139 × 127                    64             0.273                      0.33
174 × 170                    32             0.33                       0.375
232 × 203                    16             0.375                      0.375

Table 3.5 Localization performance with respect to change in cell sizes and clustering methods (Map A)
Figure 3.15 Localization performance vs. clustering methods for various cell sizes (Map A)
On the other hand, the complexity analysis showed that the semi-distributed method is a more viable
option compared to the centralized one, since the latter is exponential with respect to the number of
robots, whereas the semi-distributed method is polynomial with respect to the number of cells, the
number of robots and the number of robots detected.
3.7 Observations
This section describes some observations made during the experiments, along with some general remarks
about the framework and the problem tackled.
Figure 3.16 Number of iterations to localize per robot, compared for semi-distributed and centralized methods
3.7.1 Self-balancing property of the framework
An interesting observation was made regarding the rewards that were used as feedback for the
CMAC. The reward is calculated as the information gain, which is basically the difference in entropies.
For actions which yield no increase in information gain, the difference in entropies is a small negative
number instead of zero. According to the property of Markov localization, which assumes that there are
no negative information gains, the negative rewards should be set to zero before being fed back to the CMAC.
But if this quantity is fed back in its pristine form, it acts as negative feedback that prevents the robots
from getting stuck in a prolonged and recursive loop of actions not leading to any progress.
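A sketch of the reward computation described above, assuming belief states stored as probability arrays (the helper names are illustrative):

```python
import numpy as np

def entropy(belief):
    """Shannon entropy of a belief state; zero-probability cells are skipped."""
    p = np.asarray(belief, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

def information_gain_reward(b, b_next):
    """Reward as information gain: entropy(b) - entropy(b').
    When an action yields no gain, this is a small negative number; it is
    fed back unclipped, so loops of unproductive actions get penalized."""
    return entropy(b) - entropy(b_next)
```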
One such situation has been shown in figure 3.17. The robot starts from state S4, executes action 0
and transits to state S5. The robot then executes action 1 and ends up in state S4. It remains stuck
in this loop for 4 iterations, until action 3 is selected and the robot transits to S6 and localizes. The
corresponding Q(a) values have been plotted in figure 3.18. With reference to the graph in Figure 3.18,
at time instant 1, the robot is in state S4 and Q(0) is greater than Q(3). Hence the robot selects action
0 and transits to state S5. At time instant 2, the robot is in state S5 and executes action 1 to reach
state S4 again. At time instant 3, the difference in the Q-values of actions 0 and 3 decreases, but Q(0)
is still greater than Q(3). Hence action 0 is selected again. At time instant 5, the robot is in state S4 and
Q(3) finally surpasses Q(0). Action 3 is executed and the robot reaches state S6, where it localizes. This
happens because of the negative feedback that gets added during the transitions between S4 and S5.
Figure 3.17 State transitions (red numbers above the arrows indicate the action taken)
Figure 3.18 Variation in the Q(a) behavior with respect to the state transition
Chapter 4
Transfer Learning
4.1 Problem statement
Given two robots with different sensor ranges, one of the robots learns a policy to perform active
localization in a given map. The policy should be transferred to the other robot in spite of the different
sensor ranges, such that the latter does not have to learn to localize in the map from scratch.
4.2 Related Work
Pioneering work in active localization has been done by [8]. Fox et al. used an entropy minimization
methodology to reduce the uncertainty of a robot's pose in a grid-based map. An optimal method for
co-ordinated active localization for multiple robots was proposed by [4] and [15]. A POMDP-based
framework was suggested by [18] for mobile robot navigation in an indoor environment. A belief-state
MDP was used by [19] to perform co-operative active localization for multiple robots. The idea of
transferring the knowledge for an active localization task can be toyed with. Its immediate advantage would
be that if the target robot is identical to the source robot, re-training from scratch can be avoided for
a given map. The other useful application that can be thought of is transferring such knowledge across
heterogeneous robots, with different sensor ranges, say.
The earliest motivation behind transfer learning was enabling the robot to learn indefinitely. This idea was
discussed in detail in [20] by Thrun and Mitchell. They show the transfer of knowledge across similar
kinds of tasks, assuming that the robot and its sensor capabilities remain the same, while the environment
and goals might slightly alter across tasks. [21] used the same method of EBNN (Explanation-Based
Neural Networks) to recognize new objects in an already familiar environment. A lot of transfer
learning has been used in robot navigation. [22] describe relational policies for transfer in spatial
environments and demonstrate significant learning improvements. [23] introduced skills, which collapse a
sequence of actions into one single operation and hence find reusable structural information in the map
for navigation. Transfer learning across dissimilar state and action spaces (domains) finds application
in robot-soccer-like applications. [24] present an algorithm that summarizes a learned policy
into rules that are leveraged to achieve faster learning in a target task in another domain. Similar work
to transfer learned policies across dissimilar state and action spaces was presented in [25]. This
concentrated on transferring policies from one domain to another via neural-network-based action selectors.
[26] used transfer learning in WILP (WiFi-based Indoor Localization Problem) for regression modelling
using Hidden Markov Models.
The work in this part of the thesis is novel in multiple respects. Firstly, this is the only work
done on studying transfer across different observation functions. The above-cited works have considered
transfer across different combinations of similar and dissimilar action and state spaces, but with the
assumption that the agents involved are identical in nature. Secondly, this is the only work that talks
about using transfer learning in the active localization context. Thirdly, this is the first time a transfer is
attempted across belief-state MDPs. The latter is attested by the survey paper [27].
4.3 Problem description
Consider a scenario where a system of autonomous robots needs to perform a navigation-based task
in a given map. The system is a heterogeneous mixture consisting of some robots with a laser range
finder and others with an array of sonar sensors. The basic difference between an LRF (Laser Range
Finder) and a sonar is the sensing range, as shown in figure 4.1. There are other differences too, but
only differences with respect to the sensing range are considered. A navigation-based task requires a
robot to know its global pose in the map. In such a scenario, instead of localizing each robot separately,
it will be more efficient to localize one of them and transfer the knowledge of how to do it to the
rest. This transfer should be consistent across the change in sensor ranges. As seen in figure 4.1,
the same environment will be perceived differently due to the change in the sensing range. A transfer
should enable the other robots to avoid re-training from scratch for a given map. Hence,
the objective is to transfer the knowledge about the task of active localization (where should one move
to localize faster, as described in chapters 1 and 3) across robots with different sensor ranges. The
following sections present a preliminary analysis and some results for transfer learning across belief-state
MDPs.
4.4 Background on Transfer learning
The core idea of transfer is that experience gained in learning to perform one task can help improve
learning performance in a related, but different, task. The insight behind transfer learning (TL) is that
generalization may occur not only within tasks, but also across tasks. Transfer techniques assume
varying degrees of autonomy and make many different assumptions. To be fully autonomous, an RL
transfer agent would have to perform all of the following steps:

1. Given a target task, select an appropriate source task or set of tasks from which to transfer.

2. Learn how the source task(s) and the target task are related.

3. Effectively transfer knowledge from the source task to the target task.

Currently, no TL methods are capable of robustly accomplishing all three goals.

Figure 4.1 Difference in sensing ranges of Sonar and Laser Range Finders
For every kind of transfer, one requires to know the following parameters or dimensions:

1. Task difference assumption: Which parameters are similar and which differ between the source
and the target tasks. Examples of things that might differ include the system dynamics (transition
function), or different sets of states and actions, etc.

2. Is there one source task or many? Is the source task(s) for transfer already selected or does it need to
be selected? This is mainly a human-intervened task and should be done as accurately as possible
to prevent negative transfer.

3. Task mappings: Knowing how a source and target task are related. Mappings can be manually
provided by a human or found computationally.

4. Transferred knowledge: What kind of knowledge is transferred? Is it the entire policy or a
partial policy? Or the rewards? Or a set of actions? Different kinds of knowledge may transfer
better or worse depending on task similarity.

5. Does the TL method place restrictions on which RL algorithm can be used? Some TL methods
require that the source and target be learned with the same method. Others allow a class of
methods to be used in both tasks.
TL methods are particularly relevant in MDPs that have a large or continuous state, as these are the
problems which are slow to learn tabula rasa and for which transfer may provide substantial benefits.
[27] gives a detailed account of different kinds of dissimilarities between the source and target tasks
(e.g., same state-action space but different transition functions) and transfer techniques that can be
used for the aforementioned cases. The particular subset of problems that we will concentrate on is one
source task to one target task, with the same state-action space, since the transfer scenario in our problem
falls into that category. The terms tasks and MDPs will be used interchangeably in this chapter.

Allowed Task Differences
a    action set may differ
p    problem space may differ (agent space must be identical)
r    reward function may differ
si   the start state may change
sf   goal state may move
t    transition function may differ
v    state variables may differ
#    number of objects in state may differ

Table 4.1 Key for Allowed Task Differences, with reference to table 4.6
In one of the earliest TL works for RL, Selfridge et al. [28] demonstrated that it was faster to learn to
balance a pole on a cart by changing the task's transition function, T, over time. The learner was first
trained on a long and light pole. Once it successfully learned to balance the pole, the task was made
harder: the pole was shortened and made heavier. The total time spent training on a sequence of tasks
and reusing the learned function approximator (Q-values) was less than the time spent training on the hardest task
directly. [29] uses the same technique of incremental learning but achieves it by moving the initial state
further and further away from the goal for a maze-solving-like problem.
Transfer between tasks in which only the reward function can differ is considered by [30], [33] and
[31]. The method of [30] successfully transfers a locally weighted regression model of the transition function,
which is learned in a source task, by directly applying it to a target task. [33] uses a technique called
compositional learning and breaks down a single task into a series of temporally separated smaller tasks.
[31] proposed a model of Hierarchical Bounded Parameter SMDPs, so that a partial policy can be
transferred from the source task to the target task.
Other works performing transfer over tasks with combinations of different reward functions, transition
functions and states are listed in table 4.6, cited from [27]. The details of the cited works can be found
in the Bibliography. Many metrics to measure the benefits of transfer are possible:

1. Jumpstart: The initial performance of an agent in a target task may be improved by transfer
from a source task.

2. Asymptotic performance: The final learned performance of an agent in the target task may be
improved via transfer.

3. Total reward: The total reward accumulated by an agent (i.e., area under the curve) may be
improved.
Transferred Knowledge
A      an action set
fea    task features
I      experience instances
model  task model
π      policies
πp     partial policies (e.g., options)
pri    distribution priors
pvf    proto-value functions
Q      action-value function
R      shaping reward
rule   rules or advice
sub    subtask definitions

Table 4.2 Key for Transferred Knowledge, with reference to table 4.6
Source Task Selection
all    all previously seen tasks are used
h      one source task is used (human selected)
lib    tasks are organized into a library
mod    a human provides a source task that the agent automatically modifies

Table 4.3 Key for Source Task Selection, with reference to table 4.6
4. Transfer ratio: The ratio of the total reward accumulated by the transfer learner to the total
reward accumulated by the non-transfer learner.

5. Time to threshold: The learning time needed by the agent to achieve a pre-specified performance
level may be reduced via knowledge transfer.
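Two of these metrics can be computed directly from per-episode reward curves; a sketch (the curve names are illustrative):

```python
def transfer_metrics(rewards_with_tl, rewards_without_tl):
    """Jumpstart and transfer ratio from two equal-length reward curves:
    jumpstart      -> difference in initial (first-episode) performance
    transfer ratio -> total reward with transfer / total reward without"""
    jumpstart = rewards_with_tl[0] - rewards_without_tl[0]
    transfer_ratio = sum(rewards_with_tl) / sum(rewards_without_tl)
    return jumpstart, transfer_ratio
```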
4.5 Transfer scenario in our case
Depending on what kind of knowledge is transferred across the tasks, transfers are divided
primarily into two kinds: structural transfer and experience transfer. Structural transfer involves
the transfer of task representation characteristics, like the state or the action space. Experience
transfer, on the other hand, is the transfer of experience gathered over the training runs.
This involves the transfer of a value function, a policy, or directly samples.
In our case, the transfer is proposed across robots with different sensor ranges. A change in
sensor range is directly coupled to a change in the observation function. The observation function
Allowed Learners
B      Bayesian learner
Batch  batch learner
CBR    case-based reasoning
H      hierarchical value-function learner
LP     linear programming
MB     model-based learner
PS     policy search learner
RRL    relational reinforcement learning
TD     temporal difference learner

Table 4.4 Key for Allowed Learners, with reference to table 4.6

TL Metrics
ap   asymptotic performance increased
j    jumpstart demonstrated
tr   total reward increased
tt   task learning time reduced

Table 4.5 Key for TL Metrics, with reference to table 4.6
is defined as in equation 4.1:

O(s′, a, o) = P(o_{t+1} = o | s_{t+1} = s′, a_t = a) (4.1)

where it describes the probability of getting observation o, given that the agent executed action
a and landed in state s′.
The source and target robots have different sensor ranges, and hence for the same position they
will perceive their environment differently, i.e., the observation functions for the source and the target
robot differ. Since the POMDP is reduced to an MDP over belief states, the belief-state space is
the same for both robots, because the location of both robots is the same. The situation is
depicted in Figures 4.2 and 4.4. The corresponding belief states are shown in section 4.8.
The change in the observation function is modelled as a change in the transition and reward functions
of the belief-state framework, as shown in equations 4.2 and 2.17:

Pr(B′ | B, a) = Σ_{o : B′ = f(B, a, o)} Pr(o | B, a) (4.2)

This equation has been derived in detail in section (2). The reward, as shown in equation 2.17, also
directly depends on the observation function. The state space remains the same since the robots
Citation                          Allowed Task   Source Task   Task       Transferred   Allowed    TL
                                  Differences    Selection     Mappings   Knowledge     Learners   Metrics
[Selfridge et al. (1985)]         t              h             N/A        Q             TD         tt
[Asada et al. (1994)]             si             h             N/A        Q             TD         tt
[Singh (1992)]                    r              all           N/A        Q             TD         ap, tr
[Atkeson and Santamaria (1997)]   r              all           N/A        model         MB         ap, j, tr
[Asadi and Huber (2007)]          r              h             N/A        πp            H          tt
[Andre and Russell (2002)]        r, s           h             N/A        πp            H          tr
[Ravindran and Barto (2003b)]     s, t           h             N/A        πp            TD         tr
[Ferguson and Mahadevan (2006)]   r, s           h             N/A        pvf           Batch      tt
[Sherstov and Stone (2005)]       sf, t          mod           N/A        A             TD         tr
[Madden and Howley (2004)]        s, t           all           N/A        rule          TD         tt, tr
[Lazaric (2008)]                  s, t           lib           N/A        I             Batch      j, tr

Table 4.6 TL methods for transfer across same state variables and actions. N/A implies the absence of any task mapping.
try to localize in the same set of positions. In spite of being in the same location, the reward
function changes due to the change in the belief states estimated from the observations. The change
in the transition function is due to the distance covered by the robot when an action is executed
(since the robot cannot move beyond what it can see, it is limited by its sensor range). This in
turn changes the transition probabilities, since the probability of landing in the next state given the
present state changes. (The robot might fall short and not move to the next cell, but remain in
the same cell even after executing the action.)
As explained above, this problem can be modelled as transfer across different transition and
reward functions. It has been observed that a high-level transfer (rules, policies) is recommended
when the reward function differs between the source and target tasks, whereas a lower-level transfer
(Q-values) is recommended when the transition function differs between the source and target tasks.
In addition, the source and target belief states have to be mapped before the transfer. It will become
clear in the next sections how the algorithm and framework achieve the above.
4.6 Formalism for transfer
Consider two robots: RA, which has a high sensor range, and RB, which has a lower sensor range.
The robots start from the same initial position but, due to the different sensor ranges, generate
different belief states, as shown in figures 4.2 and 4.4. A successful transfer should bias the action
selection in the lookahead step, so that the actions which lead to the maximum discounted rewards in
the future are selected. Referring to the lookahead step in equation 3.1 in algorithm step 2(i) of section
3, there are two components which decide the action to be executed: first, the immediate
rewards as seen from a hypothesis, weighted by the probability of being in that hypothesis;
and second, the goodness (value function output) of the next belief state on executing that action.
Observing equation 3.1, the second term can be used to bias the Q-value. If the function
approximator used to calculate V(b′) already indicates which next belief state will maximize the
reward in future, the robot can execute the corresponding action.
The transfer also requires the association of the correct source and target belief states. It is quite
plausible that the next belief state b′ for RB might not be the same as that of robot RA. This is
generally the case when robot RB is near a unique feature (e.g., an obstacle) and just short of
it (Figure 4.1). In such a case, the belief state of robot RA would show a peak in the probability
of the cell nearest to the obstacle, whereas that of robot RB would not. Here the property of the value
function being a function approximator comes into play: an approximation of the goodness value of
the belief state is calculated even if the belief state was not encountered in the training. There is a good
chance that b′ of RB corresponds to some belief state b′ of robot RA and the set of actions which
maximize the discounted rewards in future are similar.
The third requirement of the transfer is to incorporate the new belief states of robot RB in the value
function, so that it is able to gauge the goodness of the belief states of both robots. This is done by
the update equation 3.2, in step 5(i) of the algorithm in section 3. This results in adjusting the
regressor of the value function (implemented by the CMAC) locally by the belief states of robot RB.
A general algorithm to perform such a transfer is as follows. Let M1 = (S, A, R1, P1) and M2 =
(S, A, R2, P2) be the two MDPs.
Let KM be the knowledge to be transferred.
Let A(KM) be the learning algorithm:

A(KM) = V/Q (4.3)

or

A(KM) = π (4.4)

Let T(KM) be the transfer function:

T(KM) = K′M (4.5)

Then the transfer process can be formulated as:

(a) Collect KM1 from the source task.

(b) Collect KM2 from the target task.

(c) Transfer T(KM1 | KM2) = K′M2.

(d) Learn A(KM2 ∪ K′M2).

(e) Evaluate the performance.
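The five steps can be sketched as a driver function; the task objects and the `learn`/`transfer` callables are illustrative stand-ins for A(·) and T(·), not part of the thesis implementation:

```python
def transfer_process(source_task, target_task, learn, transfer):
    """Sketch of the transfer process (a)-(e). `learn` plays the role of
    the learning algorithm A(K_M); `transfer` the role of T(K_M1 | K_M2)."""
    k_m1 = source_task.collect_knowledge()   # (a) knowledge from the source task
    k_m2 = target_task.collect_knowledge()   # (b) knowledge from the target task
    k_m2_new = transfer(k_m1, k_m2)          # (c) T(K_M1 | K_M2) = K'_M2
    policy = learn(k_m2 | k_m2_new)          # (d) learn A(K_M2 ∪ K'_M2)
    return target_task.evaluate(policy)      # (e) evaluate the performance
```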
To summarize, the algorithm and framework satisfy all the requirements for transfer and hence
can be used with the minimal change of initializing the target Q-values with the source Q-values and
then learning over them. If RA has learned well (to significantly reduce the number of localization
iterations), the transfer is positive and will lead to an improvement in the performance of robot
RB.
Conceptually, new belief states and their associated actions and rewards are added to the value
function, and the existing ones are adjusted with the new knowledge. These steps are repeated until
robot RB gets localized. The transfer is carried out for many initial positions of the robot in the
map. The performance of the target task is compared with that of the source task. The results are
discussed in detail in the results section.
4.7 Algorithm
The framework used is the same as described in section (2). The framework is used to train a single
robot to move in the map such that it localizes in the least number of steps. The clustering method B
described in section 3 (equation 3.4) is used for the transfer. Since there are no robot-robot detections,
the localization is solely influenced by the map features. The algorithm has two phases:
one for training the source robot, and a second for transferring the policy to the target robot. The
pseudocode for the training phase is as follows:

(a) Make observations. Calculate the belief state b.

(b) For every action:

i. For all hypotheses of the robot:

A. Calculate Q(a):

Q(b, a) = Q(b, a) + Σ_{n ∈ N} b(sn) × (R_lookahead + γ V(b′)) (4.6)

(c) Execute the action with the highest value of Q(b, a).

(d) Update V(b) with the reward:

V(b) = α(R + γ × V(b′) − V(b)) + V(b) (4.7)

(e) Repeat until the robot is localized.
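The pseudocode above can be sketched in Python; the `hypotheses`, `lookahead` and value-function callables are illustrative placeholders for the framework's components, not the thesis code:

```python
import random

def q_value(b, a, hypotheses, lookahead, V, gamma=0.9):
    """Equation 4.6: Q(b, a) accumulated over all hypotheses s_n,
    weighted by the belief b(s_n)."""
    q = 0.0
    for s_n, prob in hypotheses(b):
        r_lookahead, b_next = lookahead(b, s_n, a)
        q += prob * (r_lookahead + gamma * V(b_next))
    return q

def select_action(b, actions, hypotheses, lookahead, V,
                  gamma=0.9, epsilon=0.1):
    """Step (c) with epsilon-greedy exploration: with small probability
    pick a random action, otherwise the action of highest Q(b, a)."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions,
               key=lambda a: q_value(b, a, hypotheses, lookahead, V, gamma))

def td_update(v_b, reward, v_b_next, alpha=0.1, gamma=0.9):
    """Equation 4.7: V(b) <- V(b) + alpha * (R + gamma * V(b') - V(b))."""
    return v_b + alpha * (reward + gamma * v_b_next - v_b)
```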
The algorithm starts with the robot acquiring the sensory readings of its surroundings and determining
its belief state. Every non-zero probability value in the belief state represents a hypothesis
about the robot's pose and how probable that hypothesis is. A one-step lookahead is performed
to find the action which is more likely to lead to the highest average gain over all hypotheses. The
most favourable action is then executed and the reward is calculated. This reward acts as feedback
to the goodness value of that belief state. This algorithm is iterated until the robot gets localized.
The training is carried out for many initial poses of the robot. To avoid getting stuck in
local minima at step (c), an ε-greedy exploration strategy is used with the value of ε being 0.1. One could
use strategies like softmax action selection, but ε-greedy is computationally convenient. As found
empirically, a total of around 120 epochs is sufficient for the MDP to converge. A CMAC with
128 tilings, each with 10 × 10 tiles, is used for approximation of the value function.
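A minimal tile-coding (CMAC) approximator for a 2-D input can be sketched as follows; the tiling and tile counts mirror the setup above, but the implementation details (offset scheme, learning-rate sharing) are illustrative assumptions:

```python
import numpy as np

class CMAC:
    """Minimal CMAC: `n_tilings` offset copies of a `tiles` x `tiles` grid
    over inputs in [0, 1). The value of a point is the sum of one weight
    per tiling; updates spread the step size across the tilings."""

    def __init__(self, n_tilings=128, tiles=10, alpha=0.1):
        self.n_tilings, self.tiles = n_tilings, tiles
        self.alpha = alpha / n_tilings          # share the learning rate
        self.w = np.zeros((n_tilings, tiles, tiles))

    def _active(self, x, y):
        for t in range(self.n_tilings):
            off = t / (self.n_tilings * self.tiles)   # per-tiling offset
            i = min(int((x + off) * self.tiles), self.tiles - 1)
            j = min(int((y + off) * self.tiles), self.tiles - 1)
            yield t, i, j

    def value(self, x, y):
        return sum(self.w[t, i, j] for t, i, j in self._active(x, y))

    def update(self, x, y, target):
        """Move value(x, y) towards `target` by a fraction alpha."""
        err = target - self.value(x, y)
        for t, i, j in self._active(x, y):
            self.w[t, i, j] += self.alpha * err
```

The overlapping offset tilings are what give the CMAC its local generalization: nearby inputs share most of their active tiles, so an update at one belief state also adjusts the value of its neighbours.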
4.8 An example
Consider two robots: RA, which has a high sensor range, and RB, which has a lower sensor range,
as shown in figures 4.2 and 4.4.
Figure 4.2 Robot RA with a high sensor range
Figure 4.3 The belief state for the Robot RA with a high sensor range
The belief states for both robots are shown in figures 4.3 and 4.5.
For a 6 × 10 (6 rows and 10 columns) cell framework, the belief states for the robots are:
B(RA) =
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0.068 0.112 0.194 0.116 0.084 0.214 0.097 0.116 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
(4.8)
Figure 4.4 Robot RB with a low sensor range
Figure 4.5 The belief state for the robot RB with a low sensor range
and
B(RB) =
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0.067 0.105 0.189 0.108 0.0821 0.200 0.094 0.108 0
0 0 0 0 0 0 0 0 0 0
0 0 0.0072 0.0072 0.0079 0.0033 0.011 0 0.0079 0
0 0 0 0 0 0 0 0 0 0
(4.9)
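The two belief states can also be compared quantitatively. The sketch below builds the grids from equations 4.8 and 4.9 with NumPy; the entropy measure is an illustrative way to quantify ambiguity, not part of the algorithm itself.

```python
import numpy as np

# Non-zero cells of B(RA) (equation 4.8): row 3 only.
b_ra = np.zeros((6, 10))
b_ra[2, 1:9] = [0.068, 0.112, 0.194, 0.116, 0.084, 0.214, 0.097, 0.116]

# Non-zero cells of B(RB) (equation 4.9): row 3 plus extra hypotheses in row 5.
b_rb = np.zeros((6, 10))
b_rb[2, 1:9] = [0.067, 0.105, 0.189, 0.108, 0.0821, 0.200, 0.094, 0.108]
b_rb[4, 2:9] = [0.0072, 0.0072, 0.0079, 0.0033, 0.011, 0, 0.0079]

def entropy(belief):
    """Shannon entropy of the non-zero hypothesis probabilities."""
    p = belief[belief > 0]
    return -(p * np.log(p)).sum()
```

RB's lower sensor range leaves additional low-probability hypotheses in row 5, so its belief state has more non-zero cells and higher entropy, i.e. it is more ambiguous than RA's.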
Referring to the algorithm in the section above, transfer is done by initializing the value function
of robot RB with that of robot RA. The Q-values are generated for all actions using
equation 4.6, and the action with the highest Q-value is executed. The next belief state is calculated
Iteration   Q-values [a1, a2, a3, a4]          Action selected
1           [ 0.197,  0,      0.319,  0.431]   a4
2           [ 1.65,   0.144,  0.517,  0.187]   a1
3           [ 0.09,   0.03,   0.144,  0.126]   a3
4           [ 1.55,   0.91,   1.11,   1.54 ]   a1
5           [-0.206,  0.68,  -0.66,   0.416]   a2
6           [-0.59,   0,      0,     -0.578]   a4
7           [ 0.48,  -0.056,  0.63,  -0.899]   a3

Table 4.7 Q-values for RB with transfer
Iteration   Q-values [a1, a2, a3, a4]          Action selected
1           [ 0.188,  0,      0.316,  0.418]   a4
2           [ 1.64,   0.136,  0.513,  0.181]   a1
3           [ 0.041, -0.23,  -0.09,   0.06 ]   a4
4           [ 0.162,  0.071,  0.06,   0.141]   a1
5           [ 0,      0.024,  0.101,  0    ]   a3
6           [-0.42,   0.14,  -0.46,  -0.059]   a2
7           [ 0.627,  0,      0,      0.414]   a1
8           [ 0.209,  1.15,   0.130,  0    ]   a2
9           [ 0.064,  0,      0,     -0.814]   a1
10          [ 0.043,  0.084,  0.052,  0    ]   a2

Table 4.8 Q-values for RB without transfer
after the action execution. The next belief state for robot RB is as shown:
B′(RB) =
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0.097 0.074 0.149 0.141 0.128 0.064 0.222 0.081 0
0 0 0.004 0.008 0.005 0.008 0.004 0.009 0.005 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
(4.10)
The reward is calculated for this transition and the value function is updated. This constitutes a
single iteration.
As seen in tables 4.7 and 4.8, robot RB executes the same action for the first two iterations. In
the third iteration, due to the bias of the transferred V(b) values, the Q-function attains a higher
value for a3, which leads to faster localization (in 7 steps, as compared to 10 steps without
transfer).
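Assuming the value function is stored as a CMAC weight array, the transfer step itself amounts to copying the trained source weights into the target; the sketch below is a hypothetical illustration of that initialization, with made-up weight shapes.

```python
import numpy as np

def transfer_value_function(source_weights):
    """Initialize the target robot's value-function weights with a copy of
    the trained source robot's weights, instead of zeros (learning from
    scratch). Later TD updates then refine rather than rebuild V(b)."""
    return np.array(source_weights, copy=True)

# Hypothetical illustration: RA's trained CMAC weights seed RB's value function.
w_ra = np.random.default_rng(0).normal(size=(128, 10, 10))
w_rb = transfer_value_function(w_ra)
```

The copy (rather than a shared reference) matters: RB's subsequent learning must not disturb RA's already-trained policy.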
4.9 Experiments and Results
This section presents the empirical results obtained for various experiments done with transfer
learning. The results are broadly divided into two subsections: the first shows that
the general trends of transfer metrics hold in an active localization task; the second covers
trends in transfer learning that are specific to the task of active localization.
4.9.1 Transfer Metrics
The localization performance (PL) is calculated as:

PL = 1 / IL        (4.11)

where IL is the average number of iterations required to localize a robot.
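As a quick illustration of equation 4.11, using the 7-step (with transfer) and 10-step (without transfer) localization runs from tables 4.7 and 4.8:

```python
def localization_performance(iterations_per_run):
    """Equation 4.11: PL = 1 / IL, where IL is the average number of
    iterations needed to localize, over a set of runs."""
    i_l = sum(iterations_per_run) / len(iterations_per_run)
    return 1.0 / i_l

# Fewer iterations to localize means higher performance:
pl_with_transfer = localization_performance([7])     # 1/7
pl_without_transfer = localization_performance([10])  # 1/10
```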
Figure 4.6 Performance comparison (with transfer learning vs. without transfer learning)
The graph in figure 4.6 compares localization performance with and without
transfer learning as the number of training epochs increases.
The MDP without transfer learning was trained from scratch for the new observation function
(i.e., for the different sensor range). The MDP with transfer learning was initialized with the CMAC
weights learned for the earlier observation function, and the new learning was bootstrapped over them.
The number of training epochs can be viewed as the system's experience, which keeps increasing as it
encounters different belief states and the corresponding transitions. As expected, with the increase
in experience, the performance of the system improves. This observation is consistent for
both MDPs, irrespective of whether transfer is performed. It was also
observed that, with transfer learning, performance receives a jumpstart near the zero end of the
training epochs. This matches intuition: the older MDP already "knows" the
environment and hence can partially steer the robot without any learning. The MDP trained with
transferred knowledge also displays a higher learning rate, as its learning curve shows a better
bias than that of an MDP which starts learning from scratch. Transfer
learning likewise shows an increase in average reward, due to the improved initial
performance and the faster learning rate; this is evident from the larger area under the
learning curve with transfer than without.
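The transfer benefits described above (jumpstart, faster learning, higher total reward) can be read off a pair of learning curves; the sketch below uses hypothetical curve data shaped like the comparison in figure 4.6, not the thesis' actual measurements.

```python
import numpy as np

def transfer_metrics(curve_transfer, curve_scratch):
    """Compare two learning curves (performance per training epoch):
    jumpstart = gap in initial performance at epoch zero,
    auc_gain  = gap in total area under the curves (total-reward proxy)."""
    t = np.asarray(curve_transfer, dtype=float)
    s = np.asarray(curve_scratch, dtype=float)
    return {"jumpstart": t[0] - s[0],
            "auc_gain": t.sum() - s.sum()}

# Hypothetical curves: transfer starts higher and improves faster.
metrics = transfer_metrics([0.05, 0.09, 0.12, 0.14],
                           [0.01, 0.03, 0.06, 0.10])
```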
4.9.2 Task specific trends
Since the transfer learning in this task is primarily based on the observation function, performance
is compared across different sensor ranges. The average number of iterations
required to localize is measured against the ratio between the source and the target robot's
sensor ranges. The graph in figure 4.7 shows the trend: transfer learning
becomes more effective as the ratio approaches 1. This trend is to be expected, since
transfer learning should become more effective as the source and target tasks become more
similar.
It was also observed that the effects of transfer learning were more significant on sparse maps than
on cluttered maps. The two maps used for comparison are shown in figure 4.8.
The learning curves for the maps are shown in figure 4.9. It can be observed that the difference
in learning rates with and without transfer learning is larger for the sparse map than for the
cluttered map.
Figure 4.7 Performance vs. ratio of source and target sensor ranges of the robot

Figure 4.8 Maps: relatively cluttered map (top), sparse map (bottom)

The trend can be explained by closely examining equation 4.6. The number of iterations
required to localize depends directly on the action selected at every state, which in turn depends on
Q(b, a) (equation 4.6). The Q-value is composed of two parts: the immediate
reward R_lookahead, obtained by observing the environment, and the goodness value of the next
belief state on executing action a, supplied by the value function V(b′). In a cluttered map, the
immediate-reward component dominates the equation and leaves less room for V(b′) to influence the
resulting Q(b, a). Another reason may be that, for a sparse map, there is relatively little
dissimilarity between the belief state obtained with a large sensor range and the one obtained with
a small sensor range. In that case, the association made by V(b′) is better and transfer takes place
more effectively. Hence there was less difference in localization performance between the MDPs with
and without transfer learning on cluttered maps. However, the effect of transfer was evident in a
comparison of Q-values: in MDPs with transferred V(b) values, the Q-value of the "favourable"
action was significantly higher than that of the other actions, increasing its probability of being
selected for the corresponding belief state. In sparse maps, by contrast, the lack of unique
features results in low reward values, enabling the V(b) values obtained by transfer to dominate
the action selection.
Figure 4.9 Learning curves of Map 1 (relatively cluttered, top) vs. Map 2 (sparse, bottom)
Chapter 5
Conclusions
Mobile robot localization is an instance of the general localization problem, which is the most
basic perceptual problem in robotics. Unfortunately, the pose usually cannot be sensed
directly: most robots do not possess a noise-free sensor for measuring pose, so the pose has to be
inferred from data. In environments that possess relatively few features enabling a robot to
unambiguously determine its location, global localization algorithms can therefore produce multiple
hypotheses about the robot's location. The idea is that by making more observations of features in
the environment and matching them against the hypotheses, the hypothesis corresponding to the true
pose of the robot gains the most evidence, making it distinguishable from the other, false,
hypotheses. This methodology is called active localization.
It has been observed that active localization can be performed more efficiently if multiple robots
are involved: the ambiguity in position is reduced with the help of robot-robot
detections along with the map features. A centralized method for this problem maintains
the states/hypotheses of all robots and chooses the action that yields the highest gain
in terms of localization. This method is expensive in the sense that a large number of computations
is required to compare the hypotheses of multiple robots, and these computations increase
exponentially with the number of robots. An alternative is a distributed
algorithm in which the robots are completely independent, but such methods suffer from
large amounts of message passing.
A learning method and framework is proposed such that the algorithm is semi-distributed in na-
ture; the loose coupling between the robots accelerates the convergence of the algorithm, and once
trained, it is independent of the number of robots. The thesis aims at designing a learning frame-
work able to implement active localization for single and multiple robots. The major
problem faced is that of learning over huge grid-based belief-state spaces, so as to avoid localizing
in terms of coarser map units (e.g., in terms of cubicles). To solve this problem, a system of hi-
erarchical belief states is used, such that learning takes place over a coarse belief-state space
whereas localization occurs at a much finer level.
The framework was successfully tested for multi-robot active localization. The test results showed
that the method is an effective way to perform active localization for multiple robots without
dealing with huge joint state and action spaces. Since the algorithm aims at learning the unique
features in the map, be they obstacles or other robots, it performs a kind of feature learning
and hence is scalable to any number of robots, or to their absence. That is to say, the system
does not need to be retrained for every change in the number of robots: once it is trained with a
number of robots sufficient to capture the basic features of the map, the system works for a large
range of robot counts without retraining.
The framework was also applied to transferring the knowledge of localizing in a given map
across robots with dissimilar sensor capabilities. In a multi-robot scenario where the robots have
dissimilar sensor capabilities, the knowledge about global localization in a map cannot be reused
directly, and the robots would have to be retrained from scratch. This thesis explores the
possibility of transferring the knowledge about globally localizing in a map from one trained robot
to all other robots using the same map, irrespective of their sensor ranges, thus avoiding the
computationally expensive training iterations. The results discussed show that transfer learning
accelerates the process of learning localization by a significant amount. They also show that the
transfer depends largely on the similarity of the observation functions of the tasks. It was further
observed that transfer learning is more effective in maps devoid of many features, which is
favourable and supports the need for transfer learning in a global localization task.
The method presented here will find immediate use in various navigation and path-planning re-
lated problems that involve the use of autonomous heterogeneous robots.
5.1 Limitations and Scope of future work
There are some limitations to this work. Firstly, the assumption is made that the orientation is
known. Including orientation in the robot's pose (x, y, θ) would complicate the belief-state
estimation even further. Though inclusion of the orientation in the framework would be a great
asset, it is quite challenging and can be covered in future extensions of this work.
Secondly, the transfer learning across belief states has not been formulated completely. The
results provided for the tests are empirical and form the basis of the analysis, unlike the
formulation of the framework itself. A complete formulation can be attempted as future work,
using the results obtained as a starting point in the aforementioned line of research.
Related Publications
Jyotika Bahuguna, B. Ravindran and K. Madhava Krishna, MDP based Active Localization for
multiple robots, 2009, Fifth Annual IEEE Conference on Automation Science and Engineering
(CASE).
Jyotika Bahuguna, B. Ravindran and K. Madhava Krishna, Transfer Learning based Active local-
ization of robots with different sensor capabilities, 2010, ICARCV (Submitted).