
A Distributed and Morphology-Independent Strategy for Adaptive Locomotion in Self-Reconfigurable Modular Robots

David Johan Christensen a,*, Ulrik Pagh Schultz b, Kasper Stoy b

a Department of Electrical Engineering, Technical University of Denmark, Kgs. Lyngby, Denmark
b The Maersk Mc-Kinney Moller Institute, University of Southern Denmark, Odense, Denmark

* Corresponding author. Email addresses: [email protected] (David Johan Christensen), [email protected] (Ulrik Pagh Schultz), [email protected] (Kasper Stoy)

Abstract

In this paper, we present a distributed reinforcement learning strategy for morphology-independent life-long gait learning for modular robots. All modules run identical controllers that locally and independently optimize their action selection based on the robot's velocity as a global, shared reward signal. We evaluate the strategy experimentally, mainly on simulated but also on physical modular robots. We find that the strategy: (i) for six of seven configurations (3-12 modules) converges in 96% of the trials to the best-known action-based gaits within, on average, 15 minutes, (ii) can be transferred to physical robots with comparable performance, (iii) can be applied to learn simple gait control tables for both M-TRAN and ATRON robots, (iv) enables an 8-module robot to adapt to faults and changes in its morphology, and (v) can learn gaits for robots of up to 60 modules, although a divergence effect becomes substantial from 20-30 modules. These experiments demonstrate the advantages of a distributed learning strategy for modular robots, such as simplicity of implementation, low resource requirements, morphology independence, reconfigurability, and fault tolerance.

Keywords: Self-reconfigurable Modular Robots, Locomotion, Online Learning, Distributed Control, Fault Tolerance

1. Introduction

Self-reconfigurable modular robots have a flexible morphology. The modules may autonomously or manually connect to form various robot configurations. Each module typically has its own microcontroller and can communicate locally with connected neighbor modules. These characteristics entail that conventional centralized control strategies are often not suited for modular robots. Instead, we apply distributed control strategies, which in general may have several advantages over centralized strategies in terms of ease of implementation, robustness, reconfigurability, scalability, and biological plausibility.

In this paper we focus on online learning suitable for modular robots, see Fig. 1. Such a learning strategy must require a minimal amount of resources and ideally be simple to implement on a distributed system. Furthermore, the number of different robot configurations is in general exponential in the number of modules. Even with few modules, it is intractable to manually explore the whole design space of robot morphologies and controllers by hand. Further, modular robots may decide to self-reconfigure or be manually reconfigured, and modules may realistically break down or be reset. Hence, the learning strategy must be robust and able to adapt to such events. Therefore, we are exploring morphology-independent learning, where we do not tie the learning strategy and underlying gait generator to a specific robot morphology. This strategy is different from most related work on monolithic robots, where the learning space can be reduced, e.g., from kinematic models or by exploiting symmetries.

We anticipate that such attractive features can be obtained by utilizing a simple, distributed learning strategy. We let each module locally optimize its behavior independently and in parallel with the others, based on a single broadcast reward signal which is indirectly used to optimize the global gait of the robot. The learning is life-long in the sense that there is no special learning phase followed by an exploitation phase. We hypothesize that a distributed strategy may be more robust and flexible since it can be indifferent to the robot's morphology and can adapt online to module failures or morphology changes. Ultimately, we anticipate that by studying distributed learning, we may gain insights into how adaptive sensory-motor coordination, such as locomotion patterns, can emerge and self-organize in animals.

Figure 1: Illustration of concept. The minimal strategy demonstrated in this paper is: (i) distributed, since the modules learn independently and in parallel, and (ii) morphology-independent, since the robot may self-reconfigure, and modules may break down, be removed, or be added to the robot.

1.1. Summary of Contributions

This paper contributes an experimental study of a distributed online learning strategy for modular self-reconfigurable robots. The proposed strategy is minimalistic and distributed, and we present experiments, both in simulation and on physical robots, that demonstrate that the strategy is morphology independent, fault-tolerant, generalizes to several module types, and is able to learn effective locomotion gaits in a matter of minutes. We also illustrate the limitations of the strategy: how it is not appropriate for all robot morphologies, and how it has limited scalability. This paper provides a comprehensive report of our previous work (Christensen et al., 2008a, 2010b) and in addition presents a more exhaustive survey of related work, more analysis of the strategy, and additional experiments with gait-table learning, tolerance of module loss, and scalability analysis, which were previously unpublished.

1.2. Outline

The paper is organized as follows. First, we review related work in Sect. 2. Then we describe the learning strategy and how it can be applied to learn both action-based controllers and gait control tables (Sect. 3). The experimental platforms and the experimental setups are described in Sect. 4 and Sect. 5, respectively. A number of experiments are presented in Sect. 6. Finally, we discuss the results in Sect. 7 and give conclusions in Sect. 8.

2. Related Work

2.1. Self-Reconfigurable Modular Robots

The concept of modular self-reconfigurable robots was introduced in the late 80's by Fukuda and Nakagawa (1988). Compared to conventional robots, the motivations for such robots were: optimal shape in a given situation, fault tolerance, and the ability to self-repair (Fukuda et al., 1988). Since then more than two dozen systems have been designed and implemented, and the field is still growing, as indicated in recent surveys (Gilpin and Rus, 2010; Murata and Kurokawa, 2012; Stoy et al., 2010; Yim et al., 2007b).

Noteworthy systems developed during the first decade of research include: the mobile CEBOT (Fukuda et al., 1988), chain-style modular robots (Castano et al., 2000; Yim, 1994b; Yim et al., 2000), planar self-reconfigurable systems (Chirikjian, 1994; Murata et al., 1994; Rus and Vona, 2000) and 3D self-reconfigurable systems (Kotay et al., 1998; Kurokawa et al., 1998; Murata et al., 2000).

Recent systems which are able to self-reconfigure in 3D and form mobile robots include M-TRAN III (Kurokawa et al., 2008), SuperBot (Shen et al., 2006), CKBot (Yim et al., 2007a), Roombots (Sproewitz et al., 2010), ATRON (Østergaard et al., 2006), Replicator-Symbrion (Kernbach et al., 2009), Sambot (Wei et al., 2011), UBot (Zhao et al., 2011), Cross-Ball (Meng et al., 2011) and SMORES (Davey et al., 2012).


2.2. Locomotion Strategies for Self-Reconfigurable Modular Robots

In the context of modular robots, locomotion was first studied by Yim (1993) using the Polypod robot, which could be configured as a snake-like chain, a hexapod, and a rolling track. Since then numerous types of gaits have been implemented on various robot configurations. Especially chain- and hybrid-style systems have proven appropriate for constructing various robot morphologies that are able to perform locomotion. Moreover, by utilizing self-reconfiguration the robots can transform between different morphologies, as first demonstrated by Kurokawa et al. (2003). Below we review some of the more generic locomotion strategies proposed.

Gait control tables are a generic strategy which was first used to control Polypod robots (Yim, 1994a). A gait control table is a two-dimensional table of actuator actions (e.g., a goal position or to act as a spring), where the columns are identified by an actuator id and the rows are identified by a condition for transition to the following row (e.g., a time-stamp or a sensor condition). To produce periodic locomotion gaits, the table is evaluated in a cyclic fashion.

Hormone-based control is a strategy, proposed by Shen et al. (2000), which enables information to be distributed between the modules. Digital hormones represent information that propagates through the configuration of modules. Using digital hormones, modules can detect topological changes and coordinate and synchronize actions for locomotion or self-reconfiguration (Shen et al., 2002).

Role-based control was originally developed to control locomotion of CONRO robots (Stoy et al., 2002). Each module, dependent on its position in the configuration tree, automatically selects a role. This allows a CONRO robot to shift from one locomotion style to another when it is manually reconfigured (Støy et al., 2003).

Phase automata are a generalization of role-based control developed to control the locomotion of the later Polybot system (Yim et al., 2001a,b). Phase automata patterns are state machines that define the actions of a module in a given state (Zhang et al., 2003). The shift from one state to another is event (time or sensor) driven.

Central Pattern Generators (CPGs) are a biologically inspired strategy and a popular approach to locomotion control, also for monolithic robots (Ijspeert, 2008). Genetically optimized CPGs were used to control M-TRAN walkers and snakes by Kamimura et al. (2005, 2003). Similarly, CPGs controlling the YaMoR modular robot were genetically evolved and optimized online in order to achieve locomotion by Marbach and Ijspeert (2005). CPGs were also used to control a simulated quadruped with 390 Catom modules (Christensen et al., 2010a). Moreno and Gomez (2011) combined CPGs with hormone-based control to integrate sensor feedback in the control system.

Cluster-flow locomotion is a radically different mode of locomotion based on continuous self-reconfiguration of the modules to move the collective ensemble. Such cluster-flow locomotion can be realized as a highly scalable emergent behavior arising from simple, distributed, rule-based control, as demonstrated on the sliding cube model (Butler and Rus, 2003; Fitch and Butler, 2008; Varshavskaya et al., 2008) and on the ATRON modules (Østergaard and Lund, 2004). For the M-TRAN system, cluster-flow locomotion has been realized based on a two-layer centralized planner by Yoshida et al. (2001, 2002) and by Kurokawa et al. (2008), who demonstrated distributed cluster-flow on physical M-TRAN modules. Related work on the Slimebot demonstrates a similar form of locomotion where self-reconfiguration is an emergent process that arises from local module oscillation, resulting in phototaxis (Ishiguro et al., 2004).

2.3. Adaptive Self-Reconfigurable Modular Robots

The polymorphic and metamorphic nature of self-reconfigurable modular robots gives unique constraints and opportunities for utilizing adaptive strategies, such as evolution and learning, for locomotion control.

Sims (1994) pioneered the field of evolutionary robotics by co-evolving the morphology and control of simulated modular robots. Later works succeeded in transferring co-evolved robots from simulation to hardware (Lipson and Pollack, 2000; Marbach and Ijspeert, 2004). For M-TRAN robots, evolutionary algorithms were utilized to produce locomotion and self-reconfiguration (Yoshida et al., 2003), and Kamimura et al. (2005) evolved the coupling parameters of central pattern generators for straight-line locomotion. Similarly, Hancher and Hornby (2007) evolved a sinusoidal gait for a quadruped modular robot. Pouya et al. (2010) used particle swarm optimization to optimize CPG-based gaits for Roombot robots that contained both rotating and oscillating actuators. Evolutionary algorithms have also been used to optimize reactive locomotion controllers based on HyperNEAT (Haasdijk et al., 2010), chemical hormone models (Hamann et al., 2010) and fractal genetic regulatory networks (Zahadat et al., 2012). For co-evolution of morphology and control, Brunete et al. (2013) represented the morphology of a heterogeneous snake-like micro-robot directly in the chromosome, Faíña et al. (2011) represented a legged robot morphology as a tree, and Bongard and Pfeifer (2003) evolved a genetic regulatory network that would direct the growth of the robot instead of using a direct representation.

Although appealing, one challenge with evolutionary approaches is to bridge the reality gap (Mataric and Cliff, 1996), and once transferred the robot is typically no longer able to adapt. To address this limitation, evolutionary optimization of gaits can be performed directly on the physical robot (Hornby et al., 2000; Mahdavi and Bentley, 2003; Zykov et al., 2004). However, sequential evaluation of a population of gaits is not fitting for life-long learning since it requires a steady convergence to a single gait.

Instead, life-long learning strategies can be utilized. Marbach and Ijspeert (2005) used a life-long strategy, based on Powell's method, which performed a localized search in the space of selected parameters of central pattern generators. Parameters were manually extracted from the YaMoR modular robot by exploiting symmetries. Follow-up work by Sproewitz et al. (2008) demonstrated online optimization of 6 parameters on a physical robot in roughly 25-40 minutes. As is the case in our paper, they try to realize simple, robust, fast, model-less, life-long learning on a modular robot. The main difference is that we seek to automate the controller design completely, in the sense that no parameters have to be extracted from the symmetric properties of the robot.

Further, our approach utilizes a form of distributed reinforcement learning. A similar approach was taken by Maes and Brooks (1990), who performed distributed learning of locomotion on a 6-legged robot. The learning was distributed to the legs themselves. Similarly, in the context of multi-robot systems, distributed reinforcement learning has been applied for learning various collective behaviors (Mataric, 1997).

The strategy proposed in this paper is independent from the robot's morphology. Similarly, Bongard et al. (2006) demonstrated learning of locomotion and adaptation to changes in the configuration of a modular robot. They took a self-modeling approach, where the robot developed a model of its own configuration by performing basic motor actions. In a physical simulator, a model of the robot configuration was evolved to match the sampled sensor data (from accelerometers). By co-evolving the model with a locomotion gait, the robot could then learn to move with different morphologies. Our work presented here is similar in purpose but different in approach: our strategy is simple, model-less and computationally cheap, to allow implementation on resource-constrained modular robots.

In recent work we have studied distributed and morphology-independent learning based on a CPG architecture optimized online by a stochastic approximation method (Christensen et al., 2012, 2010c). Our paper utilizes a strategy which has a simpler implementation and is appropriate for optimization of discrete control parameters. The CPG-based strategy is more complex and allows optimization of continuous control parameters. Both strategies illustrate the advantages of utilizing a distributed adaptation and control strategy in modular robots.

3. A Strategy for Learning Locomotion Gaits

In this section we describe the basic reinforcement learning strategy and propose a heuristic for accelerated learning. Further, we explain how the strategy can be applied to optimize gait control tables.

3.1. Normal Learning Strategy

We utilize a simple stateless reinforcement learning strategy that is executed independently and in parallel by each module of the robot (see Algorithm 1). Each module must select which action, A, to perform amongst a small fixed set, S = {Ax, Ay, Az, ...}, of possible actions. In the initialization phase, each module executes each of these actions in random order and initializes the expected reward, Q[A], with the global reward, R, received after performing action A. Note that the time it takes to perform this initialization is independent of the number of modules, since the robot will not try every possible combination of module actions. After this phase, in each learning iteration, every module performs an action, A, and then receives a global reward, R, for that learning iteration. The expected reward, Q[A], is estimated by an exponential moving average with a smoothing factor, α, which suppresses noise and ensures that if the reward of an action changes with time, so will its estimate. Each module independently selects which action to perform based on an ϵ-greedy selection policy, where a module selects the action with the highest estimated Q[A] with probability 1 − ϵ and a random action otherwise. The algorithm can be categorized as temporal difference learning (TD(0)) with discount factor γ = 0 and with no representation of the sensor state (Sutton and Barto, 1998).

Algorithm 1 Normal Learning Strategy.

  Initialize Q[A] = R, for all A evaluated in random order
  loop
    Select action A with max Q[A] with probability 1 − ϵ, otherwise a random action
    Execute action A for T seconds
    Receive reward R
    Update Q[A] += α · (R − Q[A])
  end loop

3.2. Accelerated Learning Strategy

Algorithm 2 Accelerated Learning Strategy.

  Initialize Q[A] = R, for all A evaluated in random order
  loop
    if max(Q) < R then
      Repeat action A
    else
      Select action A with max Q[A] with probability 1 − ϵ, otherwise a random action
    end if
    Execute action A for T seconds
    Receive reward R
    Update Q[A] += α · (R − Q[A])
  end loop

We observe that for a locomotion task, the performance of a module is tightly coupled with the behavior of the other modules in the robot. Therefore, the best action of a module changes over time in response to the other modules changing their actions. The learning speed is limited by the fact that it must rely on randomness to select a fitter but underestimated action a sufficient number of times before the reward estimate becomes accurate. Let us now assume that the robot performs an action, A, and receives a reward, R, which is fixed and noise-free. Then we can compute the number of iterations, k_ema, required for the exponential moving average of Q[A] to estimate R with a given precision, e:

k_ema = log(e) / log(1 − α)    (1)

where e is the final estimation error, in percent, relative to the initial error. By Taylor expansion we obtain:

k_ema ≈ −log(e) / α    (2)

Taking into account the ϵ-greedy selection policy, and assuming that Q[A] underestimates R, the number of required learning iterations, k_learn, becomes:

k_learn ≈ −log(e) / (ϵ · α)    (3)

Since ϵ, α < 1 we can expect a slow convergence to R. Therefore, to accelerate the convergence we tested a simple heuristic (see Algorithm 2): if the received reward, R, after a learning period is higher than the highest estimate of any action, max(Q), the evaluated action may be underestimated and fitter than the currently highest-estimated action. Therefore, we repeat the potentially underestimated action, A, to accelerate the estimation convergence. In the accelerated case the number of required learning iterations becomes:

k_learn = k_ema ≈ −log(e) / α    (4)

For example, with α = 0.1 and ϵ = 0.2, the number of learning iterations to reach an estimate within 5% of the correct reward is k_learn ≈ 148 for the normal and k_learn ≈ 28 for the accelerated learning strategy. In general, the acceleration heuristic will speed up the estimation of a fitter but underestimated action by a factor of 1/ϵ. However, note that the acceleration heuristic trades exploration for exploitation, which may cause the strategy to stay longer in local optima. In Sect. 6 we will experimentally compare the two strategies.

3.3. Learning Strategy Applied to Gait Control Tables

In Algorithms 1 and 2, each module learns to always perform a single action, such as rotating clockwise. To enable a more versatile learning strategy that may work on any module type, we combine this strategy with gait control tables (Yim, 1994b). As explained in Sect. 2.2, gait control tables are a generic strategy that can be used to realize locomotion on most modular robots. In our work we optimize gait control tables that contain actuator set-points and that transition to the next row based on fixed time intervals.

Figure 2: Illustration of the gait-table-based learning strategy for a single module. Independent and parallel learning processes learn each entry, A ∈ S, in the gait-table. Therefore, each module learns its own column of the gait-table.

To learn the set-points in a gait-table, we let each module learn the set-points of its own column in the gait-table. Each module runs one parallel learning process per entry in its column. Each learning process selects a set-point, A, amongst a set of predefined set-points, S. The learning processes learn independently and in parallel based on a shared reward signal. This extended strategy is illustrated in Fig. 2. To utilize this approach for a given system, we must define the set of set-points that can fill the table and the number of rows in the table; the number of columns is automatically adjusted when modules are added or removed.

4. Hardware and Simulation Platforms

4.1. The ATRON Self-Reconfigurable Modular Robot

The ATRON self-reconfigurable robotic system is a homogeneous modular system, which means that all modules are identical in hardware and typically also in software (Østergaard et al., 2006). We have manufactured 100 ATRON modules; a single module is shown in Fig. 3(a). Modules can be assembled into a variety of robots: robots for locomotion (e.g., a snake-like chain, a wheeled robot, and a legged robot, as shown in Fig. 3(b)), robots for manipulation (e.g., small robot arms), or robots that achieve some functionality from their physical shape, such as structural support (see Fig. 3(c)) (Brandt and Christensen, 2007; Christensen, 2006). By self-reconfiguring, a group of modules can change their shape, for example from a wheeled robot to a snake-like robot to a legged robot.

An ATRON module has a spherical exterior composed of two hemispheres, which can actively rotate relative to each other (see Fig. 3(a)). On each hemisphere, a module has two actuated male connectors and two passive female connectors. Rotation around the center axis is, for self-reconfiguration, always done in 90-degree steps. This moves a module connected to the rotating module from one lattice position to another. A 360-degree rotation takes approximately 6 seconds. Encoders sense the rotation of the center axis. Male connectors are actuated and shaped as three hooks, which grasp onto the passive female connector bars. A connection or a disconnection takes about two seconds. Located next to each connector are an infrared transmitter and receiver, which allow modules to communicate with neighbor modules and sense the distance to nearby objects. Connector positions and orientations are such that the ATRON modules sit in a global surface-centered cubic lattice structure. Furthermore, each module is equipped with tilt sensors that allow the module to know its orientation relative to the direction of gravity.

4.2. Unified Simulator for Self-Reconfigurable Robots

Simulation experiments are performed in an open-source simulator named Unified Simulator for Self-Reconfigurable Robots (USSR) (Christensen et al., 2008b). We have developed USSR as an extendable physics simulator for modular robots. The simulator is based on the Open Dynamics Engine (Smith, 2005), which provides simulation of collisions and rigid-body dynamics. Through socket connections, USSR is able to run module controllers which can also be cross-compiled for some of the physical platforms.

USSR includes implementations of several existing modular robots, e.g., the ATRON and M-TRAN as utilized in this paper. The ATRON module is modeled in the simulator, and the model parameters, e.g., strength, speed, weight, etc., have been calibrated to match the existing hardware platform, to ease the transfer of controllers developed in simulation to the physical modules. The M-TRAN module fills two cells in a cubic lattice and has six connector surfaces. Each module has two actuators which can rotate in the interval ±90°. We have implemented a model of the M-TRAN III module in the USSR simulator based on available specifications (Kurokawa et al., 2008). However, we do not have access to the physical M-TRAN modules. Therefore, although the kinematics is correct, specific characteristics might differ somewhat from the real system. We accept this, since our purpose is to evaluate the learning strategy on a different system than the ATRON, not to find efficient transferable locomotion gaits for M-TRAN robots.

Figure 3: (a) A single ATRON module: on the top hemisphere the two male connectors are extended, on the bottom hemisphere they are contracted. (b) Different configurations of ATRON modules. (c) Meta-modules for self-reconfiguration.

Exp. Type      Robot    α       1 − ϵ   T
Action-based   ATRON    0.1     0.8     7 s
Gait-Table     ATRON    0.1     0.96    7 s
Gait-Table     M-TRAN   0.0333  0.96    1.5 s

Table 1: Learning parameters.

5. Experimental Setups

5.1. Learning Parameters and Reward Signal

In the physical and the simulated experiments described in Sect. 6, each module runs an identical learning controller with parameters set as indicated in Table 1. In some experiments we compare with randomly moving robots, i.e., we set 1 − ϵ = 0 in Algorithm 1.

Each module that is part of a robot shares and optimizes its behavior based on the same reward signal. The reward signal is a measurement of the velocity, which we estimate as the distance moved by the robot's center of mass in T seconds:

R = |p_{t+T} − p_t| / T    (5)

Generally, to reduce noise in the reward signal, T must be a multiple of a full gait period, i.e., the time it takes a module to rotate 360 degrees or oscillate back and forth. Here, T is selected to be a single full gait period, which we found gave a sufficiently high signal-to-noise ratio to achieve convergence in the presented experiments. α balances the amount of noise in the reward signal against convergence time, and ϵ controls the exploration/exploitation trade-off. The learning parameters in Table 1 were tuned in simulation for reliable and fast convergence. Note that the relationship between experimental time, iteration time, T, and number of iterations, k, is simply: time = k · T.

5.2. Action Space

To utilize the learning strategy we must define a set of actions or gait-table set-points. We select actions/set-points that are module-type specific but somewhat generic with respect to robot morphology.

In the action-based experiments we utilize ATRON modules that always perform one of the following three actions:

• HomeStop (rotate to 0 degrees and stop)

• RightRotate (rotate clockwise 360 degrees)

• LeftRotate (rotate counter-clockwise 360 degrees)

Therefore the basic action set for ATRON is:

S = {HomeStop, RightRotate, LeftRotate}

After a learning iteration, a module should ideally be back at its home position to ensure repeatability. Therefore, the modules will synchronize their progress to follow the rhythm of the learning iteration based on the shared reward signal.

In the gait-table experiments, to study how the strategy works on a system with a different kinematics, we will use joint-limited ATRON modules that are restricted to rotate within a ±90-degree interval. In effect, the gait-table-based learning strategy must find gaits based on oscillations to move the robots, instead of gaits based on continuous rotations. For each robot morphology we define an initial pose that the actuation is performed relative to. The gait-table has five rows, so each module must learn five angle values from the set-point set:

S = {−60,−30, 0, 30, 60} degrees

Fig. 18 shows examples of ATRON gait-tables.

The M-TRAN module has two actuators. Therefore, we let each actuator be controlled by an independent gait-table. Each gait-table has five rows, where an entry can contain one of three set-points:

S = {−30, 0, 30} degrees

In general, selecting an initial pose and the set-point sets is a trade-off between a high potential to move and being stable, so that the robot does not fall over while learning. Sect. 7 discusses such assumptions and limitations in more detail.

5.3. Physical Setup

The ATRON modules are not equipped with a sensor that allows them to measure their own velocity or distance travelled. Instead, we have constructed a setup which consists of an arena with an overhead camera connected to a server (see Fig. 4). The server tracks the robot, computes the robot's velocity, and sends the reward signal wirelessly to the robot.

The accelerated learning algorithm, as specified in Algorithm 2, runs on the physical modules. At a frequency of 10 Hz, each module sends a message containing its current state (paused or learning), time-step, and reward to all of its neighbors through its infrared communication channels. The time-step is incremented and the reward is updated from the server side every T = 7 seconds. When a new update is received, a module performs a learning update and starts a new learning iteration. The state can be set to paused or learning from the server side. The robot is paused by the server when it moves beyond the borders of the arena and is then manually moved back onto the arena before the learning is continued. In the presented results, the paused time intervals have been removed.

6. Experiments

6.1. A Typical ATRON Robot

In this experiment we study how the normal and accelerated action-based learning strategies behave on a typical modular robot. We consider a simulated quadrupedal consisting of 8 ATRON modules. To facilitate the analysis we simplify and control the experimental variables and the initial conditions: (i) we disable four of the modules (i.e., they are stopped in the home position) and only allow the four modules acting as legs to be active, as indicated in Fig. 5(a); (ii) we force the robot to start learning from a completely stopped state by initializing Q[A] to 0.1 for the HomeStop action and to 0.0 for the other actions. Note that these two simplifications will prolong the convergence time (the robot will seldom move initially) and make the robot less able to locomote (only four of eight modules are active). The motivation is that the result of each trial will be less dependent on starting conditions and the gait learning analysis will be simpler. Therefore, we can compare the two strategies more directly and analyze their behavior. The other experiments described in this paper do not make these simplifications.

First, consider the two representative learning examples given in Fig. 5. The contour plot in Fig. 5(b) illustrates the fitness landscape (found using systematic search) and how an example robot controller transitions to gradually improve itself. The controller eventually converges to one of the four optima, which correspond to the symmetry axes of the robot (although in one case the robot has a single-step fallback to another controller). The graphs in Fig. 5(c) and 5(d) show how the velocity of the robots jumps in discrete steps that correspond to changes in the preferred actions of modules.

Fig. 6 compares the convergence speed and performance of the learning with and without the acceleration heuristic. The time to converge is measured from the start of a trial until the controller transitioned to one of the four optimal solutions (found with systematic search). In all 20 trials the robot converged; in 4 trials the robot had short fallbacks to non-optimal controllers (as in Fig. 5(c)).


Figure 4: Experimental setup of online learning.

(a) Quadrupedal. (b) Path through Control Space; axes: action of Leg 2 and Leg 4 versus action of Leg 1 and Leg 3 (RR, SR, LR, LS, SS, RS, RL, SL, LL). (c) Normal Learning and (d) Accelerated Learning; axes: time (seconds) versus velocity (meters/sec).

Figure 5: Typical simulated learning examples with and without the acceleration heuristic. (a) Eight-module quadrupedal robot (four active modules). (b) Contour plot with each point indicating the velocity of a robot performing the corresponding controller (average of 10 trials per point). The arrows show the transitions of the preferred controller of the robot. (c) and (d) show the corresponding rewards received by the robots during one hour. The horizontal lines indicate the expected velocity based on the same data as the contour plot.


Figure 6: The velocity of a quadrupedal robot with four active modules as a function of time. Each point is the average of 10 trials. The horizontal bars indicate average convergence time and standard deviation. Note that accelerated learning converges significantly faster (P=0.0023) for this robot.

On average, accelerated learning converged faster (19 minutes or 1146 iterations) than normal learning (32 minutes or 1898 iterations). The difference is statistically significant (P=0.0023, Student's t-test). Note that accelerated learning on average reaches a higher velocity, but not due to the type of gaits found. Rather, the faster velocity is due to the acceleration heuristic, which tends to repeat well-performing actions at the cost of random exploration. This can also be seen by comparing the amount of noise in Fig. 5(c) with 5(d).

As summarized in Table 2, the learning strategy behaves in roughly the same way independently of the acceleration heuristic. A typical learning trial consists of 4-5 controller transitions, where a module changes its preferred action before the controller converges. In about 90% of these transitions, only the action of one module changes. This indicates that at a global level the robot is performing a localized random search in the controller space. Although the individual modules are not collectively searching in any explicit manner, this global strategy emerges from the local strategy of the individual modules.

6.2. Morphology Independence

In this simulated experiment, we perform online learning with seven different simulated ATRON robots to study the degree to which the learning strategy is morphology independent (see Fig. 7). In each trial, the robot had 60 minutes to optimize its velocity. For each robot type, 10 independent trials were performed with the normal, accelerated, and random strategy. Results are shown in Fig. 8.

                        Normal       Accelerated
Transitions per Trial   4.4 (1.17)   4.0 (0.94)
1-Step Transitions      87%          90%
2-Step Transitions      13%          6%
3-Step Transitions      0%           4%
4-Step Transitions      0%           0%

Table 2: Average number of controller transitions to reach the optimal solution, with standard deviations in parentheses. To measure the number of controller transitions, very brief transitions of one or two learning steps (7-14 seconds) are censored away. The results are based on 10 trials of the quadrupedal with 4 active modules learning to move. Note that there is no significant difference in the type of controller transitions. Also, 1-step transitions are by far the most common, which indicates that the search is localized.

Figure 8: Velocity at the end of learning in simulation. Each bar is the average velocity (reward) from the 50th to the 60th minute of 10 independent trials. Error bars indicate one standard deviation of average robot velocity. Note that both normal and accelerated learning have a higher average velocity than random movement.

Compared to randomly behaving robots, both normal and accelerated learning improve the average velocity significantly. We observe that each robot always tends to learn the same, i.e., symmetrically equivalent, gaits. There is no difference in which types of gaits the normal and accelerated learning strategies find. Overall, the learning of locomotion is effective, and the controllers are in most cases identical to those we would design by hand using the same action primitives. A notable exception is the snake robot, which has no known good controller given the current set of action primitives. The other robots converged within 60 minutes to best-known gaits in 96% of the trials (115 of 120 trials). Convergence time was on average less than 15 minutes for those robots, although single trials could be caught in suboptimal solutions for extended time periods. We found no general trend in how the morphology affects the learned gaits. For example, there is no trend that smaller or larger robots are faster, except that wheeled locomotion is faster than legged locomotion.

(a) Two-wheeler (b) Snake-4 (c) Bipedal (d) Tripedal (e) Quadrupedal (f) Crawler (g) Walker

Figure 7: Seven learning ATRON robots consisting of 3 to 12 modules.

6.3. Physical ATRON Robots

In this section we present experiments with physical ATRON robots to investigate if the accelerated learning strategy can successfully be transferred from simulation to physical robots. The learning strategy is executed by the modules, and only the reward signal is computed externally. We utilize two different robots, a three-module two-wheeler and an eight-module quadrupedal, which has a passive ninth module used only for wireless communication. For each robot, we report on five experimental trials; two extra experiments (one for each robot) were excluded due to mechanical failures during the experiments. An experimental trial ran until the robot had converged to a near-optimal gait (known from the equivalent simulation experiments) and stayed unchanged for several minutes. Since not all physical experiments are of equal duration, we extrapolate some experiments with the average velocity of their last 10 learning iterations to generate the graphs of Fig. 9(a) and 9(c). In total, we report on more than 4 hours of physical experimental time.

6.3.1. Two-wheeler

Table 3 shows details for five experimental trials with a two-wheeler robot. The time to converge to driving either forward or backward is given (i.e., the fastest known gaits for the two-wheeler). For comparison, the equivalent convergence time measured in simulation experiments is also given. In three of the five experiments, the robot converges to the best-known solution within the first minute. As was also observed in simulation trials, in the other two trials the robot was stuck for an extended period in a suboptimal behavior before it finally converged. We observe that the physical robot on average converges a minute slower than the simulated robot, but there is no significant difference (P=0.36) between simulation and physical experiments in terms of mean convergence time. Fig. 9 shows the average velocity (reward given to the robot) as a function of time for the two-wheeler in both simulation and on the physical robot. The found gaits are symmetrically identical; however, the physical robot moves faster than in simulation due to simulator inaccuracies.

6.3.2. Quadrupedal

Pictures from an experimental trial are shown in Fig. 10, where a 9-module quadrupedal (8 active modules and 1 for wireless communication) learns to move. Table 3 summarizes the results of five experimental trials. In all five trials, the robot converges to a best-known gait (symmetrically identical to the gaits found in simulation). The average convergence time is less than 15 minutes, which is slower than the average of 12 minutes it takes to converge in simulation. The difference is, however, not statistically significant (P=0.29). Fig. 9 shows the average velocity versus time for both simulated and physical experiments with the quadrupedal. We observe that the measured velocity in the physical trials contains more noise than in the simulated trials. Furthermore, the physical robot also achieves a higher velocity than in simulation (due to simulator inaccuracies). Another observation is that the velocity difference between the fastest and second-fastest gait is smaller in the physical experiments than in simulation, which together with the extra noise may explain why the physical trials on average converge almost 3 minutes slower than in simulation.


           Two-wheeler              Quadrupedal
Exp.       Conv. Time  Exp. Time    Conv. Time  Exp. Time
           (seconds)   (seconds)    (seconds)   (seconds)
1          28          629          1680        2401
2          35          707          1232        2157
3          567         967          448         1891
4          28          695          700         2236
5          798         1020         336         2468
Total      1456        4018         4396        11153
Phy. mean  291         804          879         2231
Sim. mean  225         —            701         —

Table 3: Results of online learning on the two-wheeler and quadrupedal robots.

(a) Physical Two-Wheeler. (b) Simulated Two-Wheeler. (c) Physical Quadrupedal. (d) Simulated Quadrupedal. Axes: time (seconds) versus velocity (meters/sec).

Figure 9: Average velocity of five trials as a function of time for both physical and simulated experiments with a two-wheeler and a quadrupedal. Points are the average reward at a given timestep and the lines indicate the trend.

Figure 10: Pictures from a learning experiment with the quadrupedal walker. A 7-second period is shown. The robot starts in its home position, performs a locomotion period, and then returns to its home position. In each of the five experiments, the quadrupedal converged to symmetrically equivalent gaits. All five gaits were equivalent to the gaits found in simulation.



6.4. Gait-Table Learning with ATRON Modules

In this experiment we apply the accelerated learning strategy to gait control tables for an ATRON snake (a chain of seven modules), a millipede-2 (two leg pairs, 10 modules), a millipede-3 (three leg pairs, 15 modules) and a quadrupedal (8 modules). We do this to study if the learning strategy can be utilized to optimize gait-tables while still being morphology independent.

Fig. 11 shows the convergence as the average velocity achieved over time, compared with randomly moving robots. Note that a robot tends to quickly learn a better-than-random gait, and that this gait gradually improves over time. Compared to blind random search, the convergence speed is similar, but the learning strategy finds significantly better gaits, e.g., on average 9.9 cm/sec for random search versus 13.3 cm/sec for the learning strategy on the millipede-3 robot. Although the robots move slower than if they could perform unlimited rotations, the gaits found are quite efficient. Also note that in the case of the snake robot, the action-based learning strategy fails to converge since the robot entangles itself, while this gait-table-based strategy converges to undulation-style gaits. All 40 experimental trials converged to well-performing gaits. Divergence happens in a few cases, e.g., when a snake robot rolls upside down during learning and then has to learn to move from this state.

Some typical gaits found: The snake moves with a side-winding gait, with a traveling wave down the chain. The snake lifts parts of its body off the ground as it moves. A typical quadrupedal gait could use a back foot partly as a wheel and partly as a foot. Its side legs move back and forward for movement, while the front leg is used just for support. The millipede-2 has a trot-style gait, where diagonally opposite legs move together. The millipede-3 uses a similar gait, with each leg oscillating back and forward with some unrecognizable scheme of synchronization between the legs.

6.5. Gait-Table Learning with M-TRAN

In this experiment we apply the gait-table-based learning strategy to three simulated M-TRAN robots (see Fig. 12): a 6-module caterpillar (12 DOF), a 4-module mini walker (8 DOF) and an 8-module walker (16 DOF). We perform the study to see if the learning strategy can be utilized on a different modular robotic platform than the ATRON.

Fig. 13 shows convergence graphs for the three robots. Notice that the performance of the gaits quickly becomes better than random and that the gaits gradually improve over time. The learning succeeds in finding efficient gaits for all three robots. Because of the short learning iteration (T = 1.5 seconds), even a pose shift can be measured as a quite high velocity. Therefore, randomly moving robots incorrectly seem to move quite fast, but this does not affect the learning strategy, since it quickly converges to gaits with movement in a specific direction. We observe that the large learning space leaves room for incremental learning.

A major challenge with learning M-TRAN gaits is that the robot often falls over while learning. This happened in 23 percent, 8 percent, and 47 percent of the two-hour trials with the mini walker, caterpillar, and walker, respectively. These trials were censored away in the presented results, which are based on 10 completed trials per robot.

Some typical learned gaits: Typical gaits for the mini walker consist of hopping movements, with two modules producing movement and two modules creating stability. For the caterpillar, the learning typically finds gaits either with a horizontal travelling wave down the chain of modules or gaits that use the head and tail modules to push on the ground. Successful gaits for the walker take relatively short steps, since the robot would otherwise fall over. In one example trial the walker used three legs to produce movement, while the fourth leg was kept lifted off the ground in front of the robot.

6.6. Self-Reconfiguration and Module Fault

In this experiment, we study if the accelerated action-based learning strategy is able to adapt to changes in the robot morphology due to self-reconfiguration and module faults. Initially, we let a crawler-type robot (8 modules) learn to move (see Fig. 14). At learning iteration 250 (after 29 minutes), the robot is programmed to self-reconfigure into a quadrupedal-type robot. Afterwards, the learning is continued without resetting the learning system. After an additional 250 iterations, we simulate a module failure by stopping a leg module in a non-home position. 250 iterations later, we reactivate the module and let the learning continue for another 250 iterations.


(a) Snake. (b) Quadrupedal. (c) Millipede-2. (d) Millipede-3. Axes: time (seconds) versus velocity (meters/second).

Figure 11: Convergence graphs for the four different robots assembled from joint-limited ATRON modules. As a baseline, the average velocity of randomly moving robots is also shown. Each graph is the average of 10 trials. Error bars indicate standard deviation.

(a) Mini (b) Caterpillar (c) Walker

Figure 12: Three M-TRAN robots consisting of 4, 6, and 8 modules.


(a) Mini. (b) Caterpillar. (c) Walker. Axes: time (seconds) versus velocity (meters/second).

Figure 13: Average velocity as a function of time for three M-TRAN robots. Each graph is the average velocity of 10 independent trials. Average velocity of randomly moving robots is shown for comparison. Error bars indicate one standard deviation.

Figure 14: Procedure for testing adaptation to self-reconfiguration and module fault. Initially the robot is of a crawler type; it then self-reconfigures to a quadrupedal; then a module fails, and finally the module becomes functional again.


Figure 15: Learning during robot self-reconfiguration and module fault. In each case, the learning enables the robot to adapt to the new situation by changing the locomotion gait. The graph is the average of 10 trials, with standard deviation as error bars. The bottom line is 10 trials of the equivalent robot moving randomly.


Figure 16: Procedure for testing adaptation of the gait-table to module loss. Initially the robot is of a quadrupedal type; it then disconnects a module as shown.

Fig. 15 shows the average results of 10 trials. After both the self-reconfiguration and the module fault, we observe a drop in fitness, as expected. In both cases, the learning system is able to adapt to its changed morphology and regain a higher velocity. In the case where a leg module is reactivated, there is no initial drop in fitness, but afterwards the robot learns again to use its leg and the average velocity increases.

6.7. Gait-Table Learning with Module Loss

In this experiment we study if a gait-table-controlled ATRON quadrupedal robot is able to adapt to the loss of a module used as a lower limb. The gait-table is optimized using the accelerated learning strategy. First, the quadrupedal learns to move for 1000 iterations. Then the robot disconnects a specific module, thereby disabling its movement, as illustrated in Fig. 16. The robot is then given an additional 1000 iterations to learn to move with this new morphology.

Fig. 17 shows the average velocity of 10 independent trials. For the first 1000 iterations the robot gradually increases its velocity. Then the module is disconnected and we observe a drastic drop in velocity, to the level of a randomly moving robot. The robot slowly adapts to its new morphology and regains the majority of its lost velocity. Fig. 18 shows two gait-tables from a typical trial. The first gait-table is from just before the module is disconnected and the other is from the end of the trial. We observe that the two gait-tables contain mainly different joint values (77%), which may indicate that the two morphologies required quite different gait patterns for effective locomotion.

6.8. Scalability

We performed simulated experiments with a scalable millipede ATRON robot to study the scalability of the learning strategy (see Fig. 19). In the following experiments, we vary the number of legs from 4 to 36 in steps of 4, with 10 learning trials per robot.

(Annotations: Module Disconnected; Quadrupedal → Tripedal. Axes: time (seconds) versus velocity (meters/sec).)

Figure 17: Learning after the loss of a module, using an ATRON quadruped controlled with a gait-table. The graph is the average of 10 trials, with standard deviation as error bars. The bottom line is 10 trials of the equivalent robot moving randomly.

Figure 19: Example of a millipede robot with 8 leg pairs and 40 modules, used to study scalability. Given the basic actions of ATRON, the best-known controller rotates the legs as indicated.


     m0   m1   m2   m3   m4   m5   m6   m7
t0   30   -30   60   60  -60   30   30   30
t1    0   -60    0   60  -30    0  -60   30
t2  -60   -30  -60   60  -60  -60   30    0
t3  -30    30   30  -60  -30  -30   60  -60
t4    0   -60   60  -60    0   30    0  -30

(a) Before Module Loss

     m0   m1   m2   m3   m4   m5   m6   m7
t0   30   -60  -60   30    *  -60  -30   30
t1   30   -30    0  -30    *   30  -30    0
t2  -60   -60  -60   30    *  -30  -30  -30
t3  -60   -30   30  -60    *   60  -60   30
t4    0    30  -30    0    *   60  -60  -60

(b) After Module Loss

Figure 18: Typical trial of the gait-table learning with module loss experiment. (a) Gait-table for the eight modules (m0...m7) just before m4 is disconnected (iteration 1000). (b) Gait-table at the end of the trial (iteration 2000); 27 of 35 joint values in the table have changed as a result of the optimization.

[Plots for Fig. 20: (a) Convergence: convergence time (sec., 0-5000) versus number of leg pairs (2-12) and number of modules (10-60); (b) Divergence: divergence frequency (per two-hour trial, 0.0-1.0) versus number of leg pairs and number of modules (10-90), with converging and diverging trials marked.]

Figure 20: (a) Convergence time versus number of modules. (b) Divergence versus number of modules. The robots are millipedes with 4 to 24 legs. Error bars indicate one standard deviation.


In the following experiments, we vary the number of legs from 4 to 36 in steps of 4, with 10 learning trials per robot.

For this specific experiment, we define the time of convergence as the time at which 85% of the leg modules have learned to contribute to the robot's movement; that is, each such leg module rotates either left or right, depending on its position in the robot and the direction of locomotion. The time to converge is shown in Fig. 20(a). As expected, an increase in the number of modules also increases the convergence time; the relation is approximately linear for this robot in the interval shown. The increase in convergence time is rather slow: for each module added, the convergence time is prolonged by 52 seconds (based on a least-squares fit):

$t_{\mathrm{convergence}} = 52 \cdot \#\mathrm{modules} + 182$ sec. (6)
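For example, under this fit, the 40-module millipede of Fig. 19 would be expected to converge in $52 \cdot 40 + 182 = 2262$ seconds, i.e., in roughly 38 minutes.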

Beyond this interval of up to 60 modules, divergence becomes the dominating factor, i.e., the robot forgets the already-learned behavior.

We measure learning divergence as a major drop in the number of leg modules contributing to moving the millipede. The frequency of divergence for each robot is shown in Fig. 20(b). We observe that the divergence frequency increases with the number of modules. The reason is that, as the number of modules increases, the effect any individual module has on the robot decreases. Therefore, for a given module, the estimates for each of its actions will be almost identical, and a small disturbance can trigger the divergence effect. This decrease in the signal-to-noise ratio is illustrated in Fig. 21 for a 6-legged and a 20-legged millipede. The graphs show the result of using a simple, regressing controller on the robots: initially, the robot moves with the best known controller; then, every 200 seconds, one of the leg modules reverses its action from rotating left to right, or vice versa. This causes the robot to slow down until it has almost fully stopped. For the small robot, the velocity difference from one module change to the next is large, and a module can easily detect it. For the large robot, this difference is small compared to the noise, which explains why the divergence effect increases with the number of modules.
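The scaling argument can be made concrete with a small numerical sketch (Python). It assumes, purely for illustration, that the robot's top speed is shared roughly equally among its leg modules (so one module reversing its rotation changes the velocity by about v_max/n_legs) and that velocity readings carry zero-mean Gaussian noise; the velocity model and noise level are assumptions, not measured values:

    import random

    V_MAX = 0.03      # m/s; roughly the velocity plateau seen in Fig. 21
    NOISE_SD = 0.002  # m/s; assumed measurement noise (illustrative)

    def detection_rate(n_legs, trials=10_000):
        """Fraction of paired velocity readings in which the sign of one
        module's contribution is still recoverable despite the noise."""
        step = V_MAX / n_legs  # assumed per-module share of the velocity
        detected = 0
        for _ in range(trials):
            # Difference of two noisy readings taken before/after one
            # module changes its action.
            diff = step + random.gauss(0.0, NOISE_SD) - random.gauss(0.0, NOISE_SD)
            detected += diff > 0
        return detected / trials

    for legs in (6, 20, 60):
        print(f"{legs:2d} legs: sign recovered in {detection_rate(legs):.0%} of pairs")

Under these assumptions, the detection rate falls from near-certain for a 6-legged robot toward a coin flip as the number of legs grows, mirroring the divergence trend in Fig. 20(b).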

7. Discussion

Adaptive robots often utilize a centralized learning strategy. In this paper, the learning is distributed to the individual modules, which optimize their behavior independently and in parallel, based only on a simple centralized reward signal. In Sect. 6.1, we observed that a centralized strategy similar to localized random search emerged from these distributed adaptive modules. The extent to which this emergent phenomenon can be utilized in learning scenarios other than locomotion is an open research question.
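The flavor of such a distributed learner can be conveyed in a few lines. The sketch below (Python) is illustrative only: it assumes an epsilon-greedy action-value scheme over a small discrete action set, with the robot's velocity broadcast identically to all modules as a shared reward; the action names and constants are ours, not a verbatim rendering of the strategy defined earlier in the paper:

    import random

    class ModuleLearner:
        """Illustrative per-module learner; action set and update rule
        are assumptions standing in for the paper's strategy."""

        ACTIONS = ("rotate_left", "rotate_right", "stop")

        def __init__(self, epsilon=0.1, alpha=0.1):
            self.epsilon = epsilon  # exploration probability
            self.alpha = alpha      # step size of the running reward average
            self.value = {a: 0.0 for a in self.ACTIONS}
            self.current = random.choice(self.ACTIONS)

        def select_action(self):
            # Explore occasionally; otherwise pick the best-valued action.
            if random.random() < self.epsilon:
                self.current = random.choice(self.ACTIONS)
            else:
                self.current = max(self.value, key=self.value.get)
            return self.current

        def receive_reward(self, velocity):
            # Update only the estimate of the action just performed, using
            # the robot's velocity as the shared, global reward.
            old = self.value[self.current]
            self.value[self.current] = old + self.alpha * (velocity - old)

One learning iteration then consists of every module calling select_action(), the robot executing the chosen actions for a fixed period, and the measured velocity being fed back to every module via receive_reward().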

A robot controller is typically designed specifically for a robot's morphology. The proposed strategy does not make any assumptions about how the modules are connected to form a specific robot configuration. Combined with the distributed implementation, this makes the strategy morphology-independent, and it will automatically adapt to a large class of different morphologies, as was experimentally demonstrated in Sect. 6.2. For the same reasons, the strategy is also able to adapt to dynamic changes in its morphology, e.g., due to self-reconfiguration, module failure, or module loss, as the experiments in Sect. 6.6 and 6.7 demonstrated.

The strategy will, however, generally not work on morphologies that fall over or self-collide while learning. Currently, the strategy makes no attempt to deal with these issues, and it is up to the robot designer to select a morphology and initial pose that minimize the risk of such events. Higher-level strategies could be developed to increase the system's robustness. For example, Morimoto and Doya (2001) demonstrated a robot able to learn to stand up after falling over. Similarly, for fault-tolerant self-reconfiguration, distributed strategies have been developed to deal with internal collisions, unreliable communication, and hardware failures (Christensen, 2007; Schultz et al., 2011).

No single optimization strategy will be universally good for all problems (Wolpert and Macready, 1997); therefore, the strategy must be selected to match the problem of interest. The strategy proposed in this paper is minimalistic in order to facilitate a simple distributed implementation with low resource requirements. Clearly, less minimal strategies that, for example, utilize gradient information in the optimization process might in some cases converge faster and find gaits with a higher velocity. We have demonstrated such a distributed strategy in recent work based on stochastic optimization of central-pattern-generator parameters (Christensen et al., 2012, 2010c).

Ideally, a control strategy should generalize and be applicable to more than a single robotic platform.


[Plots for Fig. 21: velocity (meters/sec) versus time (seconds) for (a) a 6-leg millipede (0-800 s) and (b) a 20-leg millipede (0-2000 s).]

Figure 21: Divergence in large-scale robots. If the number of modules is low, a single module may have a large effect on the overall velocity compared to noise; the opposite is the case for robots with many modules.

We demonstrated in a number of experiments that the proposed strategy could be applied not only to freely rotating ATRON robots, but also to joint-limited ATRON and M-TRAN robots, by optimizing gait control tables (Sect. 6.4 and 6.5). Gait control tables are highly portable to both modular and monolithic robots. However, the strategy can only fill the table with values from a small, fixed set of possible joint positions. In practice, this can limit the usability of the strategy, and methods that optimize floating-point values may be more suitable for gait-control-table optimization.
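To illustrate this portability, a complete gait-table player fits in a few lines. In the sketch below (Python), set_joint_angle is a placeholder for the platform-specific actuation call, and the example table uses the same small set of joint angles (degrees) as Fig. 18:

    import itertools
    import time

    # Layout as in Fig. 18: each row is one time step, each column the
    # target angle (degrees) of one module's joint, drawn from a small
    # fixed set such as {-60, -30, 0, 30, 60}.
    gait_table = [
        [ 30, -30,  60,  60],
        [  0, -60,   0,  60],
        [-60, -30, -60,  60],
    ]

    def run_gait(table, set_joint_angle, step_period=1.0):
        """Cycle through the table rows forever, commanding every joint.
        set_joint_angle(module_index, degrees) is a platform-specific hook."""
        for row in itertools.cycle(table):
            for module, angle in enumerate(row):
                set_joint_angle(module, angle)
            time.sleep(step_period)  # let the joints reach their targets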

The problem of transferring a robot controller from simulation to the physical robot needs special attention, due to the difficulty of accurately simulating the complex interactions between the robot and its environment (Brooks, 1992; Mataric and Cliff, 1996). The proposed strategy is model-less and online-adaptive, and it does not currently utilize direct sensor feedback, which makes the transference problem less problematic, as we demonstrated with ATRON robots in Sect. 6.3. However, if the strategy is to be applied to robots learning in natural terrain, with onboard sensors for measuring velocity, special attention must be given to the signal-to-noise ratio of the reward signal.

In this paper, the reward signal is measured externally (in the simulator or using an overhead camera) and globally distributed to the modules. To increase autonomy, onboard sensors could be used to estimate the robot's velocity, e.g., an inertial measurement unit (IMU) or an optical-flow sensor. However, whether such sensors are sufficiently reliable for the presented learning strategy is an open question. To realize a fully distributed learning system, each module must optimize its behavior based on a local reward signal derived from local sensors; such a system was demonstrated by Varshavskaya et al. (2004) for learning cluster-walking behavior for self-reconfigurable robots.

In practical robotic applications, a learning strategy must converge quickly and reliably. For the ATRON robots we studied in Sect. 6.2, we found that the non-snake robots would converge to best-known gaits in 96% of the cases, within 15 minutes on average. For the gait-table experiments, the convergence time was more often around 60 minutes. We anticipate that this might be sufficiently reliable and fast for some practical applications, but probably not all. Research still remains in exploring how best to improve these characteristics.

Scalability in terms of the number of modules is of special concern in distributed robotics. We found, for an example ATRON robot, that the learning convergence time would grow only slowly, approximately linearly, with the number of modules. However, divergence would become increasingly frequent with the number of modules, and beyond 50 modules it would be more likely than not to occur during a two-hour trial (Sect. 6.8). These scale effects were due to a decrease in the signal-to-noise ratio, which will have a negative effect on any learning algorithm. We anticipate that additional inspiration from the neuroscience of biological organisms might enable us to form hypotheses about how to cope with this scalability issue.

The module loss experiment, presented in Sect. 6.7, is comparable in form and complexity to the experiments by Bongard et al. (2006) on self-repair based on self-modeling.


Both experiments utilize an 8-DOF walking quadruped robot controlled by a gait-table. Further, in both experiments, damage is induced by removing a lower limb in order to study the process of self-repair. However, the underlying mechanisms are very different, resulting in different trade-offs between the two methods. In Bongard et al. (2006), to recover from damage, a self-model (with 16 open parameters) and a gait-table are optimized using only 16 physical actions and approximately 300k simulated actions. In our experiment, the strategy requires approximately 1k physical actions and no simulated actions. Therefore, compared to the self-modeling strategy, the advantages of our distributed strategy are that it is simple and computationally minimal, does not require a simulation system, and makes fewer assumptions about the robot morphology. The advantages of the self-modeling strategy are that it requires far fewer physical actions and that the self-model can also be used for predictive purposes (e.g., to predict whether a given action will make the robot fall over). Neither strategy can be considered a stand-alone control system, and which is more appropriate depends highly on the application. How best to combine the two strategies is an open question.

8. Conclusion

In this paper, we studied a distributed and morphology-independent strategy for adaptive locomotion in modular self-reconfigurable robots. We described a simple reinforcement learning strategy and proposed a heuristic for accelerating the learning.

We applied the strategy to optimize both simple actions and gait-control tables. We presented simulated and physical experiments for a range of different robots constructed from ATRON and M-TRAN modules. In simulation, we studied a learning quadrupedal robot and found that a higher-level learning strategy, similar to localized random search, emerged from its independently learning modules. We performed experiments in simulations of ATRON modules which indicate that the strategy is sufficient to learn quite efficient locomotion gaits for a large range of different morphologies, up to 12-module robots. A typical learning trial converged in less than 15 minutes, depending on the size and type of the robot. Further, we performed experiments with physical ATRON robots learning online to move; these experiments validated our simulation results. Moreover, we experimentally found that the strategy was able to optimize gait control tables for both joint-limited ATRON and M-TRAN modules. However, as anticipated, the increased size of the learning space came at the cost of a prolonged time to learn a gait. Yet even the most complex gaits were typically learned within one hour. In addition, we presented simulated experiments with faults, self-reconfiguration, and module loss that illustrated the advantages of utilizing a distributed and configuration-independent learning strategy. We observed that, after self-reconfiguration, the modules were able to learn to move with a new morphology and to adapt to module faults and loss. In simulation, we studied the scalability characteristics of the learning strategy and found that it could learn to move a robot with up to 60 modules. However, the effects of divergence in the learning would eventually become dominant and prevent the robot from being scaled up further. We also found that the convergence time increased slowly, approximately linearly, with the number of modules within the functional range.

In conclusion, we found that learning can effectively be distributed by introducing independent processes that learn in parallel for online locomotion-gait learning in modular robots. Furthermore, this approach has several advantages compared to centralized methods, such as simplicity of implementation, low resource requirements, morphology independence, adaptability to self-reconfiguration, and inherent fault tolerance.

Acknowledgements

This work was performed as part of the "Locomorph" project, funded by the EU's Seventh Framework Programme (Future Emerging Technologies, Embodied Intelligence), and as part of the "Assemble and Animate" project, funded by the Danish Council for Independent Research (Technology and Production Sciences).

References

Bongard, J., Zykov, V., Lipson, H., 2006. Resilient machines through continuous self-modeling. Science 314 (5802), 1118–1121.

Bongard, J. C., Pfeifer, R., 2003. Evolving complete agents using artificial ontogeny. In: Morpho-functional Machines: The New Species. Springer, pp. 237–258.

Brandt, D., Christensen, D. J., Nov. 2007. A new meta-module for controlling large sheets of ATRON modules. In: Proc. of IEEE/RSJ International Conference on Intelligent Robots and Systems. San Diego, California, pp. 2375–2380.

Brooks, R. A., 1992. Artificial life and real robots. In: Toward a Practice of Autonomous Systems: Proceedings of the First European Conference on Artificial Life. Cambridge, MA, USA, pp. 3–10.

Brunete, A., Hernando, M., Gambao, E., Apr. 2013. Offline GA-based optimization for heterogeneous modular multiconfigurable chained microrobots. IEEE/ASME Transactions on Mechatronics 18 (2), 578–585.

Butler, Z., Rus, D., 2003. Distributed locomotion algorithms for self-reconfigurable robots operating on rough terrain. In: Proceedings of IEEE International Symposium on Computational Intelligence in Robotics and Automation (CIRA'03). pp. 880–885.

Castano, A., Shen, W.-M., Will, P., June 2000. CONRO: Towards deployable robots with inter-robot metamorphic capabilities. Autonomous Robots 8 (3), 309–324.

Chirikjian, G., 1994. Kinematics of a metamorphic robotic system. In: Proc. of IEEE Intl. Conf. on Robotics and Automation. pp. 449–455.

Christensen, D. J., May 2006. Evolution of shape-changing and self-repairing control for the ATRON self-reconfigurable robot. In: Proceedings of the IEEE International Conference on Robotics and Automation (ICRA'06). Orlando, Florida, pp. 2539–2545.

Christensen, D. J., April 2007. Experiments on fault-tolerant self-reconfiguration and emergent self-repair. In: Proceedings of Symposium on Artificial Life, part of the IEEE Symposium Series on Computational Intelligence. Honolulu, Hawaii, pp. 355–361.

Christensen, D. J., Bordignon, M., Schultz, U. P., Shaikh, D., Stoy, K., 2008a. Morphology independent learning in modular robots. In: Proceedings of International Symposium on Distributed Autonomous Robotic Systems 8 (DARS 2008). pp. 379–391.

Christensen, D. J., Campbell, J., Stoy, K., 2010a. Anatomy-based organization of morphology and control in self-reconfigurable modular robots. Neural Computing and Applications (NCA) 19 (6), 787–805.

Christensen, D. J., Larsen, J. C., Stoy, K., 2012. Adaptive strategy for online gait learning evaluated on the polymorphic robotic LocoKit. In: Proceedings of the IEEE Conference on Evolving and Adaptive Intelligent Systems (EAIS).

Christensen, D. J., Schultz, U. P., Brandt, D., Stoy, K., 2008b. A unified simulator for self-reconfigurable robots. In: Proceedings of IEEE/RSJ International Conference on Intelligent Robots and Systems. pp. 870–876.

Christensen, D. J., Schultz, U. P., Stoy, K., May 2010b. A distributed strategy for gait adaptation in modular robots. In: Proceedings of the IEEE International Conference on Robotics and Automation (ICRA2010). Anchorage, Alaska, pp. 2765–2770.

Christensen, D. J., Sproewitz, A., Ijspeert, A. J., August 2010c. Distributed online learning of central pattern generators in modular robots. In: Proceedings of the 11th International Conference on Simulation of Adaptive Behavior (SAB2010). Paris, France, pp. 402–412.

Davey, J., Kwok, N., Yim, M., Oct. 2012. Emulating self-reconfigurable robots - design of the SMORES system. In: Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). pp. 4464–4469.

Faíña, A., Bellas, F., Souto, D., Duro, R., 2011. Towards an evolutionary design of modular robots for industry. Foundations on Natural and Artificial Computation, 50–59.

Fitch, R., Butler, Z., March 2008. Million module march: Scalable locomotion for large self-reconfiguring robots. Int. J. Rob. Res. 27, 331–343.

Fukuda, T., Nakagawa, S., 1988. Dynamically reconfigurable robotic system. In: Proceedings of the IEEE International Conference on Robotics & Automation (ICRA'88). pp. 1581–1586.

Fukuda, T., Nakagawa, S., Kawauchi, Y., Buss, M., 1988. Self organizing robots based on cell structures - CEBOT. In: Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS'88). pp. 145–150.

Gilpin, K., Rus, D., Sept. 2010. Modular robot systems. IEEE Robotics & Automation Magazine 17 (3), 38–55.

Haasdijk, E., Rusu, A., Eiben, A., 2010. HyperNEAT for locomotion control in modular robots. Evolvable Systems: From Biology to Hardware, 169–180.

Hamann, H., Stradner, J., Schmickl, T., Crailsheim, K., 2010. A hormone-based controller for evolutionary multi-modular robotics: From single modules to gait learning. In: Evolutionary Computation (CEC), 2010 IEEE Congress on. IEEE, pp. 1–8.

Handier, M. D., Hornby, G. S., 2007. Evolving quadruped gaits with a heterogeneous modular robotic system. In: Proceedings of the IEEE Congress on Evolutionary Computation. IEEE, pp. 3631–3638.

Hornby, G., Takamura, S., Yokono, J., Hanagata, O., Yamamoto, T., Fujita, M., 2000. Evolving robust gaits with AIBO. In: Proceedings of the IEEE International Conference on Robotics and Automation. pp. 3040–3045.

Ijspeert, A. J., 2008. Central pattern generators for locomotion control in animals and robots: a review. Neural Networks 21 (4), 642–653.

Ishiguro, A., Shimizu, M., Kawakatsu, T., 2004. Don't try to control everything!: an emergent morphology control of a modular robot. In: Proceedings of IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). pp. 981–985.

Kamimura, A., Kurokawa, H., Yoshida, E., Murata, S., Tomita, K., Kokaji, S., June 2005. Automatic locomotion design and experiments for a modular robotic system. IEEE/ASME Transactions on Mechatronics 10 (3), 314–325.

Kamimura, A., Kurokawa, H., Yoshida, E., Tomita, K., Murata, S., Kokaji, S., 2003. Automatic locomotion pattern generation for modular robots. In: IEEE International Conference on Robotics and Automation (ICRA). pp. 714–720.

Kernbach, S., Meister, E., Scholz, O., Humza, R., Liedke, J., Ricotti, L., Jemai, J., Havlik, J., Liu, W., 2009. Evolutionary robotics: The next-generation-platform for on-line and on-board artificial evolution. In: Evolutionary Computation, 2009. CEC'09. IEEE Congress on. IEEE, pp. 1079–1086.

Kotay, K., Rus, D., Vona, M., McGray, C., 1998. The self-reconfiguring robotic molecule. In: Proc., IEEE Int. Conf. on Robotics & Automation (ICRA'98). Leuven, Belgium, pp. 424–431.

Kurokawa, H., Kamimura, A., Yoshida, E., Tomita, K., Kokaji, S., Murata, S., 2003. M-TRAN II: Metamorphosis from a four-legged walker to a caterpillar. In: Proceedings of IEEE/RSJ International Conference on Intelligent Robots and Systems. pp. 2454–2459.


Kurokawa, H., Murata, S., Yoshida, E., Tomita, K., Kokaji, S., 1998. A 3-D self-reconfigurable structure and experiments. In: Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS'98). Vol. 2. Victoria, B.C., Canada, pp. 860–865.

Kurokawa, H., Tomita, K., Kamimura, A., Kokaji, S., Hasuo, T., Murata, S., 2008. Distributed self-reconfiguration of M-TRAN III modular robotic system. International Journal of Robotics Research 27 (3-4), 373–386.

Lipson, H., Pollack, J. B., 2000. Automatic design and manufacture of robotic lifeforms. Nature 406, 974–978.

Maes, P., Brooks, R. A., 1990. Learning to coordinate behaviors. In: National Conference on Artificial Intelligence. pp. 796–802.

Mahdavi, S. H., Bentley, P. J., 2003. An evolutionary approach to damage recovery of robot motion with muscles. In: Seventh European Conference on Artificial Life (ECAL03). Springer, pp. 248–255.

Marbach, D., Ijspeert, A. J., 2004. Co-evolution of configuration and control for homogenous modular robots. In: Proc., 8th Int. Conf. on Intelligent Autonomous Systems. Amsterdam, Holland, pp. 712–719.

Marbach, D., Ijspeert, A. J., 2005. Online optimization of modular robot locomotion. In: Proceedings of the IEEE Int. Conference on Mechatronics and Automation (ICMA 2005). pp. 248–253.

Mataric, M., Cliff, D., 1996. Challenges in evolving controllers for physical robots. Robotics and Autonomous Systems 19 (1), 67–83.

Mataric, M. J., 1997. Reinforcement learning in the multi-robot domain. Auton. Robots 4 (1), 73–83.

Meng, Y., Zhang, Y., Sampath, A., Jin, Y., Sendhoff, B., 2011. Cross-ball: a new morphogenetic self-reconfigurable modular robot. In: Robotics and Automation (ICRA), 2011 IEEE International Conference on. IEEE, pp. 267–272.

Moreno, R., Gomez, J., 2011. Central pattern generators and hormone inspired messages: A hybrid control strategy to implement motor primitives on chain type modular reconfigurable robots. In: Robotics and Automation (ICRA), 2011 IEEE International Conference on. IEEE, pp. 1014–1019.

Morimoto, J., Doya, K., 2001. Acquisition of stand-up behavior by a real robot using hierarchical reinforcement learning. Robotics and Autonomous Systems 36 (1), 37–51.

Murata, S., Kurokawa, H., 2012. Self-organizing robots. Springer.

Murata, S., Kurokawa, H., Kokaji, S., 1994. Self-assembling machine. In: Proceedings of IEEE International Conference on Robotics and Automation (ICRA'94). pp. 441–448.

Murata, S., Yoshida, E., Tomita, K., Kurokawa, H., Kamimura, A., Kokaji, S., 2000. Hardware design of modular robotic system. In: Proceedings, IEEE/RSJ Int. Conf. on Intelligent Robots and Systems (IROS'00). Takamatsu, Japan, pp. 2210–2217.

Østergaard, E. H., Kassow, K., Beck, R., Lund, H. H., 2006. Design of the ATRON lattice-based self-reconfigurable robot. Autonomous Robots 21, 165–183.

Østergaard, E. H., Lund, H. H., 2004. Distributed cluster walk for the ATRON self-reconfigurable robot. In: Proceedings of the 8th Conference on Intelligent Autonomous Systems (IAS-8). Amsterdam, Holland, pp. 291–298.

Pouya, S., van den Kieboom, J., Sprowitz, A., Ijspeert, A. J., October 2010. Automatic gait generation in modular robots: "to oscillate or to rotate; that is the question". In: Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems. Taipei, Taiwan, pp. 514–520.

Rus, D., Vona, M., 2000. A physical implementation of the self-reconfiguring crystalline robot. In: Proceedings of the IEEE International Conference on Robotics and Automation (ICRA'00). Vol. 2. pp. 1726–1733.

Schultz, U. P., Bordignon, M., Stoy, K., 2011. Robust and reversible execution of self-reconfiguration sequences. Robotica 29 (1), 35–57.

Shen, W.-M., Krivokon, M., Chiu, H., Everist, J., Rubenstein, M., Venkatesh, J., 2006. Multimode locomotion via SuperBot reconfigurable robots. Autonomous Robots 20 (2), 165–177.

Shen, W.-M., Salemi, B., Will, P., 2000. Hormones for self-reconfigurable robots. In: Proc., Int. Conf. on Intelligent Autonomous Systems (IAS-6). Venice, Italy, pp. 918–925.

Shen, W.-M., Salemi, B., Will, P., 2002. Hormone-inspired adaptive communication and distributed control for CONRO self-reconfigurable robots. IEEE Transactions on Robotics and Automation 18, 700–712.

Sims, K., 1994. Evolving 3D morphology and behavior by competition. In: Brooks, R., Maes, P. (Eds.), Proc., Artificial Life IV. MIT Press, pp. 28–39.

Smith, R., 2005. Open Dynamics Engine. www.ode.org.

Sproewitz, A., Moeckel, R., Maye, J., Ijspeert, A. J., 2008. Learning to move in modular robots using central pattern generators and online optimization. Int. J. Rob. Res. 27 (3-4), 423–443.

Sproewitz, A., Pouya, S., Bonardi, S., van den Kieboom, J., Mockel, R., Billard, A., Dillenbourg, P., Ijspeert, A., 2010. Roombots: Reconfigurable robots for adaptive furniture. IEEE Computational Intelligence Magazine, special issue on Evolutionary and Developmental Approaches to Robotics 5 (3), 20–32.

Stoy, K., Brandt, D., Christensen, D. J., 2010. Self-Reconfigurable Robots: An Introduction. Intelligent Robotics and Autonomous Agents series. The MIT Press.

Stoy, K., Shen, W.-M., Will, P., 2002. Using role-based control to produce locomotion in chain-type self-reconfigurable robots. IEEE Transactions on Mechatronics 7 (4), 410–417.

Støy, K., Shen, W.-M., Will, P., 2003. Implementing configuration dependent gaits in a self-reconfigurable robot. In: Proc., IEEE Int. Conf. on Robotics and Automation (ICRA'03). Taipei, Taiwan.

Sutton, R., Barto, A., 1998. Reinforcement Learning - An Introduction. The MIT Press.

Varshavskaya, P., Kaelbling, L., Rus, D., 2008. Automated design of adaptive controllers for modular robots using reinforcement learning. International Journal of Robotics Research, Special Issue on Self-Reconfigurable Modular Robots 27 (3-4), 505–526.

Varshavskaya, P., Kaelbling, L. P., Rus, D., 2004. Learning distributed control for modular robots. In: Intelligent Robots and Systems, 2004 (IROS). Proceedings of IEEE/RSJ International Conference on. Vol. 3. pp. 2648–2653.

Wei, H., Chen, Y., Tan, J., Wang, T., 2011. Sambot: A self-assembly modular robot system. IEEE/ASME Transactions on Mechatronics 16 (4), 745–757.

Wolpert, D. H., Macready, W. G., 1997. No free lunch theorems for optimization. IEEE Transactions on Evolutionary Computation 1 (1), 67–82.

Yim, M., 1993. A reconfigurable modular robot with many modes of locomotion. In: Proceedings of the JSME International Conference on Advanced Mechatronics. Tokyo, Japan, pp. 283–288.

Yim, M., 1994a. Locomotion with a unit-modular reconfigurable robot. Ph.D. thesis, Department of Mechanical Engineering, Stanford University.

Yim, M., 1994b. New locomotion gaits. In: Proceedings, International Conference on Robotics & Automation (ICRA'94). San Diego, California, USA, pp. 2508–2514.

Yim, M., Duff, D., Roufas, K., 2000. Modular reconfigurable robots, an approach to urban search and rescue. In: Proceedings of 1st International Workshop on Human-friendly Welfare Robotics Systems. Taejon, Korea.

Yim, M., Duff, D., Zhang, Y., 2001a. Closed chain motion with large mechanical advantage. In: Proceedings, IEEE/RSJ International Conference on Intelligent Robots and Systems. Maui, Hawaii, USA, pp. 318–323.

Yim, M., Homans, S., Roufas, K., 2001b. Climbing with snake-like robots. In: Proceedings of IFAC Workshop on Mobile Robot Technology. Jejudo, Korea.

Yim, M., Shen, W.-M., Salemi, B., Rus, D., Moll, M., Lipson, H., Klavins, E., March 2007a. Modular self-reconfigurable robot systems: Challenges and opportunities for the future. IEEE Robotics & Automation Magazine 14 (1), 43–52.

Yim, M., Shirmohammadi, B., Benelli, D., April 2007b. Amphibious modular robot astrobiologist. In: Proceedings of SPIE Unmanned Systems Technology IX. Vol. 656.

Yoshida, E., Murata, S., Kamimura, A., Tomita, K., Kurokawa, H., Kokaji, S., 2001. A motion planning method for a self-reconfigurable modular robot. In: Proceedings, IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS'01). Maui, Hawaii, USA, pp. 590–597.

Yoshida, E., Murata, S., Kamimura, A., Tomita, K., Kurokawa, H., Kokaji, S., 2002. A self-reconfigurable modular robot: Reconfiguration planning and experiments. The International Journal of Robotics Research 21 (10), 903–916.

Yoshida, E., Murata, S., Kamimura, A., Tomita, K., Kurokawa, H., Kokaji, S., July 2003. Evolutionary synthesis of dynamic motion and reconfiguration process for a modular robot M-TRAN. In: Proceedings 2003 IEEE International Symposium on Computational Intelligence in Robotics and Automation (CIRA). Kobe, Japan, pp. 1004–1010.

Zahadat, P., Schmickl, T., Crailsheim, K., 2012. Evolving reactive controller for a modular robot: Benefits of the property of state-switching in fractal gene regulatory networks. From Animals to Animats 12, 209–218.

Zhang, Y., Yim, M., Eldershaw, C., Duff, D., Roufas, K., 2003. Phase automata: a programming model of locomotion gaits for scalable chain-type modular robots. In: Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS'03). Las Vegas, Nevada, USA, pp. 2442–2447.

Zhao, J., Cui, X., Zhu, Y., Tang, S., 2011. A new self-reconfigurable modular robotic system UBot: Multi-mode locomotion and self-reconfiguration. In: Robotics and Automation (ICRA), 2011 IEEE International Conference on. IEEE, pp. 1020–1025.

Zykov, V., Bongard, J., Lipson, H., 2004. Evolving dynamic gaits on a physical robot. In: Proceedings of Genetic and Evolutionary Computation Conference, Late Breaking Paper, GECCO'04.
