Delay-Aware Multi-Agent Reinforcement Learning for Cooperative and Competitive Environments

Baiming Chen, Mengdi Xu, Zuxin Liu, Liang Li, Ding Zhao

Abstract—Action and observation delays exist prevalently in real-world cyber-physical systems, which may pose challenges in reinforcement learning design. The task is particularly arduous when handling multi-agent systems, where the delay of one agent could spread to other agents. To resolve this problem, this paper proposes a novel framework to deal with delays as well as the non-stationary training issue of multi-agent tasks with model-free deep reinforcement learning. We formally define the Delay-Aware Markov Game, which incorporates the delays of all agents in the environment. To solve Delay-Aware Markov Games, we apply centralized training and decentralized execution, which allows agents to use extra information to ease the non-stationarity issue of multi-agent systems during training, without the need for a centralized controller during execution. Experiments are conducted in multi-agent particle environments including cooperative communication, cooperative navigation, and competitive experiments. We also test the proposed algorithm in traffic scenarios that require coordination of all autonomous vehicles to show the practical value of delay-awareness. Results show that the proposed delay-aware multi-agent reinforcement learning algorithm greatly alleviates the performance degradation introduced by delay. Code and demo videos are available at: https://github.com/baimingc/delay-aware-MARL.

Index Terms—Deep reinforcement learning, multi-agent, Markov game, delayed system.

I. INTRODUCTION

DEEP reinforcement learning (DRL) has made rapid progress in solving challenging problems [1], [2]. Recently, DRL has been used in multi-agent scenarios since many important applications involve multiple agents cooperating or competing with each other, including multi-robot control [3], the emergence of multi-agent communication and language [4], [5], [6], multi-player games [7], etc. Learning in multi-agent scenarios is fundamentally more difficult than the single-agent case due to many reasons, e.g., non-stationarity [8], the curse of dimensionality [9], multi-agent credit assignment [10], and global exploration [11]. For a more comprehensive review of DRL applied in multi-agent scenarios, readers are referred to [12].

Most DRL algorithms are designed for synchronous systems with instantaneous observation and action actuation. However, they are not able to handle the delay problem, which is prevalent in many real-world applications such as robotic systems [13] and parallel computing [14]. This issue is even worse in multi-agent scenarios, where the delay of one agent could spread to other coupled agents. For example, in tasks involving communication between agents, the action delay of a speaker would give rise to observation delays of all listeners subscribing to this speaker. Ignoring agent delays would not only lead to performance degradation of the agents but also destabilize dynamic systems [15], which may cause catastrophic failures in safety-critical systems. One typical example is connected and autonomous vehicles [16]. There are many sources of delays in an autonomous driving system, such as communication delay, sensor delay, time for decision making, and actuator delay of the powertrain and the hydraulic brake system. The total delay time could add up to seconds [17], [18], [19], which must be properly handled for both the performance and safety of connected vehicle systems.

Baiming Chen and Liang Li are with the State Key Laboratory of Automotive Safety and Energy, Tsinghua University, Beijing 100084, China. Mengdi Xu, Zuxin Liu, and Ding Zhao are with the Department of Mechanical Engineering, Carnegie Mellon University, Pittsburgh, PA 15213, USA.

The time-delay problem has long been studied by the control community. Several methods have been proposed, such as Artstein reduction [20], [21], the Smith predictor [22], [23], and the H∞ controller [24]. However, most of these methods assume a known model or make heavy assumptions [25], [15], which is usually not realistic for real-world tasks.

Recently, DRL has provided a promising way to solve complex sequential decision making tasks without the assumption of a known model. By modeling the problem as a Markov Decision Process (MDP), DRL aims to find the optimal policy for the MDP [26]. However, in time-delayed systems, ignoring the delay violates the Markov property and may lead to arbitrarily suboptimal policies [27]. To retrieve the Markov property, Walsh et al. [28] reformulated the MDP by augmenting the state space. Firoiu et al. [29] later used the augmented MDP to solve Atari games with action delay. Ramstedt & Pal [30] proposed a model-free DRL algorithm to efficiently solve the problem with action delays. The delay issue could also be addressed in a model-based manner by learning a dynamics model to predict the future state, as in [28], [29], [31]. However, dealing with multi-agent tasks using model-based DRL involves agents modeling agents, which introduces extra non-stationarity issues since the policies of all agents are constantly updated [12].

Most of the previous works are limited to single-agent tasks and are not able to directly handle the non-stationarity issue introduced by multiple agents. In this paper, we propose a novel framework to deal with delays as well as the non-stationarity training issue of multi-agent tasks with model-free DRL. The contribution of this paper is three-fold:

• We formulate a general model for multi-agent delayed systems, the Delay-Aware Markov Game (DA-MG), by augmenting the standard Markov Game with agent delays. We prove the solidity of this new structure with the Markov reward process.


• We develop a delay-aware training algorithm for DA-MGs that utilizes centralized training and decentralized execution to stabilize multi-agent training.

• We test our algorithm in both benchmark platforms and practical traffic scenarios.

The rest of the paper is organized as follows. We first review the preliminaries in Section II. In Section III, we formally define the Delay-Aware Markov Game (DA-MG) and prove the solidity of this new structure with the Markov reward process. In Section IV, we introduce the proposed framework of delay-aware multi-agent reinforcement learning for DA-MGs with centralized training and decentralized execution. In Section V, we demonstrate the performance of the proposed algorithm in cooperative and competitive multi-agent particle environments, as well as in traffic scenarios that require coordination of autonomous vehicles.

II. PRELIMINARIES

A. Markov Decision Process and Markov Game

In the framework of reinforcement learning, the problem is often represented by a Markov Decision Process (MDP). The definition of a standard delay-free MDP is:

Definition 1. A Markov Decision Process (MDP) is characterized by a tuple with
(1) state space $\mathcal{S}$,
(2) action space $\mathcal{A}$,
(3) initial state distribution $\rho : \mathcal{S} \to \mathbb{R}$,
(4) transition distribution $p : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}$,
(5) reward function $r : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$.

The agent is represented by a policy $\pi$ that directs the action selection, given the current observation. The goal of the agent is to find the optimal policy $\pi^*$ that maximizes its expected return $G = \sum_{t=0}^{T} \gamma^t r(s_t, a_t)$, where $\gamma$ is a discount factor and $T$ denotes the time horizon.
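As a quick illustration of the return defined above, the following sketch (ours, not from the paper) accumulates a finite-horizon discounted return from a list of rewards:

```python
def discounted_return(rewards, gamma=0.99):
    """Finite-horizon discounted return G = sum_{t=0}^{T} gamma^t * r_t.
    Simple illustration of the definition above; not from the paper's code."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# e.g. three steps of reward 1.0 with gamma = 0.99
assert abs(discounted_return([1.0, 1.0, 1.0]) - (1 + 0.99 + 0.99 ** 2)) < 1e-12
```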

A Markov game is a multi-agent extension of the MDP with partially observable environments. The definition of a standard delay-free Markov game is:

Definition 2. A Markov Game (MG) for $N$ agents is characterized by a tuple with
(1) a state space $\mathcal{S}$ describing all agents,
(2) a set of action spaces $\mathcal{A} = \{\mathcal{A}_1, \ldots, \mathcal{A}_N\}$,
(3) a set of observation spaces $\mathcal{O} = \{\mathcal{O}_1, \ldots, \mathcal{O}_N\}$,
(4) initial state distribution $\rho : \mathcal{S} \to \mathbb{R}$,
(5) transition distribution $p : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}$,
(6) reward function $r_i : \mathcal{S} \times \mathcal{A}_i \to \mathbb{R}$ for each agent $i$.

Each agent $i$ receives an individual observation from the state, $o_i : \mathcal{S} \to \mathcal{O}_i$, and uses a policy $\pi_i : \mathcal{O}_i \times \mathcal{A}_i \to \mathbb{R}$ to choose actions. The goal of each agent is to maximize its own expected return $G_i = \sum_{t=0}^{T} \gamma^t r_i(s_t, a^i_t)$, where $\gamma$ is a discount factor and $T$ denotes the time horizon.

B. Delay-Aware Markov Decision Process

The delay-free MDP is problematic with agent delays and could lead to arbitrarily suboptimal policies [27]. To retrieve the Markov property, the Delay-Aware MDP (DA-MDP) is proposed [31] by augmenting the state space $\mathcal{S}$ to $\boldsymbol{\mathcal{X}}$:

Definition 3. A Delay-Aware Markov Decision Process $\mathrm{DAMDP}(E, k) = (\boldsymbol{\mathcal{X}}, \boldsymbol{\mathcal{A}}, \boldsymbol{\rho}, \boldsymbol{p}, \boldsymbol{r})$ augments a Markov Decision Process $\mathrm{MDP}(E) = (\mathcal{S}, \mathcal{A}, \rho, p, r)$, such that
(1) augmented state space $\boldsymbol{\mathcal{X}} = \mathcal{S} \times \mathcal{A}^k$, where $k$ denotes the delay step,
(2) action space $\boldsymbol{\mathcal{A}} = \mathcal{A}$,
(3) initial state distribution
$$\boldsymbol{\rho}(\boldsymbol{x}_0) = \boldsymbol{\rho}(s_0, a_0, \ldots, a_{k-1}) = \rho(s_0) \prod_{i=0}^{k-1} \delta(a_i - c_i),$$
where $(c_i)_{i=0:k-1}$ denotes the initial action sequence (here $\delta$ is the Dirac delta function: if $y \sim \delta(\cdot - x)$, then $y = x$ with probability one),
(4) transition distribution
$$\begin{aligned} \boldsymbol{p}(\boldsymbol{x}_{t+1} | \boldsymbol{x}_t, \boldsymbol{a}_t) &= \boldsymbol{p}(s_{t+1}, a^{(t+1)}_{t+1}, \ldots, a^{(t+1)}_{t+k} \mid s_t, a^{(t)}_t, \ldots, a^{(t)}_{t+k-1}, \boldsymbol{a}_t) \\ &= p(s_{t+1} | s_t, a^{(t)}_t) \prod_{i=1}^{k-1} \delta(a^{(t+1)}_{t+i} - a^{(t)}_{t+i}) \, \delta(a^{(t+1)}_{t+k} - \boldsymbol{a}_t), \end{aligned}$$
(5) reward function
$$\boldsymbol{r}(\boldsymbol{x}_t, \boldsymbol{a}_t) = \boldsymbol{r}(s_t, a_t, \ldots, a_{t+k-1}, \boldsymbol{a}_t) = r(s_t, a_t).$$

The state vector $\boldsymbol{x}$ of a DA-MDP is augmented with the action sequence to be executed in the next $k$ steps, where $k \in \mathbb{N}$ is the delay duration. The superscript of $a^{(t_2)}_{t_1}$ means that the action is an element of $\boldsymbol{x}_{t_2}$, and the subscript represents the time at which the action is executed. $\boldsymbol{a}_t$ is the action taken at time $t$ in a DA-MDP but executed at time $t + k$ due to the $k$-step action delay, i.e., $\boldsymbol{a}_t = a_{t+k}$.

Policies interacting with DA-MDPs, which also need to be augmented since the dimension of the state vector has changed, are denoted by $\boldsymbol{\pi}$.

It should be noted that both action delay and observation delay could exist in real-world systems. However, it has been fully discussed and proved that, from the perspective of the learning agent, observation and action delays form the same mathematical problem, since they both lead to a mismatch between the current observation and the executed action [32]. For simplicity, we focus on action delay in this paper, and the algorithm and conclusions should generalize to systems with observation delays.

The above definition of the DA-MDP assumes that the delay time of the agent is an integer multiple of the time step of the system, which is usually not true for many real-world tasks like robotic control. For that, Schuitema et al. [33] proposed an approximation approach by assuming a virtual effective action at each discrete system time step, which achieves first-order equivalence in linearizable systems with arbitrary delay time. With this approximation, the above DA-MDP structure can be adapted to systems with arbitrary-value delays.
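To make the augmentation concrete, the sketch below wraps a delay-free environment into a DA-MDP-style environment with a $k$-step action buffer. It assumes an older Gym-style step/reset interface and flat continuous action vectors; it is an illustration of Definition 3, not the authors' released code.

```python
from collections import deque
import numpy as np

class DelayAwareWrapper:
    """Illustrative DA-MDP augmentation (Def. 3): the observation becomes
    x_t = (s_t, a_t, ..., a_{t+k-1}) and a newly chosen action is executed
    k steps later. Assumes a Gym-style env and k >= 1."""

    def __init__(self, env, k, init_actions):
        self.env = env                                   # delay-free environment E
        self.k = k                                       # delay step k
        self.init_actions = [np.asarray(a, dtype=np.float32) for a in init_actions]
        self.buffer = deque(maxlen=k)                    # action buffer (c_0, ..., c_{k-1})

    def _augment(self, s):
        return np.concatenate([np.asarray(s, dtype=np.float32)] + list(self.buffer))

    def reset(self):
        s0 = self.env.reset()
        self.buffer = deque(self.init_actions, maxlen=self.k)
        return self._augment(s0)                         # x_0 = (s_0, c_0, ..., c_{k-1})

    def step(self, action):
        a_exec = self.buffer.popleft()                   # action chosen k steps ago
        self.buffer.append(np.asarray(action, dtype=np.float32))
        s_next, reward, done, info = self.env.step(a_exec)
        return self._augment(s_next), reward, done, info
```

With such a wrapper, a standard single-agent DRL algorithm can be trained on the augmented states, which is the idea that Section III lifts to the multi-agent case.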

C. Multi-Agent Deep Deterministic Policy Gradient

Reinforcement learning has been used to solve Markov games. The simplest way is to directly train each agent with a single-agent reinforcement learning algorithm. However, this approach introduces the non-stationarity issue, since the learning agent is not aware of the evolution of the other agents that are treated as part of the environment, which violates the Markov property required for the convergence of most reinforcement learning algorithms [34], [26]. To alleviate the non-stationarity issue introduced by the multi-agent setting, several approaches have been proposed [35]. Centralized training with decentralized execution is one of the most widely used paradigms for multi-agent reinforcement learning. Lowe et al. [36] utilized this paradigm and proposed the multi-agent deep deterministic policy gradient (MADDPG) algorithm. The core idea of MADDPG is to learn a centralized action-value function (critic) and a decentralized policy (actor) for each agent. The centralized critic conditions on global information to alleviate the non-stationarity problem, while the decentralized actor conditions only on private observations to avoid the need for a centralized controller during execution.

A brief description of MADDPG is as follows. In a game with $N$ agents, let $\boldsymbol{\mu} = \{\mu_1, \ldots, \mu_N\}$ be the set of all agent policies parameterized by $\boldsymbol{\theta} = \{\theta_1, \ldots, \theta_N\}$, respectively. Based on the deterministic policy gradient (DPG) algorithm [37], we can write the gradient of the objective function $J(\theta_i) = \mathbb{E}[G_i]$ for agent $i$ as:

$$\nabla_{\theta_i} J(\theta_i) = \mathbb{E}_{x, \boldsymbol{a} \sim \mathcal{D}} \big[ \nabla_{\theta_i} \mu_i(a_i | o_i) \, \nabla_{a_i} Q^{\boldsymbol{\mu}}_i(x, a_1, \ldots, a_N) \big|_{a_i = \mu_i(o_i)} \big]. \tag{1}$$

In Eq. 1, $Q^{\boldsymbol{\mu}}_i(x, a_1, \ldots, a_N)$ is the centralized Q function (critic) for agent $i$ that conditions on global information, including the global state representation $x$ and the actions of all agents $(a_1, \ldots, a_N)$. Under this setting, agents can have different reward functions since each $Q^{\boldsymbol{\mu}}_i$ is learned separately, which means this algorithm can be used in both cooperative and competitive tasks.

Based on deep Q-learning [1], the centralized Q function $Q^{\boldsymbol{\mu}}_i$ for agent $i$ is updated as:

$$L(\theta_i) = \mathbb{E}_{x, \boldsymbol{a}, \boldsymbol{r}, x'} \big[ (Q^{\boldsymbol{\mu}}_i(x, a_1, \ldots, a_N) - y)^2 \big], \quad \text{where } y = r_i + \gamma \, Q^{\boldsymbol{\mu}'}_i(x', a'_1, \ldots, a'_N) \big|_{a'_j = \mu'_j(o_j)}.$$

Here, $\boldsymbol{\mu}' = \{\mu_{\theta'_1}, \ldots, \mu_{\theta'_N}\}$ is the set of target policies with soft-updated parameters $\theta'_i$ used to stabilize training [1].
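For concreteness, a minimal PyTorch-style sketch of the centralized critic update above is given below. The module and tensor names (critic_i, target_policies, the shapes of x and joint_actions) are our own assumptions for illustration, not the API of the MADDPG reference implementation.

```python
import torch
import torch.nn.functional as F

def centralized_critic_loss(critic_i, target_critic_i, target_policies,
                            reward_i, gamma, x, joint_actions, x_next, next_obs):
    """Sketch of L(theta_i) = E[(Q_i(x, a_1..a_N) - y)^2] with
    y = r_i + gamma * Q'_i(x', a'_1..a'_N), a'_j = mu'_j(o_j).
    Assumed shapes: x, x_next [B, state_dim]; joint_actions [B, N*act_dim];
    next_obs is a list of N tensors [B, obs_dim_j]."""
    with torch.no_grad():
        next_actions = torch.cat(
            [mu_j(o_j) for mu_j, o_j in zip(target_policies, next_obs)], dim=-1)
        y = reward_i + gamma * target_critic_i(x_next, next_actions)
    q = critic_i(x, joint_actions)
    return F.mse_loss(q, y)
```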

III. DELAY-AWARE MARKOV GAME

Ignoring delays violates the Markov property in multi-agent scenarios and could lead to arbitrarily suboptimal policies. To retrieve the Markov property, we formally define the Delay-Aware Markov Game (DA-MG) as below:

Definition 4. A Delay-Aware Markov Game with $N$ agents $\mathrm{DAMG}(E, \boldsymbol{k}) = (\boldsymbol{\mathcal{X}}, \boldsymbol{\mathcal{A}}, \boldsymbol{\mathcal{O}}, \boldsymbol{\rho}, \boldsymbol{p}, \boldsymbol{r})$ augments a Markov Game $\mathrm{MG}(E) = (\mathcal{S}, \mathcal{A}, \mathcal{O}, \rho, p, r)$, such that
(1) augmented state space $\boldsymbol{\mathcal{X}} = \mathcal{S} \times \mathcal{A}_1^{k_1} \times \cdots \times \mathcal{A}_N^{k_N}$, where $k_i$ denotes the delay step of agent $i$,
(2) action space $\boldsymbol{\mathcal{A}} = \mathcal{A}$,
(3) initial state distribution
$$\boldsymbol{\rho}(\boldsymbol{x}_0) = \boldsymbol{\rho}(s_0, a^1_0, \ldots, a^1_{k_1-1}, \ldots, a^N_0, \ldots, a^N_{k_N-1}) = \rho(s_0) \prod_{i=1}^{N} \prod_{j=0}^{k_i-1} \delta(a^i_j - c^i_j),$$
where $(c^i_j)_{i=1:N,\, j=0:k_i-1}$ denotes the initial action sequences of all agents,
(4) transition distribution
$$\begin{aligned} \boldsymbol{p}(\boldsymbol{x}_{t+1} | \boldsymbol{x}_t, \boldsymbol{a}_t) &= \boldsymbol{p}(s_{t+1}, a^{1,(t+1)}_{t+1}, \ldots, a^{1,(t+1)}_{t+k_1}, \ldots, a^{N,(t+1)}_{t+1}, \ldots, a^{N,(t+1)}_{t+k_N} \mid s_t, a^{1,(t)}_t, \ldots, a^{1,(t)}_{t+k_1-1}, \ldots, a^{N,(t)}_t, \ldots, a^{N,(t)}_{t+k_N-1}, \boldsymbol{a}_t) \\ &= p(s_{t+1} | s_t, a^{1,(t)}_t, \ldots, a^{N,(t)}_t) \prod_{i=1}^{N} \prod_{j=1}^{k_i-1} \delta(a^{i,(t+1)}_{t+j} - a^{i,(t)}_{t+j}) \prod_{i=1}^{N} \delta(a^{i,(t+1)}_{t+k_i} - \boldsymbol{a}^i_t), \end{aligned}$$
(5) reward function
$$\boldsymbol{r}_i(\boldsymbol{x}_t, \boldsymbol{a}_t) = r_i(s_t, a^i_t)$$
for each agent $i$.

DA-MGs have an augmented state space $\mathcal{S} \times \mathcal{A}_1^{k_1} \times \cdots \times \mathcal{A}_N^{k_N}$, where $k_i$ denotes the delay step of agent $i$. $a^{i,(t_2)}_{t_1}$ is an element of $\boldsymbol{x}_{t_2}$ and denotes the action of agent $i$ executed at time $t_1$. $\boldsymbol{a}_t$ is the action vector taken at time $t$ in a DA-MG; its $i$-th element $\boldsymbol{a}^i_t$ is executed by agent $i$ at time $t + k_i$ due to the $k_i$-step action delay, i.e., $\boldsymbol{a}^i_t = a^i_{t+k_i}$. Policies interacting with DA-MGs, which also need to be augmented since the dimension of the state vector has changed, are denoted by $\boldsymbol{\pi}$.

To prove the solidity of Definition 4, we need to show that a Markov game with multi-step action delays can be converted to a regular Markov game by state augmentation (DA-MG). We prove the equivalence of the two by comparing their corresponding Markov Reward Processes (MRPs). The delay-free MRP for a Markov Game is:

Definition 5. A Markov Reward Process $(\mathcal{S}, \rho, \kappa, r) = \mathrm{MRP}(\mathrm{MG}(E), \boldsymbol{\pi})$ can be derived from a Markov Game $\mathrm{MG}(E) = (\mathcal{S}, \mathcal{A}, \mathcal{O}, \rho, p, r)$ with a set of policies $\boldsymbol{\pi} = \{\pi_1, \ldots, \pi_N\}$, such that
$$\kappa(s_{t+1} | s_t) = \int_{\mathcal{A}} p(s_{t+1} | s_t, a^1_t, \ldots, a^N_t) \prod_{i=1}^{N} \big[ \pi_i(a^i_t | o^i_t) \, da^i_t \big],$$
$$r_i(s_t) = \int_{\mathcal{A}_i} r_i(s_t, a^i_t) \, \pi_i(a^i_t | o^i_t) \, da^i_t,$$
for each agent $i$. $\kappa$ is the state transition distribution and $r$ is the state reward function of the MRP. $E$ is the original environment without delays.

In the delay-free framework, at each time step, the agents select actions based on their current observations. The actions are immediately executed in the environment to generate the next observations. However, if action delay exists, the interaction manner between the environment and the agents changes, and a different MRP is generated. An illustration of the delayed interaction between agents and the environment is shown in Fig. 1. The agents interact with the environment not directly but through an action buffer.

Based on the delayed interaction manner between the agents and the environment, the Delay-Aware MRP (DA-MRP) is defined as below.


Fig. 1: Interaction manner between delayed agents and the environment. The agents interact with the environment not directly but through an action buffer. At time $t$, the agents get the observation $o_t$ from the environment as well as a future action sequence $(a_t, \ldots, a_{t+k-1})$ from the action buffer. The agents then decide their future actions $a_{t+k}$ and store them in the action buffer. The action buffer then pops the actions $a_t$ to be executed in the environment.
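The action buffer of Fig. 1 can be sketched as below for $N$ agents with per-agent delay steps $k_1, \ldots, k_N$ (assuming $k_i \geq 1$); this is an illustration of the interaction manner, not the released implementation.

```python
from collections import deque

class MultiAgentActionBuffer:
    """Action buffer F of Fig. 1: holds, for each agent i, the k_i actions
    that have been decided but not yet executed."""

    def __init__(self, init_action_sequences):
        # init_action_sequences[i] = (c^i_0, ..., c^i_{k_i - 1}), with k_i >= 1
        self.queues = [deque(seq, maxlen=len(seq)) for seq in init_action_sequences]

    def planned_actions(self, i):
        # o^i_act: agent i's planned action sequence (a^i_t, ..., a^i_{t+k_i-1})
        return list(self.queues[i])

    def push_and_pop(self, new_actions):
        # store each agent's newly decided action a^i_{t+k_i} and pop the
        # action a^i_t that is executed in the environment at this step
        executed = []
        for q, a in zip(self.queues, new_actions):
            executed.append(q.popleft())
            q.append(a)
        return executed
```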

Definition 6. A Delay-Aware Markov Reward Process with $N$ agents $(\boldsymbol{\mathcal{X}}, \boldsymbol{\rho}, \boldsymbol{\kappa}, \boldsymbol{r}) = \mathrm{DAMRP}(\mathrm{MG}(E), \boldsymbol{\pi}, \boldsymbol{k})$ can be derived from a Markov Game $\mathrm{MG}(E) = (\mathcal{S}, \mathcal{A}, \mathcal{O}, \rho, p, r)$ with a set of policies $\boldsymbol{\pi} = \{\boldsymbol{\pi}_1, \ldots, \boldsymbol{\pi}_N\}$ and a set of delay steps $\boldsymbol{k} = \{k_1, \ldots, k_N\}$, such that
(1) augmented state space
$$\boldsymbol{\mathcal{X}} = \mathcal{S} \times \mathcal{A}_1^{k_1} \times \cdots \times \mathcal{A}_N^{k_N},$$
(2) initial state distribution
$$\boldsymbol{\rho}(\boldsymbol{x}_0) = \boldsymbol{\rho}(s_0, a^1_0, \ldots, a^1_{k_1-1}, \ldots, a^N_0, \ldots, a^N_{k_N-1}) = \rho(s_0) \prod_{i=1}^{N} \prod_{j=0}^{k_i-1} \delta(a^i_j - c^i_j),$$
where $(c^i_j)_{i=1:N,\, j=0:k_i-1}$ denotes the initial action sequences of all agents,
(3) state transition distribution
$$\begin{aligned} \boldsymbol{\kappa}(\boldsymbol{x}_{t+1} | \boldsymbol{x}_t) &= \boldsymbol{\kappa}(s_{t+1}, a^{1,(t+1)}_{t+1}, \ldots, a^{1,(t+1)}_{t+k_1}, \ldots, a^{N,(t+1)}_{t+1}, \ldots, a^{N,(t+1)}_{t+k_N} \mid s_t, a^{1,(t)}_t, \ldots, a^{1,(t)}_{t+k_1-1}, \ldots, a^{N,(t)}_t, \ldots, a^{N,(t)}_{t+k_N-1}) \\ &= p(s_{t+1} | s_t, a^{1,(t)}_t, \ldots, a^{N,(t)}_t) \prod_{i=1}^{N} \prod_{j=1}^{k_i-1} \delta(a^{i,(t+1)}_{t+j} - a^{i,(t)}_{t+j}) \prod_{i=1}^{N} \boldsymbol{\pi}_i(a^{i,(t+1)}_{t+k_i} | \boldsymbol{o}^i_t), \end{aligned}$$
(4) state-reward function
$$\boldsymbol{r}_i(\boldsymbol{x}_t) = r_i(s_t, a^i_t)$$
for each agent $i$.

The input of the policy of agent $i$ at time $t$ has two parts: $\boldsymbol{o}^i_t = (o^i_{t,\mathrm{obs}}, o^i_{t,\mathrm{act}})$, where $o^i_{t,\mathrm{obs}}$ is the observation of the environment and $o^i_{t,\mathrm{act}}$ is a planned action sequence for agent $i$ of length $k_i$ that will be executed from the current time step: $o^i_{t,\mathrm{act}} = (a^i_t, \ldots, a^i_{t+k_i-1})$.

With Def. 2-6, we are ready to prove that the DA-MG is a correct augmentation of an MG with delays, as stated in Theorem 1.

Theorem 1. A set of policies $\boldsymbol{\pi} : \mathcal{A} \times \boldsymbol{\mathcal{X}} \to [0, 1]$ interacting with $\mathrm{DAMG}(E, \boldsymbol{k})$ in the delay-free manner produces the same Markov Reward Process as $\boldsymbol{\pi}$ interacting with $\mathrm{MG}(E)$ with $\boldsymbol{k}$ action delays for the agents, i.e.,
$$\mathrm{DAMRP}(\mathrm{MG}(E), \boldsymbol{\pi}, \boldsymbol{k}) = \mathrm{MRP}(\mathrm{DAMG}(E, \boldsymbol{k}), \boldsymbol{\pi}). \tag{2}$$

The proof is provided in the Appendix.

IV. DELAY-AWARE MULTI-AGENT REINFORCEMENT LEARNING

Theorem 1 shows that instead of solving MGs with delays, we can alternatively solve the corresponding DA-MGs directly with DRL. Based on this finding, we propose the framework of Delay-Aware Multi-Agent Reinforcement Learning (DAMARL). To alleviate the non-stationarity issue introduced by the multi-agent setting, we adopt the paradigm of centralized training with decentralized execution: for each agent, we learn a centralized Q function (critic) that conditions on global information and a decentralized policy (actor) that only needs partial observation. An illustration of the framework is shown in Fig. 2. The main advantages of this structure are as follows:

• The non-stationarity problem is alleviated by centralized training, since the transition distribution of the environment is stationary when all agent actions are known.

• A centralized controller to direct all agents, which is not realistic to deploy in many real-world multi-agent scenarios, is not needed with decentralized policies.

• We learn an individual Q function for each agent, allowing agents to have different reward functions, so the algorithm can be adopted in both cooperative and competitive multi-agent tasks.

• Individual Q functions and policies allow agents to have different delay steps.

With the framework of DAMARL, we can adapt any DRL algorithm with the actor-critic structure [26] into a delay-aware algorithm, such as Advantage Actor-Critic (A2C) [38], Deep Deterministic Policy Gradient (DDPG) [37], and Soft Actor-Critic (SAC) [39]. In this paper, we update the multi-agent version of DDPG with delay-awareness and propose delay-aware multi-agent deep deterministic policy gradient (DAMA-DDPG). Concretely, in a game with $N$ agents, let $\boldsymbol{\mu} = \{\mu_1, \ldots, \mu_N\}$ be the set of all agent policies parameterized by $\boldsymbol{\theta} = \{\theta_1, \ldots, \theta_N\}$, respectively. Based on the deterministic policy gradient (DPG) algorithm [37], we can write the gradient of the objective function $J(\theta_i) = \mathbb{E}[G_i]$ for agent $i$ as:

$$\nabla_{\theta_i} J(\boldsymbol{\mu}_i) = \mathbb{E}_{\boldsymbol{x}, \boldsymbol{a} \sim \mathcal{D}} \big[ \nabla_{\theta_i} \boldsymbol{\mu}_i(\boldsymbol{a}_i | \boldsymbol{o}^i) \, \nabla_{\boldsymbol{a}_i} Q^{\boldsymbol{\mu}}_i(\boldsymbol{x}, \boldsymbol{a}_1, \ldots, \boldsymbol{a}_N) \big|_{\boldsymbol{a}_i = \boldsymbol{\mu}_i(\boldsymbol{o}^i)} \big]. \tag{3}$$

The structure of Eq. 3 is in conformity with the original deterministic policy gradient (Eq. 1). However, the policies $\boldsymbol{\mu}$, states $\boldsymbol{x}$, and observations $\boldsymbol{o}$ are augmented based on the DA-MG proposed in Def. 4. In Eq. 3, $Q^{\boldsymbol{\mu}}_i(\boldsymbol{x}, \boldsymbol{a}_1, \ldots, \boldsymbol{a}_N)$ is the centralized Q function (critic) for agent $i$ that conditions on the global information, including the global state representation $\boldsymbol{x}$ and the actions of all agents $(\boldsymbol{a}_1, \ldots, \boldsymbol{a}_N)$. In the delay-aware case, $\boldsymbol{x}$ could consist of the observations of all agents as well as the action sequences of all agents in the near future, $\boldsymbol{x} = (\boldsymbol{o}^1, \ldots, \boldsymbol{o}^N)$, where $\boldsymbol{o}^i$ is the input of the policy $\boldsymbol{\mu}_i$ of agent $i$ and has two parts: $\boldsymbol{o}^i = (o^i_{\mathrm{obs}}, o^i_{\mathrm{act}})$. Here, $o^i_{\mathrm{obs}}$ is the observation of the environment by the $i$-th agent, and $o^i_{\mathrm{act}}$ is a planned action sequence for agent $i$ of length $k_i$ that will be executed from the current time step; for example, at time $t$, $o^i_{t,\mathrm{act}} = a^i_{t:t+k_i-1}$. The $o^i_{\mathrm{act}}$ is fetched from an action buffer that serves as a bridge between the agents and the environment, as shown in Fig. 1.

Fig. 2: The framework of Delay-Aware Multi-Agent Reinforcement Learning (DAMARL). We adopt the paradigm of centralized training with decentralized execution: for each agent $i$, we learn a centralized action-value function $Q_i$ (critic) which conditions on global information and a decentralized policy $\mu_i$ (actor) that only needs partial observation. For each agent, the input of the agent policy has two parts: $o = (o_{\mathrm{obs}}, o_{\mathrm{act}})$, where $o_{\mathrm{obs}}$ is the observation of the environment and $o_{\mathrm{act}}$ is a planned action sequence that will be executed from the current time step.
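A small sketch of how the augmented inputs could be assembled from the action buffer follows (a hypothetical helper building on the MultiAgentActionBuffer sketched after Fig. 1; the names and the flat concatenation are our assumptions, not the released code):

```python
import numpy as np

def build_augmented_inputs(obs_list, action_buffer):
    """Assemble o^i = (o^i_obs, o^i_act) for every agent and the global
    critic input x = (o^1, ..., o^N)."""
    agent_inputs = []
    for i, o_obs in enumerate(obs_list):
        o_act = np.concatenate(action_buffer.planned_actions(i))   # (a^i_t, ..., a^i_{t+k_i-1})
        agent_inputs.append(np.concatenate([o_obs, o_act]))         # o^i = (o^i_obs, o^i_act)
    x = np.concatenate(agent_inputs)                                # global state representation
    return agent_inputs, x
```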

The replay buffer $\mathcal{D}$ is used to record the historical experiences of all agents. The centralized Q function $Q^{\boldsymbol{\mu}}_i$ for agent $i$ is updated as:

$$L(\theta_i) = \mathbb{E}_{\boldsymbol{x}, \boldsymbol{a}, \boldsymbol{r}, \boldsymbol{x}'} \big[ (Q^{\boldsymbol{\mu}}_i(\boldsymbol{x}, \boldsymbol{a}_1, \ldots, \boldsymbol{a}_N) - y)^2 \big], \quad \text{where } y = r_i + \gamma \, Q^{\boldsymbol{\mu}'}_i(\boldsymbol{x}', \boldsymbol{a}'_1, \ldots, \boldsymbol{a}'_N) \big|_{\boldsymbol{a}'_j = \boldsymbol{\mu}'_j(\boldsymbol{o}^j)}.$$

Here, $\boldsymbol{\mu}' = \{\boldsymbol{\mu}_{\theta'_1}, \ldots, \boldsymbol{\mu}_{\theta'_N}\}$ is the set of augmented target policies with soft-updated parameters $\theta'_i$ used to stabilize training.
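The decentralized actor update of Eq. 3 can be sketched in the same PyTorch style as the critic loss shown earlier (illustrative names and shapes, not the released code): the stored joint action is re-evaluated with agent i's differentiable policy output before querying the centralized critic.

```python
import torch  # tensors below are assumed to be torch.Tensor

def decentralized_actor_loss(policy_i, critic_i, x, joint_actions, obs_i,
                             agent_index, act_dim):
    """Minimize -Q_i(x, a_1, ..., mu_i(o^i), ..., a_N), i.e. ascend Eq. 3."""
    a_i = policy_i(obs_i)                                  # a_i = mu_i(o^i), keeps the autograd graph
    joint = joint_actions.clone()
    start = agent_index * act_dim
    joint[:, start:start + act_dim] = a_i                  # substitute agent i's action
    return -critic_i(x, joint).mean()                      # gradient ascent on Q
```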

The description of the full algorithm is shown in Algorithm 1.

V. EXPERIMENT

To show the performance of DAMARL, we adopt two environment platforms. One is the multi-agent particle environment platform proposed in [36], where the agents are particles that move on a two-dimensional plane to achieve cooperative or competitive tasks. The other is adapted from the traffic platform Highway [40], where we construct an unsignalized intersection scenario that requires coordination of all road users. Implementation details and demo videos are provided in our GitHub repository. Important hyper-parameters are shown in Table I.

Multi-agent particle environment: https://github.com/openai/multiagent-particle-envs
Code and demo videos: https://github.com/baimingc/delay-aware-MARL

Algorithm 1 DAMA-DDPG

Initialize the experience replay buffer $\mathcal{D}$
for episode = 1 to $M$ do
    Initialize the action noise $\mathcal{N}_t$ and the action buffer $\mathcal{F}$
    Get the initial state $\boldsymbol{x}_0$
    for $t$ = 1 to $T$ do
        for agent $i$ = 1 to $N$ do
            Get $\boldsymbol{o}^i = (o^i_{\mathrm{obs}}, o^i_{\mathrm{act}})$ from the environment and $\mathcal{F}$
            Select action $\boldsymbol{a}^i = \boldsymbol{\mu}_{\theta_i}(\boldsymbol{o}^i) + \mathcal{N}_t$
        end for
        Store the actions $\boldsymbol{a} = (\boldsymbol{a}^1, \ldots, \boldsymbol{a}^N)$ in $\mathcal{F}$
        Pop $a = (a_1, \ldots, a_N)$ from $\mathcal{F}$ and execute it
        Get the reward $\boldsymbol{r}$ and the new state $\boldsymbol{x}'$
        Store $(\boldsymbol{x}, \boldsymbol{a}, \boldsymbol{r}, \boldsymbol{x}')$ in $\mathcal{D}$
        $\boldsymbol{x} \leftarrow \boldsymbol{x}'$
        for agent $i$ = 1 to $N$ do
            Randomly sample a batch of $B$ samples $(\boldsymbol{x}^b, \boldsymbol{a}^b, \boldsymbol{r}^b, \boldsymbol{x}'^b)$ from $\mathcal{D}$
            Set $y^b = r^b_i + \gamma \, Q^{\boldsymbol{\mu}'}_i(\boldsymbol{x}'^b, \boldsymbol{a}'_1, \ldots, \boldsymbol{a}'_N) \big|_{\boldsymbol{a}'_l = \boldsymbol{\mu}'_l(\boldsymbol{o}^b_l)}$
            Update the centralized critic with loss $L(\theta_i) = \frac{1}{B} \sum_b \big( y^b - Q^{\boldsymbol{\mu}}_i(\boldsymbol{x}^b, \boldsymbol{a}^b_1, \ldots, \boldsymbol{a}^b_N) \big)^2$
            Update the decentralized actor by $\nabla_{\theta_i} J \approx \frac{1}{B} \sum_b \nabla_{\theta_i} \boldsymbol{\mu}_i(\boldsymbol{o}^b_i) \, \nabla_{\boldsymbol{a}_i} Q^{\boldsymbol{\mu}}_i(\boldsymbol{x}^b, \boldsymbol{a}^b_1, \ldots, \boldsymbol{a}^b_N) \big|_{\boldsymbol{a}_i = \boldsymbol{\mu}_i(\boldsymbol{o}^b_i)}$
        end for
        Soft update of the target networks for each agent $i$: $\theta'_i \leftarrow \kappa \theta_i + (1 - \kappa) \theta'_i$
    end for
end for

A. Multi-Agent Particle Environment

The multi-agent particle environment is composed of several agents and landmarks in a two-dimensional world with a continuous state space. Agents can move in the environment and send out messages that can be broadcast to other agents. Some tasks are cooperative, where all agents share one mutual reward function, while others are competitive or mixed, where agents have inverse or different reward functions. In some tasks, agents need to communicate to achieve the goal, while in other tasks agents can only perform movements in the two-dimensional plane. We provide details of the environments used below. An illustration of the tasks is shown in Fig. 3.

Fig. 3: Tasks in the multi-agent particle environment: (a) cooperative communication, (b) cooperative navigation, (c) predator-prey.

TABLE I: Hyper-parameters

Description                   Value
learning rate                 0.01
discount factor γ             0.99
soft update coefficient κ     0.01
replay buffer size            10^6
batch size B                  1024
episode length T              25

Cooperative communication. Two cooperating agents are involved in this task, a speaker and a listener. They are spawned in an environment with three landmarks of different colors. The goal of the listener is to navigate to a landmark of a particular color. However, the listener can only observe the relative position and color of the landmarks, but not which landmark it must navigate to. On the contrary, the speaker knows the color of the goal landmark, and it can send out a message at each time step which is heard by the listener. Therefore, to finish the cooperative task, the speaker and the listener must learn to communicate so that the listener can understand the message from the speaker and navigate to the landmark with the correct color.

Cooperative navigation. In this environment, three agents must collaborate to 'cover' all three landmarks in the environment by movement. In addition, these agents occupy a large physical space and are punished when they collide with each other. Agents can observe the position and speed of all agents as well as the position of all landmarks.

Predator-prey. In this environment, three slower cooperative predators must catch up with a faster prey in a randomly generated environment, with two large landmarks blocking the way. Each time a collaborating predator collides with the prey, the predators are rewarded and the prey is punished. The agents can observe the relative position and speed of the other agents as well as the positions of the landmarks.

1) Effect of Delay-Awareness: To show the effect of delay-awareness, we first test our algorithm on cooperative tasks including cooperative communication, cooperative navigation, and predator-prey, where we adopt a fixed prey policy and only train the cooperative predators. To support the discrete actions used for message communication in the cooperative communication task, we use the Gumbel-Softmax estimator [41]. Unless otherwise specified, our policies and Q functions are parameterized by two-layer neural networks with 128 units per layer. Each experiment is run with 5 random seeds. The baseline algorithms are DDPG and MA-DDPG, which use decentralized and centralized training, respectively, without delay-awareness. We test the proposed delay-aware algorithm DAMA-DDPG (Algorithm 1). We also adapt DAMA-DDPG into Delay-Aware DDPG (DA-DDPG), which uses decentralized training, and test it for comparison.
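The Gumbel-Softmax estimator [41] mentioned above can be sketched as follows; this is a generic straight-through implementation (PyTorch also provides torch.nn.functional.gumbel_softmax), not necessarily the exact variant used in the experiments.

```python
import torch
import torch.nn.functional as F

def gumbel_softmax_sample(logits, tau=1.0, hard=True, eps=1e-20):
    """Differentiable, approximately one-hot sample for discrete
    communication actions."""
    gumbel = -torch.log(-torch.log(torch.rand_like(logits) + eps) + eps)
    y_soft = F.softmax((logits + gumbel) / tau, dim=-1)
    if hard:
        index = y_soft.argmax(dim=-1, keepdim=True)
        y_hard = torch.zeros_like(y_soft).scatter_(-1, index, 1.0)
        # straight-through: one-hot in the forward pass, soft gradients backward
        return y_hard - y_soft.detach() + y_soft
    return y_soft
```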

For simplicity, we omit '-DDPG' when referring to an algorithm throughout the experiment part, since our framework can be adapted to any DRL algorithm with the actor-critic structure. For example, 'DAMA-DDPG' is shortened to 'DAMA'.

The performance of the aforementioned algorithms in the cooperative tasks, indicated by episodic reward, is shown in Fig. 4. We use Δt to denote the simulation timestep, which is 0.1 seconds for cooperative communication and 0.2 seconds for cooperative navigation and predator-prey. The agents have a 1-step action delay in all tasks (ki = 1 for i = 1, ..., N). We will change the simulation timestep as well as the agent delay time in the later experiments. It is clearly shown that DAMA, with delay-awareness and centralized training, outperforms the other algorithms in all three tasks, while the vanilla DDPG has the worst performance. The result of cooperative communication (Fig. 4a) shows the importance of centralized training. In this task, the action of the speaker significantly affects the behavior of the listener by setting the goal, so a centralized Q function that conditions on all agent actions greatly stabilizes training. The advantage of delay-awareness is more significant in the highly dynamic task, predator-prey, where a prey runs fast to escape from the agents (Fig. 4c).

Fig. 4: Effect of delay-awareness in (a) cooperative communication (Δt = 0.1 s), (b) cooperative navigation (Δt = 0.2 s), and (c) predator-prey with a fixed prey policy (Δt = 0.2 s). DAMA is the proposed algorithm that utilizes delay-awareness as well as multi-agent centralized training. It is clearly shown that DAMA outperforms the other algorithms in all three tasks, while the vanilla DDPG has the worst performance.

We also test our algorithm in a competitive task: predator-prey. To compare performance, we train agents and adversaries with different algorithms and let them compete with each other. The simulation timestep Δt is set to 0.2 seconds. The delay times of the agents are 0.2 s, 0.4 s, and 0.6 s in each set of experiments. We evaluate the performance of the aforementioned algorithms by the number of prey touches by the predators per episode. Since the goal of the predators is to touch the prey as many times as possible, a higher number of touches indicates a stronger predator policy against a weaker prey policy. The results on the predator-prey task are shown in Table II. All agents are trained for 30,000 episodes. It is clearly shown that DAMA has the best performance against the other algorithms: for any delay time, the DAMA predator gets the highest touch number against the DDPG prey, while the DDPG predator gets the lowest touch number against the DAMA prey. Also, as the delay time increases, delay-awareness becomes more important than multi-agent centralized training, as shown in the last columns of Table II. When the delay time is relatively small (0.2 s), the improvement of predator policies from multi-agent centralized training is larger than that from delay-awareness. When the delay time grows to 0.6 s, however, the situation is reversed and delay-awareness has a larger impact on the improvement of the predator policies.

TABLE II: Number of touches in predator-prey (Δt = 0.2 s)

Delay     Algorithm    Algorithm of predators                                    Improvement of predators
time (s)  of prey      DAMA         MA           DA           DDPG               DAMA   MA    DA
0.2       DAMA         10.3 ± 2.1   9.6 ± 2.0    8.5 ± 1.9    8.1 ± 1.8          2.2    1.5   0.4
          MA           12.1 ± 2.3   10.1 ± 2.1   9.0 ± 2.1    8.5 ± 1.8          3.6    1.6   0.5
          DA           15.8 ± 2.9   13.2 ± 2.7   9.7 ± 2.0    8.8 ± 1.8          7.0    4.4   0.9
          DDPG         17.0 ± 3.2   14.9 ± 2.9   11.4 ± 2.2   9.1 ± 1.9          7.9    5.8   2.3
0.4       DAMA         10.1 ± 1.9   8.6 ± 1.7    8.2 ± 1.7    7.3 ± 1.6          2.8    1.3   0.9
          MA           14.2 ± 2.8   9.5 ± 2.0    9.0 ± 1.8    7.9 ± 1.8          6.3    1.6   1.1
          DA           14.9 ± 2.9   10.7 ± 2.2   9.5 ± 1.9    8.2 ± 1.8          6.7    2.5   1.3
          DDPG         17.6 ± 3.3   14.5 ± 2.9   13.1 ± 2.5   8.8 ± 1.9          8.8    5.7   4.3
0.6       DAMA         9.6 ± 1.9    6.9 ± 1.7    7.6 ± 1.7    5.9 ± 1.6          3.7    1.0   1.7
          MA           16.0 ± 3.1   8.8 ± 2.0    12.5 ± 2.4   7.4 ± 1.7          7.6    1.4   5.1
          DA           13.7 ± 2.9   7.5 ± 1.7    9.3 ± 1.9    6.2 ± 1.6          7.5    1.3   3.1
          DDPG         19.2 ± 3.8   13.5 ± 2.8   16.8 ± 3.4   8.3 ± 1.8          10.9   5.2   8.5


Fig. 5: Performance of the delay-aware (DAMA) and delay-unaware (MA) algorithms in the cooperative communication and cooperative navigation scenarios with different agent delay times. As the delay time increases, both the DAMA and MA algorithms show degraded performance. In most cases, the DAMA algorithm maintains higher performance than the MA algorithm. The performance gap gets more significant as the delay time increases.

2) Delay Sensitivity: Delay sensitivity is another important metric to evaluate reinforcement learning algorithms. Delay-aware algorithms may experience performance degradation due to state-space expansion. To show the value of delay-awareness, we compare the performance of the aforementioned delay-aware (DAMA) and delay-unaware (MA) algorithms with different agent delay times. We perform experiments in the cooperative communication and cooperative navigation scenarios. Results are shown in Fig. 5. The agent delay step is ki = 0, 1, ..., 10 in each sub-figure. The simulation timestep Δt is 0.1 seconds in Figs. 5a and 5c and 0.2 seconds in Figs. 5b and 5d. It is shown that as the delay time increases, both the delay-aware and delay-unaware algorithms show degraded performance. In most cases, the delay-aware algorithm maintains higher performance than the delay-unaware algorithm, and the performance gap gets more significant as the delay time increases.

The only exception is in Fig. 5c when the delay time is less than 0.2 seconds: there, the performance of the delay-unaware algorithm is slightly better than that of the delay-aware algorithm. The primary reason is that when the delay is small, the effect of the augmented state space on training is more severe than the model error introduced by delay-unawareness.

B. Traffic Environment

To show the practical value of delay-awareness, we construct an unsignalized intersection scenario that requires coordination of autonomous vehicles, as shown in Fig. 7. This scenario consists of four vehicles coming from the four directions (north, south, west, east) of the intersection, respectively. The goal of the vehicles is to take a left turn at the intersection. Vehicles are spawned at a random distance ($d_i \sim \mathcal{N}(\mu = 50\,\mathrm{m}, \sigma = 10\,\mathrm{m})$) from the intersection center with an initial velocity ($v^i_0 = 10$ m/s). Vehicles can observe the position and velocity of the other vehicles as well as their own. The routes are pre-defined, and the vehicles decide their longitudinal acceleration based on their policies. The intersection is unsignalized, so the vehicles need to coordinate to decide the sequence of passing. Vehicles are positively rewarded if all of them successfully finish the left turn and penalized if any collision happens.

Fig. 6: Typical outcomes of the unsignalized intersection scenario. In (a), the vehicles cooperated to pass the intersection in pairs, since vehicles from opposite directions have non-intersecting routes; in (b) and (c), the vehicles failed to pass the intersection safely within the time limit.

Fig. 7: Unsignalized intersection scenario. The goal of the vehicles is to take a left turn at the intersection.

In the traffic scenarios, the delay of a vehicle mainly includes: communication delay τ1, sensor delay τ2, time for decision making τ3, and actuator delay τ4. Since it has been proved that observation and action delays form the same mathematical problem [32], we approximately assume that the total delay time is the sum of all components: τ = τ1 + τ2 + τ3 + τ4. The literature shows that the communication delay τ1 of vehicle-to-vehicle (V2V) systems with dedicated short-range communication (DSRC) devices can be minimal, with a mean value of 1.1 ms [42], [43] under good conditions; the delay of sensors (cameras, LIDARs, radars, GPS, etc.) τ2 is usually between 0.1 and 0.3 seconds [19]; the time for decision making τ3 depends on the complexity of the algorithm and is usually minimal; and the actuator delay τ4 of the vehicle powertrain and hydraulic brake systems is usually between 0.3 and 0.6 seconds [17], [18]. Adding them together, a conservative estimate of the total delay time τ of a vehicle would be roughly between 0.4 and 0.8 seconds, without communication loss. Thus, we test the delay-aware (DAMA-DDPG) and delay-unaware (MA-DDPG) algorithms under delays τ of 0.4 and 0.8 seconds.
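In these experiments the delay times (0.4 s and 0.8 s) are integer multiples of the simulation timestep, so the DA-MG delay step is simply $k_i = \tau_i / \Delta t$. The helper below is a hypothetical conversion sketch (rounding for non-integer ratios is our simplification, not the first-order approximation of [33]):

```python
def delay_steps(tau_components, dt):
    """Convert a vehicle's total delay tau = tau1 + tau2 + tau3 + tau4
    into an integer delay step k for the DA-MG."""
    tau = sum(tau_components)          # communication + sensor + decision + actuator
    return max(1, round(tau / dt))

# e.g. roughly 0.8 s of total delay with dt = 0.2 s gives k = 4
k = delay_steps([0.0011, 0.2, 0.0, 0.6], dt=0.2)
```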

There are three possible outcomes for each experiment: success, stuck, and crash, as shown in Fig. 6. We evaluate the performance of the learned policies based on the success rate and the crash rate. The results are shown in Fig. 8. It is shown that delay-awareness drastically improves the performance of the vehicles, both in success rate and crash rate. The delay-aware agents successfully learn how to cooperate and finish both tasks without crashes, while the delay-unaware agents suffer from the huge model error introduced by delay and are not able to learn good policies: under a 0.8-second delay, the success rate is less than 40% for the unsignalized intersection task. The results are reasonable: considering a velocity of 10 m/s, a 0.8-second delay could cause a position error of 8 m, which injects huge uncertainty and bias into the state understanding of the agents. With highly biased observations, the agents are not able to learn good policies to finish the task.

VI. CONCLUSION

In this work, we propose a novel framework to deal with delays as well as the non-stationary training issue of multi-agent tasks with model-free deep reinforcement learning. We formally define a general model for multi-agent delayed systems, the Delay-Aware Markov Game, by augmenting the standard Markov Game with agent delays while maintaining the Markov property. The solidity of this new structure is proved with the Markov reward process. For the agent training part, we propose a delay-aware algorithm that adopts the paradigm of centralized training with decentralized execution, and refer to it as delay-aware multi-agent reinforcement learning. Experiments are conducted in the multi-agent particle environment as well as in a practical traffic simulator with autonomous vehicles. Results show that the proposed delay-aware multi-agent reinforcement learning algorithm greatly alleviates the performance degradation introduced by delay.


Fig. 8: Success rate and crash rate of MA-DDPG and DAMA-DDPG in the unsignalized intersection scenario. It is shown that delay-awareness drastically improves the performance of the vehicles, both in success rate and crash rate. Delay-unaware agents suffer from huge model error and are not able to learn a good policy.

Though the delay problem in multi-agent systems is elegantly formalized and solved, the state-augmenting procedure increases the dimension of the problem. A promising direction for future work is to improve the sample efficiency by incorporating opponent modeling into the framework.

APPENDIX

Theorem 1. A set of policies $\boldsymbol{\pi} : \mathcal{A} \times \boldsymbol{\mathcal{X}} \to [0, 1]$ interacting with $\mathrm{DAMG}(E, \boldsymbol{k})$ in the delay-free manner produces the same Markov Reward Process as $\boldsymbol{\pi}$ interacting with $\mathrm{MG}(E)$ with $\boldsymbol{k}$ action delays for the agents, i.e.,
$$\mathrm{DAMRP}(\mathrm{MG}(E), \boldsymbol{\pi}, \boldsymbol{k}) = \mathrm{MRP}(\mathrm{DAMG}(E, \boldsymbol{k}), \boldsymbol{\pi}). \tag{2}$$

Proof. For any $\mathrm{MG}(E) = (\mathcal{S}, \mathcal{A}, \mathcal{O}, \rho, p, r)$, we need to prove that the above two MRPs are the same. Referring to Def. 4 and 5, for $\mathrm{MRP}(\mathrm{DAMG}(E, \boldsymbol{k}), \boldsymbol{\pi})$ we have
(1) augmented state space $\boldsymbol{\mathcal{X}} = \mathcal{S} \times \mathcal{A}_1^{k_1} \times \cdots \times \mathcal{A}_N^{k_N}$,
(2) initial distribution
$$\boldsymbol{\rho}(\boldsymbol{x}_0) = \boldsymbol{\rho}(s_0, a^1_0, \ldots, a^1_{k_1-1}, \ldots, a^N_0, \ldots, a^N_{k_N-1}) = \rho(s_0) \prod_{i=1}^{N} \prod_{j=0}^{k_i-1} \delta(a^i_j - c^i_j),$$
(3) transition kernel
$$\begin{aligned} \boldsymbol{\kappa}(\boldsymbol{x}_{t+1} | \boldsymbol{x}_t) &= \int_{\mathcal{A}} \boldsymbol{p}(\boldsymbol{x}_{t+1} | \boldsymbol{x}_t, \boldsymbol{a}_t) \, \boldsymbol{\pi}(\boldsymbol{a}_t | \boldsymbol{x}_t) \, d\boldsymbol{a}_t \\ &= \int_{\mathcal{A}} p(s_{t+1} | s_t, a^{1,(t)}_t, \ldots, a^{N,(t)}_t) \prod_{i=1}^{N} \prod_{j=1}^{k_i-1} \delta(a^{i,(t+1)}_{t+j} - a^{i,(t)}_{t+j}) \prod_{i=1}^{N} \delta(a^{i,(t+1)}_{t+k_i} - \boldsymbol{a}^i_t) \, \boldsymbol{\pi}(\boldsymbol{a}_t | \boldsymbol{x}_t) \, d\boldsymbol{a}_t \\ &= p(s_{t+1} | s_t, a^{1,(t)}_t, \ldots, a^{N,(t)}_t) \prod_{i=1}^{N} \prod_{j=1}^{k_i-1} \delta(a^{i,(t+1)}_{t+j} - a^{i,(t)}_{t+j}) \prod_{i=1}^{N} \boldsymbol{\pi}_i(a^{i,(t+1)}_{t+k_i} | \boldsymbol{o}^i_t), \end{aligned}$$
(4) state-reward function
$$\begin{aligned} \boldsymbol{r}_i(\boldsymbol{x}_t) &= \int_{\mathcal{A}_i} \boldsymbol{r}_i(\boldsymbol{x}_t, \boldsymbol{a}_t) \, \boldsymbol{\pi}_i(\boldsymbol{a}^i_t | \boldsymbol{x}_t) \, d\boldsymbol{a}^i_t = \int_{\mathcal{A}_i} r_i(s_t, a^i_t) \, \boldsymbol{\pi}_i(\boldsymbol{a}^i_t | \boldsymbol{x}_t) \, d\boldsymbol{a}^i_t \\ &= r_i(s_t, a^i_t) \int_{\mathcal{A}_i} \boldsymbol{\pi}_i(\boldsymbol{a}^i_t | \boldsymbol{x}_t) \, d\boldsymbol{a}^i_t = r_i(s_t, a^i_t), \end{aligned}$$
for each agent $i$. Since the expanded terms of $\mathrm{MRP}(\mathrm{DAMG}(E, \boldsymbol{k}), \boldsymbol{\pi})$ match the corresponding terms of $\mathrm{DAMRP}(\mathrm{MG}(E), \boldsymbol{\pi}, \boldsymbol{k})$ (Def. 6), Eq. 2 holds.

REFERENCES

[1] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, "Playing Atari with deep reinforcement learning," arXiv preprint arXiv:1312.5602, 2013.

[2] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot et al., "Mastering the game of Go with deep neural networks and tree search," Nature, vol. 529, no. 7587, p. 484, 2016.

[3] L. Matignon, L. Jeanpierre, and A.-I. Mouaddib, "Coordinated multi-robot exploration under communication constraints using decentralized Markov decision processes," in Twenty-Sixth AAAI Conference on Artificial Intelligence, 2012.

[4] J. Foerster, I. A. Assael, N. De Freitas, and S. Whiteson, "Learning to communicate with deep multi-agent reinforcement learning," in Advances in Neural Information Processing Systems, 2016, pp. 2137–2145.

[5] I. Mordatch and P. Abbeel, "Emergence of grounded compositional language in multi-agent populations," in Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

[6] S. Sukhbaatar, R. Fergus et al., "Learning multiagent communication with backpropagation," in Advances in Neural Information Processing Systems, 2016, pp. 2244–2252.

[7] P. Peng, Q. Yuan, Y. Wen, Y. Yang, Z. Tang, H. Long, and J. Wang, "Multiagent bidirectionally-coordinated nets for learning to play StarCraft combat games," arXiv preprint arXiv:1703.10069, vol. 2, 2017.

[8] P. Hernandez-Leal, M. Kaisers, T. Baarslag, and E. M. de Cote, "A survey of learning in multiagent environments: Dealing with non-stationarity," arXiv preprint arXiv:1707.09183, 2017.

[9] L. Bu, R. Babu, B. De Schutter et al., "A comprehensive survey of multiagent reinforcement learning," IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), vol. 38, no. 2, pp. 156–172, 2008.

[10] A. K. Agogino and K. Tumer, "Unifying temporal and structural credit assignment problems," in Proceedings of the Third International Joint Conference on Autonomous Agents and Multiagent Systems - Volume 2. IEEE Computer Society, 2004, pp. 980–987.

[11] L. Matignon, G. J. Laurent, and N. Le Fort-Piat, "Independent reinforcement learners in cooperative Markov games: a survey regarding coordination problems," The Knowledge Engineering Review, vol. 27, no. 1, pp. 1–31, 2012.

[12] P. Hernandez-Leal, B. Kartal, and M. E. Taylor, "A survey and critique of multiagent deep reinforcement learning," Autonomous Agents and Multi-Agent Systems, vol. 33, no. 6, pp. 750–797, 2019.

[13] M. Lazarevic, "Finite time stability analysis of PDα fractional control of robotic time-delay systems," Mechanics Research Communications, vol. 33, no. 2, pp. 269–279, 2006.

[14] R. Hannah and W. Yin, "On unbounded delays in asynchronous parallel fixed-point algorithms," Journal of Scientific Computing, vol. 76, no. 1, pp. 299–326, 2018.

[15] K. Gu and S.-I. Niculescu, "Survey on recent results in the stability and control of time-delay systems," Journal of Dynamic Systems, Measurement, and Control, vol. 125, no. 2, pp. 158–165, 2003.

[16] S. Gong, J. Shen, and L. Du, "Constrained optimization and distributed computation based car following control of a connected and autonomous vehicle platoon," Transportation Research Part B: Methodological, vol. 94, pp. 314–334, 2016.

[17] F. P. Bayan, A. D. Cornetto, A. Dunn, and E. Sauer, "Brake timing measurements for a tractor-semitrailer under emergency braking," SAE International Journal of Commercial Vehicles, vol. 2, no. 2009-01-2918, pp. 245–255, 2009.

[18] R. Rajamani, Vehicle Dynamics and Control. Springer Science & Business Media, 2011.

[19] M. Wang, S. P. Hoogendoorn, W. Daamen, B. van Arem, B. Shyrokau, and R. Happee, "Delay-compensating strategy to enhance string stability of adaptive cruise controlled vehicles," Transportmetrica B: Transport Dynamics, vol. 6, no. 3, pp. 211–229, 2018.

[20] Z. Artstein, "Linear systems with delayed controls: A reduction," IEEE Transactions on Automatic Control, vol. 27, no. 4, pp. 869–879, 1982.

[21] E. Moulay, M. Dambrine, N. Yeganefar, and W. Perruquetti, "Finite-time stability and stabilization of time-delay systems," Systems & Control Letters, vol. 57, no. 7, pp. 561–566, 2008.

[22] K. J. Astrom, C. C. Hang, and B. Lim, "A new Smith predictor for controlling a process with an integrator and long dead-time," IEEE Transactions on Automatic Control, vol. 39, no. 2, pp. 343–345, 1994.

[23] M. R. Matausek and A. Micic, "On the modified Smith predictor for controlling a process with an integrator and long dead-time," IEEE Transactions on Automatic Control, vol. 44, no. 8, pp. 1603–1606, 1999.

[24] L. Mirkin, "On the extraction of dead-time controllers from delay-free parametrizations," IFAC Proceedings Volumes, vol. 33, no. 23, pp. 169–174, 2000.

[25] S.-I. Niculescu, Delay Effects on Stability: A Robust Control Approach. Springer Science & Business Media, 2001, vol. 269.

[26] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. MIT Press, 2018.

[27] S. P. Singh, T. Jaakkola, and M. I. Jordan, "Learning without state-estimation in partially observable Markovian decision processes," in Machine Learning Proceedings 1994. Elsevier, 1994, pp. 284–292.

[28] T. J. Walsh, A. Nouri, L. Li, and M. L. Littman, "Learning and planning in environments with delayed feedback," Autonomous Agents and Multi-Agent Systems, vol. 18, no. 1, p. 83, 2009.

[29] V. Firoiu, T. Ju, and J. Tenenbaum, "At human speed: Deep reinforcement learning with action delay," arXiv preprint arXiv:1810.07286, 2018.

[30] S. Ramstedt and C. Pal, "Real-time reinforcement learning," in Advances in Neural Information Processing Systems, 2019, pp. 3067–3076.

[31] B. Chen, M. Xu, L. Li, and D. Zhao, "Delay-aware model-based reinforcement learning for continuous control," 2020.

[32] K. V. Katsikopoulos and S. E. Engelbrecht, "Markov decision processes with delays and asynchronous cost collection," IEEE Transactions on Automatic Control, vol. 48, no. 4, pp. 568–574, 2003.

[33] E. Schuitema, L. Busoniu, R. Babuska, and P. Jonker, "Control delay in reinforcement learning for real-time dynamic systems: a memoryless approach," in 2010 IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, 2010, pp. 3226–3231.

[34] M. Tan, "Multi-agent reinforcement learning: Independent vs. cooperative agents," in Proceedings of the Tenth International Conference on Machine Learning, 1993, pp. 330–337.

[35] G. Papoudakis, F. Christianos, A. Rahman, and S. V. Albrecht, "Dealing with non-stationarity in multi-agent deep reinforcement learning," arXiv preprint arXiv:1906.04737, 2019.

[36] R. Lowe, Y. Wu, A. Tamar, J. Harb, O. P. Abbeel, and I. Mordatch, "Multi-agent actor-critic for mixed cooperative-competitive environments," in Advances in Neural Information Processing Systems, 2017, pp. 6379–6390.

[37] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, "Continuous control with deep reinforcement learning," arXiv preprint arXiv:1509.02971, 2015.

[38] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu, "Asynchronous methods for deep reinforcement learning," in International Conference on Machine Learning, 2016, pp. 1928–1937.

[39] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, "Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor," arXiv preprint arXiv:1801.01290, 2018.

[40] E. Leurent, "An environment for autonomous driving decision-making," https://github.com/eleurent/highway-env, 2018.

[41] E. Jang, S. Gu, and B. Poole, "Categorical reparameterization with Gumbel-Softmax," arXiv preprint arXiv:1611.01144, 2016.

[42] S. Biswas, R. Tatchikou, and F. Dion, "Vehicle-to-vehicle wireless communication protocols for enhancing highway traffic safety," IEEE Communications Magazine, vol. 44, no. 1, pp. 74–82, 2006.

[43] S. Ammoun, F. Nashashibi, and C. Laurgeau, "Real-time crash avoidance system on crossroads based on 802.11 devices and GPS receivers," in 2006 IEEE Intelligent Transportation Systems Conference. IEEE, 2006, pp. 1023–1028.