
Deep Reinforcement Learning for Autonomous Driving: A Survey

B Ravi Kiran1, Ibrahim Sobh2, Victor Talpaert3, Patrick Mannion4, Ahmad A. Al Sallab2, Senthil Yogamani5, Patrick Pérez6

Abstract—With the development of deep representation learning, the domain of reinforcement learning (RL) has become a powerful learning framework now capable of learning complex policies in high dimensional environments. This review summarises deep reinforcement learning (DRL) algorithms and provides a taxonomy of automated driving tasks where (D)RL methods have been employed, while addressing key computational challenges in real world deployment of autonomous driving agents. It also delineates adjacent domains such as behavior cloning, imitation learning and inverse reinforcement learning that are related but are not classical RL algorithms. The role of simulators in training agents, and methods to validate, test and robustify existing solutions in RL, are also discussed.

Index Terms—Deep reinforcement learning, Autonomous driving, Imitation learning, Inverse reinforcement learning, Controller learning, Trajectory optimisation, Motion planning, Safe reinforcement learning.

I. INTRODUCTION

Autonomous driving (AD)1 systems consist of multiple perception-level tasks that have now achieved high precision on account of deep learning architectures. Besides perception, autonomous driving systems involve multiple tasks where classical supervised learning methods are no longer applicable. First, the prediction of the agent's action changes the future sensor observations received from the environment in which the autonomous driving agent operates; an example is the task of selecting the optimal driving speed in an urban area. Second, supervisory signals such as time to collision (TTC) or the lateral error w.r.t. the optimal trajectory of the agent represent the dynamics of the agent as well as the uncertainty in the environment. Such problems require defining a stochastic cost function to be maximized. Third, the agent is required to learn new configurations of the environment, as well as to predict an optimal decision at each instant while driving in its environment. This represents a high dimensional space, given the number of unique configurations in which the agent and environment can be observed, which is combinatorially large. In all such scenarios we are aiming to solve a sequential decision process, which is formalized under the classical settings of Reinforcement Learning (RL), where the agent is required to learn and represent its environment as well as act optimally at each instant [1]. The mapping from states to optimal actions is referred to as the policy.

1 Navya, Paris. [email protected]
2 Valeo Cairo AI team, Egypt. {ibrahim.sobh, ahmad.el-sallab}@valeo.com
3 U2IS, ENSTA Paris, Institut Polytechnique de Paris & AKKA Technologies, France. [email protected]
4 School of Computer Science, National University of Ireland, Galway. [email protected]
5 Valeo Vision Systems. [email protected]
6 Valeo.ai. [email protected]
1 For easy reference, the main acronyms used in this article are listed in Appendix (Table IV).

In this review we cover the notions of reinforcement learning, and a taxonomy of tasks where RL is a promising solution, especially in the domains of driving policy, predictive perception, path and motion planning, and low level controller design. We also focus our review on the different real world deployments of RL in the domain of autonomous driving, expanding our conference paper [2], since their deployment has not been reviewed in an academic setting. Finally, we motivate users by demonstrating the key computational challenges and risks when applying current day RL algorithms such as imitation learning and deep Q-learning, among others. We also note from the publication trends in figure 2 that the use of RL or deep RL applied to autonomous driving or the self driving domain is an emergent field. This is due to the only recent adoption of RL/DRL algorithms in the domain, leaving open multiple real world challenges in implementation and deployment. We address the open problems in Section VI.

The main contributions of this work can be summarized as follows:

• Self-contained overview of RL background for the automotive community as it is not well known.

• Detailed literature review of using RL for different autonomous driving tasks.

• Discussion of the key challenges and opportunities for RL applied to real world autonomous driving.

The rest of the paper is organized as follows. Section II provides an overview of components of a typical autonomous driving system. Section III provides an introduction to reinforcement learning and briefly discusses key concepts. Section IV discusses more sophisticated extensions on top of the basic RL framework. Section V provides an overview of RL applications for autonomous driving problems. Section VI discusses challenges in deploying RL for real-world autonomous driving systems. Section VII concludes this paper with some final remarks.

II. COMPONENTS OF AD SYSTEM

Figure 1 shows the standard blocks of an AD system, demonstrating the pipeline from sensor stream to control actuation.


Fig. 1. Standard components in a modern autonomous driving systems pipeline listing the various tasks. The key problems addressed by these modules are Scene Understanding, Decision and Planning.

The sensor architecture in a modern autonomous driving system notably includes multiple sets of cameras, radars and LIDARs, as well as a GPS-GNSS system for absolute localisation, and Inertial Measurement Units (IMUs) that provide the 3D pose of the vehicle in space.

The goal of the perception module is the creation of an intermediate level representation of the environment state (for example a bird-eye view map of all obstacles and agents) that is to be later utilised by a decision making system that ultimately produces the driving policy. This state would include lane position, drivable zone, location of agents such as cars & pedestrians, state of traffic lights and others. Uncertainties in the perception propagate to the rest of the information chain. Robust sensing is critical for safety, thus using redundant sources increases confidence in detection. This is achieved by a combination of several perception tasks like semantic segmentation [3], [4], motion estimation [5], depth estimation [6], soiling detection [7], etc., which can be efficiently unified into a multi-task model [8], [9].

A. Scene Understanding

This key module maps the abstract mid-level representation of the perception state obtained from the perception module to the high level action or decision making module. Conceptually, three tasks are grouped by this module, as seen in figure 1: scene understanding, decision and planning. The scene understanding module aims to provide a higher level understanding of the scene; it is built on top of the algorithmic tasks of detection or localisation. By fusing heterogeneous sensor sources, it aims to robustly generalise to situations as the content becomes more abstract. This information fusion provides a general and simplified context for the decision making components.

Fusion provides a sensor agnostic representation of the environment and models the sensor noise and detection uncertainties across multiple modalities such as LIDAR, camera, radar, ultra-sound. This basically requires weighting the predictions in a principled way.

B. Localization and Mapping

Mapping is one of the key pillars of automated driving [10]. Once an area is mapped, the current position of the vehicle can be localized within the map. The first reliable demonstrations of automated driving by Google were primarily reliant on localisation to pre-mapped areas. Because of the scale of the problem, traditional mapping techniques are augmented by semantic object detection for reliable disambiguation. In addition, localised high definition maps (HD maps) can be used as a prior for object detection.

C. Planning and Driving policy

Trajectory planning is a crucial module in the autonomous driving pipeline. Given a route-level plan from HD maps or GPS based maps, this module is required to generate motion-level commands that steer the agent.

Classical motion planning ignores dynamics and differential constraints while using translations and rotations required to move an agent from source to destination poses [11]. A robotic agent capable of controlling 6 degrees of freedom (DOF) is said to be holonomic, while an agent with fewer controllable DOFs than its total DOF is said to be non-holonomic. Classical algorithms such as the A∗ algorithm, based on Dijkstra's algorithm, do not work in the non-holonomic case for autonomous driving. Rapidly-exploring random trees (RRT) [12] are non-holonomic algorithms that explore the configuration space by random sampling and obstacle free path generation. Various versions of RRT are currently used for motion planning in autonomous driving pipelines.

D. Control

A controller defines the speed, steering angle and braking actions necessary at every point in the path obtained from a pre-determined map such as Google maps, or from an expert driving recording of the same values at every waypoint. Trajectory tracking, in contrast, involves a temporal model of the dynamics of the vehicle viewing the waypoints sequentially over time.

Current vehicle control methods are founded in classical optimal control theory, which can be stated as the minimisation of a cost function defined over a set of states x(t) and control actions u(t), subject to the vehicle dynamics ẋ(t) = f(x(t), u(t)).


Fig. 2. Trend of publications for keywords 1. "reinforcement learning", 2. "deep reinforcement", and 3. "reinforcement learning" AND ("autonomous cars" OR "autonomous vehicles" OR "self driving"), showing academic publication trends from [13].

The control input is usually defined over a finite time horizon and restricted to a feasible state space x ∈ Xfree [14]. Velocity control is based on classical closed loop control methods such as PID (proportional-integral-derivative) controllers and MPC (model predictive control). PIDs aim to minimise a cost function consisting of three terms: the current error (proportional term), the effect of past errors (integral term), and the effect of future errors (derivative term). The family of MPC methods aims to stabilize the behavior of the vehicle while tracking the specified path [15]. A review of controllers, motion planning and learning based approaches for the same is provided in [16] for interested readers. Optimal control and reinforcement learning are intimately related: optimal control can be viewed as a model based reinforcement learning problem where the dynamics of the vehicle/environment are modeled by well defined differential equations. Reinforcement learning methods were developed to handle stochastic control problems as well as ill-posed problems with unknown rewards and state transition probabilities. Autonomous vehicle stochastic control is a large domain, and we advise readers to consult the survey on this subject by the authors of [17].
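To make the PID formulation above concrete, here is a minimal sketch of a longitudinal (velocity) PID controller in Python; the gains, time step and first-order vehicle model are illustrative assumptions rather than values from the works cited here.

```python
import numpy as np

class PID:
    """Minimal PID controller: u = Kp*e + Ki*integral(e) + Kd*de/dt."""
    def __init__(self, kp, ki, kd, dt):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.prev_error = 0.0

    def step(self, error):
        self.integral += error * self.dt                  # accumulate past errors
        derivative = (error - self.prev_error) / self.dt  # estimate the error trend
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

# Toy usage: track a 10 m/s speed setpoint with an assumed first-order vehicle model.
pid = PID(kp=0.8, ki=0.1, kd=0.05, dt=0.1)
speed, target = 0.0, 10.0
for _ in range(100):
    throttle = np.clip(pid.step(target - speed), -1.0, 1.0)
    speed += 0.1 * (5.0 * throttle - 0.05 * speed)        # assumed toy longitudinal dynamics
print(f"final speed ~ {speed:.2f} m/s")
```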

III. REINFORCEMENT LEARNING

Machine learning (ML) is a process whereby a computer program learns from experience to improve its performance at a specified task [18]. ML algorithms are often classified under one of three broad categories: supervised learning, unsupervised learning and reinforcement learning (RL). Supervised learning algorithms are based on inductive inference where the model is typically trained using labelled data to perform classification or regression, whereas unsupervised learning encompasses techniques such as density estimation or clustering applied to unlabelled data. By contrast, in the RL paradigm an autonomous agent learns to improve its performance at an assigned task by interacting with its environment. Russell and Norvig define an agent as “anything that can be viewed as perceiving its environment through sensors and acting upon that environment through actuators” [19]. RL agents are not told explicitly how to act by an expert; rather an agent's performance is evaluated by a reward function R. For each state experienced, the agent chooses an action and receives an occasional reward from its environment based on the usefulness of its decision. The goal for the agent is to maximize the cumulative rewards received over its lifetime. Gradually, the agent can increase its long-term reward by exploiting knowledge learned about the expected utility (i.e. discounted sum of expected future rewards) of different state-action pairs.

One of the main challenges in reinforcement learning is managing the trade-off between exploration and exploitation. To maximize the rewards it receives, an agent must exploit its knowledge by selecting actions which are known to result in high rewards. On the other hand, to discover such beneficial actions, it has to take the risk of trying new actions which may lead to higher rewards than the current best-valued actions for each system state. In other words, the learning agent has to exploit what it already knows in order to obtain rewards, but it also has to explore the unknown in order to make better action selections in the future. Examples of strategies which have been proposed to manage this trade-off include ε-greedy and softmax. When adopting the ubiquitous ε-greedy strategy, an agent either selects an action at random with probability 0 < ε < 1, or greedily selects the highest valued action for the current state with the remaining probability 1 − ε. Intuitively, the agent should explore more at the beginning of the training process when little is known about the problem environment. As training progresses, the agent may gradually conduct more exploitation than exploration. The design of exploration strategies for RL agents is an area of active research (see e.g. [20]).

Fig. 3. A graphical decomposition of the different components of an RL algorithm. It also demonstrates the different challenges encountered while training a D(RL) algorithm.
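As a small illustration of the ε-greedy strategy just described, the sketch below selects actions from a table of Q-value estimates; the decay schedule and the toy reward are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(q_values, epsilon):
    """Pick a random action with probability epsilon, else the greedy (highest-value) one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))     # explore
    return int(np.argmax(q_values))                 # exploit

# Toy usage: anneal epsilon from 1.0 to 0.05 so the agent explores early
# and exploits more as its estimates improve.
q = np.zeros(4)                                     # Q-estimates for 4 actions in one state
epsilon, decay, eps_min = 1.0, 0.995, 0.05
for step in range(1000):
    a = epsilon_greedy(q, epsilon)
    reward = rng.normal(loc=a * 0.1)                # assumed toy reward: higher actions pay more
    q[a] += 0.1 * (reward - q[a])                   # running-average value update
    epsilon = max(eps_min, epsilon * decay)
print("learned preference:", int(np.argmax(q)))
```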


Markov decision processes (MDPs) are considered the de facto standard when formalising sequential decision making problems involving a single RL agent [21]. An MDP consists of a set S of states, a set A of actions, a transition function T and a reward function R [22], i.e. a tuple < S, A, T, R >. When in any state s ∈ S, selecting an action a ∈ A will result in the environment entering a new state s′ ∈ S with a transition probability T(s, a, s′) ∈ (0,1), and give a reward R(s, a). This process is illustrated in Fig. 3. The stochastic policy π : S → D is a mapping from the state space to a probability distribution over the set of actions, and π(a|s) represents the probability of choosing action a at state s. The goal is to find the optimal policy π∗, which results in the highest expected sum of discounted rewards [21]:

π∗ = argmax_π Vπ(s),  where  Vπ(s) := Eπ { ∑_{k=0}^{H−1} γ^k r_{k+1} | s0 = s },    (1)

for all states s ∈ S, where r_k = R(s_k, a_k) is the reward at time k and Vπ(s), the ‘value function’ at state s following a policy π, is the expected ‘return’ (or ‘utility’) when starting at s and following the policy π thereafter [1]. An important, related concept is the action-value function, a.k.a. the ‘Q-function’, defined as:

Qπ(s,a) = Eπ { ∑_{k=0}^{H−1} γ^k r_{k+1} | s0 = s, a0 = a }.    (2)

The discount factor γ ∈ [0,1] controls how an agent regards future rewards. Low values of γ encourage myopic behaviour where an agent will aim to maximise short term rewards, whereas high values of γ cause agents to be more forward-looking and to maximise rewards over a longer time frame. The horizon H refers to the number of time steps in the MDP. In infinite-horizon problems H = ∞, whereas in episodic domains H has a finite value. Episodic domains may terminate after a fixed number of time steps, or when an agent reaches a specified goal state. The last state reached in an episodic domain is referred to as the terminal state. In finite-horizon or goal-oriented domains, discount factors of (close to) 1 may be used to encourage agents to focus on achieving the goal, whereas in infinite-horizon domains lower discount factors may be used to strike a balance between short- and long-term rewards. If the optimal policy for an MDP is known, then Vπ∗ may be used to determine the maximum expected discounted sum of rewards available from any arbitrary initial state. A rollout is a trajectory produced in the state space by sequentially applying a policy to an initial state. An MDP satisfies the Markov property, i.e. system state transitions are dependent only on the most recent state and action, not on the full history of states and actions in the decision process. Moreover, in many real-world application domains, it is not possible for an agent to observe all features of the environment state; in such cases the decision-making problem is formulated as a partially-observable Markov decision process (POMDP). Solving a reinforcement learning task means finding a policy π that maximises the expected discounted sum of rewards over trajectories in the state space. RL agents may learn value function estimates, policies and/or environment models directly. Dynamic programming (DP) refers to a collection of algorithms that can be used to compute optimal policies given a perfect model of the environment in terms of reward and transition functions. Unlike DP, in Monte Carlo methods there is no assumption of complete environment knowledge. Monte Carlo methods are incremental in an episode-by-episode sense. Upon the completion of an episode, the value estimates and policies are updated. Temporal Difference (TD) methods, on the other hand, are incremental in a step-by-step sense, making them applicable to non-episodic scenarios. Like Monte Carlo methods, TD methods can learn directly from raw experience without a model of the environment's dynamics. Like DP, TD methods learn their estimates based on other estimates.
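Since DP assumes a perfect model of rewards and transitions, a short value-iteration sketch over a small synthetic MDP can make this concrete; the randomly generated model below is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n_states, n_actions, gamma = 5, 2, 0.9

# Assumed toy model: T[s, a, s'] transition probabilities and R[s, a] rewards.
T = rng.random((n_states, n_actions, n_states))
T /= T.sum(axis=2, keepdims=True)
R = rng.random((n_states, n_actions))

# Value iteration: repeatedly apply the Bellman optimality backup until convergence.
V = np.zeros(n_states)
for _ in range(1000):
    Q = R + gamma * T @ V            # Q[s, a] = R[s, a] + gamma * sum_s' T[s, a, s'] V[s']
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new
policy = Q.argmax(axis=1)            # greedy policy w.r.t. the converged values
print("V* ~", np.round(V, 3), "policy:", policy)
```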

A. Value-based methods

Q-learning is one of the most commonly used RL algorithms. It is a model-free TD algorithm that learns estimates of the utility of individual state-action pairs (the Q-functions defined in Eqn. 2). Q-learning has been shown to converge to the optimum state-action values for an MDP with probability 1, so long as all actions in all states are sampled infinitely often and the state-action values are represented discretely [23]. In practice, Q-learning will learn (near) optimal state-action values provided a sufficient number of samples are obtained for each state-action pair. If a Q-learning agent has converged to the optimal Q values for an MDP and selects actions greedily thereafter, it will receive the same expected sum of discounted rewards as calculated by the value function with π∗ (assuming that the same arbitrary initial starting state is used for both). Agents implementing Q-learning update their Q values according to the following update rule:

Q(s,a) ← Q(s,a) + α [ r + γ max_{a′∈A} Q(s′,a′) − Q(s,a) ],    (3)

where Q(s,a) is an estimate of the utility of selecting action a in state s, α ∈ [0,1] is the learning rate which controls the degree to which Q values are updated at each time step, and γ ∈ [0,1] is the same discount factor used in Eqn. 1. The theoretical guarantees of Q-learning hold with any arbitrary initial Q values [23]; therefore the optimal Q values for an MDP can be learned by starting with any initial action value function estimate. The initialisation can be optimistic (each Q(s,a) returns the maximum possible reward), pessimistic (minimum) or even use knowledge of the problem to ensure faster convergence. Deep Q-Networks (DQN) [24] incorporate a variant of the Q-learning algorithm [25], using deep neural networks (DNNs) as a non-linear Q function approximator over high-dimensional state spaces (e.g. the pixels in a frame of an Atari game). Practically, the neural network predicts the value of all actions without the use of any explicit domain-specific information or hand-designed features. DQN applies the experience replay technique to break the correlation between successive experience samples and also to improve sample efficiency. For increased stability, two networks are used: the parameters of the target network are fixed for a number of iterations while the parameters of the online network are updated. Readers are directed to sub-section III-E for a more detailed introduction to the use of DNNs in Deep RL.
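A minimal tabular Q-learning loop implementing the update rule of Eqn. 3 is sketched below; the chain environment and hyperparameters are toy assumptions, not settings from the surveyed papers.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 6, 2            # toy chain: action 1 moves right, action 0 moves left
alpha, gamma, epsilon = 0.1, 0.95, 0.1
Q = np.zeros((n_states, n_actions))

def step(s, a):
    """Assumed toy dynamics: reaching the right end of the chain pays +1 and ends the episode."""
    s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    reward = 1.0 if s_next == n_states - 1 else 0.0
    return s_next, reward, s_next == n_states - 1

for episode in range(500):
    s, done = 0, False
    while not done:
        if rng.random() < epsilon:
            a = int(rng.integers(n_actions))                          # explore
        else:
            a = int(rng.choice(np.flatnonzero(Q[s] == Q[s].max())))   # greedy, random tie-break
        s_next, r, done = step(s, a)
        # Q-learning update (Eqn. 3): bootstrap from the greedy value of the next state.
        Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
        s = s_next

print("greedy policy:", Q.argmax(axis=1))   # should prefer action 1 (move right)
```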


B. Policy-based methods

The difference between value-based and policy-based methods is essentially a matter of where the burden of optimality resides. Both method types must propose actions and evaluate the resulting behaviour, but while value-based methods focus on evaluating the optimal cumulative reward and have a policy that follows their recommendations, policy-based methods aim to estimate the optimal policy directly, and the value function is secondary, if calculated at all. Typically, a policy is parameterised as a neural network πθ. Policy gradient methods use gradient descent to estimate the parameters of the policy that maximise the expected reward. The result can be a stochastic policy where actions are selected by sampling, or a deterministic policy. Many real-world applications have continuous action spaces. Deterministic policy gradient (DPG) algorithms [26], [1] allow reinforcement learning in domains with continuous actions. Silver et al. [26] proved that a deterministic policy gradient exists for MDPs satisfying certain conditions, and that deterministic policy gradients have a simple model-free form that follows the gradient of the action-value function. As a result, instead of integrating over both state and action spaces as in stochastic policy gradients, DPG integrates over the state space only, leading to fewer samples being needed in problems with large action spaces. To ensure sufficient exploration, actions are chosen using a stochastic policy, while learning a deterministic target policy. The REINFORCE [27] algorithm is a straightforward policy-based method. The discounted cumulative reward g_t = ∑_{k=0}^{H−1} γ^k r_{k+t+1} at one time step is calculated by playing the entire episode, so no estimator is required for policy evaluation. The parameters are updated in the direction of the performance gradient:

θ ← θ + α γ^t g_t ∇ log πθ(a|s),    (4)

where α is the learning rate for a stable incremental update. Intuitively, we want to encourage state-action pairs that result in the best possible returns. Trust Region Policy Optimization (TRPO) [28] works by preventing the updated policies from deviating too much from previous policies, thus reducing the chance of a bad update. TRPO optimises a surrogate objective function where the basic idea is to limit each policy gradient update as measured by the Kullback-Leibler (KL) divergence between the current and the newly proposed policy. This method results in monotonic improvements in policy performance. Proximal Policy Optimization (PPO) [29] proposes a clipped surrogate objective function that penalises overly large policy changes. Accordingly, PPO policy optimisation is simpler to implement, and has better sample complexity while ensuring that the deviation from the previous policy is relatively small.
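To illustrate the REINFORCE update of Eqn. 4, the sketch below trains a tabular softmax policy on a toy chain task by rolling out whole episodes and applying the update at every time step; the task, features and hyperparameters are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma, alpha = 4, 2, 0.99, 0.1
theta = np.zeros((n_states, n_actions))            # logits of a tabular softmax policy

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def step(s, a):
    """Assumed toy chain: action 1 moves right; reaching the last state pays +1 and terminates."""
    s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    return s_next, float(s_next == n_states - 1), s_next == n_states - 1

for episode in range(2000):
    s, done, traj = 0, False, []
    while not done and len(traj) < 100:             # roll out a full (capped) episode first
        probs = softmax(theta[s])
        a = int(rng.choice(n_actions, p=probs))
        s_next, r, done = step(s, a)
        traj.append((s, a, r))
        s = s_next
    g = 0.0
    for t, (s, a, r) in reversed(list(enumerate(traj))):
        g = r + gamma * g                           # discounted return g_t from time step t
        grad_log = -softmax(theta[s])
        grad_log[a] += 1.0                          # d log pi(a|s) / d logits for a softmax policy
        theta[s] += alpha * (gamma ** t) * g * grad_log   # REINFORCE update (Eqn. 4)

print("greedy actions per state:", theta.argmax(axis=1))
```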

C. Actor-critic methods

Actor-critic methods are hybrid methods that combine the benefits of policy-based and value-based algorithms. The policy structure that is responsible for selecting actions is known as the ‘actor’. The estimated value function criticises the actions made by the actor and is known as the ‘critic’. After each action selection, the critic evaluates the new state to determine whether the result of the selected action was better or worse than expected. Both networks need their gradients to learn. Let J(θ) := Eπθ[r] represent a policy objective function, where θ designates the parameters of a DNN. Policy gradient methods search for a local maximum of J(θ). Since optimization in continuous action spaces could be costly and slow, the DPG (Deterministic Policy Gradient) algorithm represents actions as a parameterised function µ(s|θµ), where θµ refers to the parameters of the actor network. The unbiased estimate of the policy gradient step is then given as:

∇θ J = −Eπθ { (g − b) ∇ log πθ(a|s) },    (5)

where b is the baseline. Using b ≡ 0 is the simplification that leads to the REINFORCE formulation. Williams [27] explains that a well chosen baseline can reduce variance, leading to more stable learning. The baseline b can be chosen as Vπ(s), Qπ(s,a) or the ‘Advantage’ Aπ(s,a), depending on the method. Deep Deterministic Policy Gradient (DDPG) [30] is a model-free, off-policy (please refer to subsection III-D for a detailed distinction), actor-critic algorithm that can learn policies for continuous action spaces using deep neural network based function approximation, extending prior work on DPG to large and high-dimensional state-action spaces. When selecting actions, exploration is performed by adding noise to the actor policy. Like DQN, a replay buffer is used to minimize data correlation and stabilise learning. A separate actor-critic specific target network is also used. Whereas normal Q-learning requires a restricted number of discrete actions, DDPG also needs a straightforward way to choose an action in a continuous space. Starting from Q-learning, we extend Eqn. 2 to define the optimal Q-value and optimal action as Q∗ and a∗:

Q∗(s,a) = max_π Qπ(s,a),    a∗ = argmax_a Q∗(s,a).    (6)

In the case of Q-learning, the action is chosen according to the Q-function as in Eqn. 6. But DDPG chains the evaluation of Q after the action has already been chosen according to the policy. By correcting the Q-values towards the optimal values using the chosen action, we also update the policy towards the optimal action proposition. Thus two separate networks work at estimating Q∗ and π∗.
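One practical detail of DDPG-style training is the slowly updated target network; the snippet below shows the commonly used soft (Polyak) update on plain parameter arrays, as a hedged illustration rather than a full DDPG implementation.

```python
import numpy as np

def soft_update(target_params, online_params, tau=0.005):
    """Polyak averaging: target <- tau * online + (1 - tau) * target, per parameter array."""
    return [tau * w_online + (1.0 - tau) * w_target
            for w_target, w_online in zip(target_params, online_params)]

# Toy usage with two "layers" represented as numpy arrays (stand-ins for network weights).
rng = np.random.default_rng(0)
online = [rng.normal(size=(4, 4)), rng.normal(size=(4,))]
target = [w.copy() for w in online]
for _ in range(100):
    online = [w + 0.01 * rng.normal(size=w.shape) for w in online]  # pretend gradient steps
    target = soft_update(target, online)
print("max |target - online|:", max(np.abs(t - o).max() for t, o in zip(target, online)))
```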

Asynchronous Advantage Actor Critic (A3C) [31] uses asynchronous gradient descent for optimization of deep neural network controllers. Deep reinforcement learning algorithms based on experience replay such as DQN and DDPG have demonstrated considerable success in difficult domains such as playing Atari games. However, experience replay uses a large amount of memory to store experience samples and requires off-policy learning algorithms. In A3C, instead of using an experience replay buffer, agents asynchronously execute on multiple parallel instances of the environment. In addition to reducing the correlation of the experiences, the parallel actor-learners have a stabilizing effect on the training process. This simple setup enables a much larger spectrum of on-policy as well as off-policy reinforcement learning algorithms to be applied robustly using deep neural networks. By combining several ideas, A3C exceeded the performance of the previous state-of-the-art at the time on the Atari domain while training for half the time on a single multi-core CPU instead of a GPU. It also demonstrates how using an estimate of the value function as the previously explained baseline b reduces variance and improves convergence time. By defining the advantage as Aπ(a, s) = Qπ(s,a) − Vπ(s), the expression of the policy gradient from Eqn. 5 is rewritten as ∇θ L = −Eπθ { Aπ(a, s) ∇ log πθ(a|s) }. The critic is trained to minimize ½ ‖Aπθ(a, s)‖². The intuition of using advantage estimates rather than just discounted returns is to allow the agent to determine not just how good its actions were, but also how much better they turned out to be than expected, leading to reduced variance and more stable training. The A3C model also demonstrated good performance in 3D environments such as labyrinth exploration. Advantage Actor Critic (A2C) is a synchronous version of the asynchronous advantage actor critic model that waits for each agent to finish its experience before conducting an update. The performance of both A2C and A3C is comparable. Most greedy policies must alternate between exploration and exploitation, and good exploration visits the states where the value estimate is uncertain. This way, exploration focuses on trying to find the most uncertain state paths as they bring valuable information. In addition to the advantage, explained earlier, some methods use the entropy as the uncertainty quantity. Most A3C implementations include this as well. Two methods with common authors are energy-based policies [32] and, more recent and in widespread use, the Soft Actor Critic (SAC) algorithm [33]; both rely on adding an entropy term to the reward function, so we update the policy objective from Eqn. 1 to Eqn. 7. We refer readers to [33] for an in depth explanation of the expression

π∗_MaxEnt = argmax_π Eπ { ∑_t [ r(s_t, a_t) + α H(π(·|s_t)) ] },    (7)

shown here for illustration of how the entropy H is added.
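As a small numerical illustration of the advantage and entropy terms discussed above, the helpers below compute a one-step advantage estimate and the policy entropy bonus; the values and the entropy weight are assumptions, and this is not an A3C or SAC implementation.

```python
import numpy as np

def one_step_advantage(reward, v_s, v_s_next, gamma=0.99, done=False):
    """A ~ r + gamma * V(s') - V(s); the bootstrap term is dropped at episode end."""
    target = reward + (0.0 if done else gamma * v_s_next)
    return target - v_s

def policy_entropy(probs, eps=1e-8):
    """H(pi(.|s)) = -sum_a pi(a|s) log pi(a|s), used as an exploration bonus."""
    probs = np.asarray(probs, dtype=float)
    return float(-(probs * np.log(probs + eps)).sum())

# Toy usage: a transition with reward 1.0 and a fairly uniform 3-action policy.
adv = one_step_advantage(reward=1.0, v_s=0.5, v_s_next=0.7)
ent = policy_entropy([0.4, 0.35, 0.25])
actor_objective = adv * np.log(0.4) + 0.01 * ent   # log-prob of chosen action; alpha = 0.01 assumed
print(round(adv, 3), round(ent, 3), round(actor_objective, 3))
```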

D. Model-based (vs. Model-free) & On/Off Policy methods

In practical situations, interacting with the real environment could be limited due to many reasons including safety and cost. Learning a model of the environment dynamics may reduce the amount of interaction required with the real environment. Moreover, exploration can be performed on the learned models. In the case of model-based approaches (e.g. Dyna-Q [34], R-max [35]), agents attempt to learn the transition function T and reward function R, which can be used when making action selections. Keeping a model approximation of the environment means storing knowledge of its dynamics, and allows for fewer, and sometimes costly, environment interactions. By contrast, in model-free approaches such knowledge is not a requirement. Instead, model-free learners sample the underlying MDP directly in order to gain knowledge about the unknown model, for example in the form of value function estimates. In Dyna-2 [36], the learning agent stores long-term and short-term memories, where a memory is defined as the set of features and corresponding parameters used by an agent to estimate the value function. Long-term memory is for general domain knowledge which is updated from real experience, while short-term memory is for specific local knowledge about the current situation, and the value function is a linear combination of the long and short term memories.

Learning algorithms can be on-policy or off-policy depending on whether the updates are conducted on fresh trajectories generated by the policy itself or by another policy, which could be an older version of the policy or one provided by an expert. On-policy methods such as SARSA [37] estimate the value of a policy while using the same policy for control. However, off-policy methods such as Q-learning [25] use two policies: the behavior policy, which is used to generate behavior, and the target policy, which is the one being improved. An advantage of this separation is that the target policy may be deterministic (greedy), while the behavior policy can continue to sample all possible actions [1].
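The on-policy/off-policy distinction shows up directly in the TD targets: SARSA bootstraps from the action actually taken next, whereas Q-learning bootstraps from the greedy action. The sketch below contrasts the two updates on a shared Q-table with illustrative numbers.

```python
import numpy as np

alpha, gamma = 0.1, 0.95

def sarsa_update(Q, s, a, r, s_next, a_next):
    """On-policy: the target uses the action a_next actually selected by the behaviour policy."""
    Q[s, a] += alpha * (r + gamma * Q[s_next, a_next] - Q[s, a])

def q_learning_update(Q, s, a, r, s_next):
    """Off-policy: the target uses the greedy action in s_next, regardless of what is executed."""
    Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])

# Toy usage on a 3-state, 2-action table with one hypothetical transition.
Q = np.zeros((3, 2))
Q[1] = [0.2, 0.9]                        # assumed existing estimates for the next state
sarsa_update(Q, s=0, a=0, r=1.0, s_next=1, a_next=0)   # bootstraps from Q[1, 0] = 0.2
q_learning_update(Q, s=0, a=1, r=1.0, s_next=1)        # bootstraps from max Q[1] = 0.9
print(np.round(Q, 3))
```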

E. Deep reinforcement learning (DRL)

Tabular representations are the simplest way to store learned estimates (of e.g. values, policies or models), where each state-action pair has a discrete estimate associated with it. When estimates are represented discretely, each additional feature tracked in the state leads to an exponential growth in the number of state-action pair values that must be stored [38]. This problem is commonly referred to in the literature as the “curse of dimensionality”, a term originally coined by Bellman [39]. In simple environments this is rarely an issue, but it may lead to an intractable problem in real-world applications, due to memory and/or computational constraints. Learning over a large state-action space is possible, but may take an unacceptably long time to learn useful policies. Many real-world domains feature continuous state and/or action spaces; these can be discretised in many cases. However, large discretisation steps may limit the achievable performance in a domain, whereas small discretisation steps may result in a large state-action space where obtaining a sufficient number of samples for each state-action pair is impractical. Alternatively, function approximation may be used to generalise across states and/or actions, whereby a function approximator is used to store and retrieve estimates. Function approximation is an active area of research in RL, offering a way to handle continuous state and/or action spaces, mitigate against the state-action space explosion and generalise prior experience to previously unseen state-action pairs. Tile coding is one of the simplest forms of function approximation, where one tile represents multiple states or state-action pairs [38]. Neural networks are also commonly used to implement function approximation, one of the most famous examples being Tesauro's application of RL to backgammon [40]. Recent work has applied deep neural networks as a function approximation method; this emerging paradigm is known as deep reinforcement learning (DRL). DRL algorithms have achieved human level performance (or above) on complex tasks such as playing Atari games [24] and playing the board game Go [41].

In DQN [24] it is demonstrated how a convolutional neural network can learn successful control policies from just raw video data for different Atari environments. The network was trained end-to-end and was not provided with any game specific information. The input to the convolutional neural network consists of an 84×84×4 tensor of 4 consecutive stacked frames used to capture the temporal information. Through consecutive layers, the network learns how to combine features in order to identify the action most likely to bring the best outcome. One layer consists of several convolutional filters. For instance, the first layer uses 32 filters with 8×8 kernels with stride 4 and applies a rectifier non-linearity. The second layer is 64 filters of 4×4 with stride 2, followed by a rectifier non-linearity. Next comes a third convolutional layer of 64 filters of 3×3 with stride 1 followed by a rectifier. The last intermediate layer is composed of 512 fully connected rectifier units. The output layer is a fully-connected linear layer with a single output for each valid action. For DQN training stability, two networks are used: the parameters of the target network are fixed for a number of iterations while the online network parameters are updated. For practical reasons, the Q(s,a) function is modeled as a deep neural network that predicts the value of all actions given the input state. Accordingly, deciding what action to take requires performing a single forward pass of the network. Moreover, in order to increase sample efficiency, experiences of the agent are stored in a replay memory (experience replay), and the Q-learning updates are conducted on randomly selected samples from the replay memory. This random selection breaks the correlation between successive samples. Experience replay enables reinforcement learning agents to remember and reuse experiences from the past: observed transitions are stored for some time, usually in a queue, and sampled uniformly from this memory to update the network. However, this approach simply replays transitions at the same frequency that they were originally experienced, regardless of their significance. An alternative method is to use two separate experience buckets, one for positive and one for negative rewards [42]; a fixed fraction from each bucket is then selected for replay. This method is only applicable in domains that have a natural notion of binary experience. Experience replay has also been extended with a framework for prioritising experience [43], where important transitions, based on the TD error, are replayed more frequently, leading to improved performance and faster training when compared to the standard experience replay approach.
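A uniform experience replay buffer of the kind described above can be written in a few lines; the sketch below is a generic illustration and not the exact data structure used in the DQN paper.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity FIFO buffer with uniform random sampling of stored transitions."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # oldest transitions are dropped automatically

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)   # uniform sampling breaks correlation
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)

# Toy usage with integer states standing in for image frames.
buf = ReplayBuffer(capacity=1000)
for t in range(50):
    buf.add(state=t, action=t % 4, reward=float(t % 2), next_state=t + 1, done=False)
print(len(buf), buf.sample(4)[1])   # buffer size and a sampled batch of actions
```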

The max operator in standard Q-learning and DQN uses the same values both to select and to evaluate an action, resulting in over-optimistic value estimates. Double DQN (D-DQN) [44] tackles the over-estimation problem in DQN: the greedy policy is evaluated according to the online network, while the target network is used to estimate its value. It was shown that this algorithm not only yields more accurate value estimates, but also leads to much higher scores on several games.
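The decoupling of action selection and evaluation in Double DQN amounts to a one-line change in the bootstrap target; the sketch below shows both targets side by side, using small Q-value arrays as stand-ins for network outputs.

```python
import numpy as np

gamma = 0.99

def dqn_target(reward, q_target_next, done):
    """Standard DQN: the target network both selects and evaluates the next action."""
    return reward + (1.0 - done) * gamma * np.max(q_target_next)

def double_dqn_target(reward, q_online_next, q_target_next, done):
    """Double DQN: the online network selects the action, the target network evaluates it."""
    a_star = int(np.argmax(q_online_next))
    return reward + (1.0 - done) * gamma * q_target_next[a_star]

# Toy usage: hypothetical Q-values for 3 actions in the next state.
q_online_next = np.array([1.0, 2.5, 0.3])
q_target_next = np.array([0.9, 1.1, 4.0])
print(dqn_target(1.0, q_target_next, done=0.0))                        # bootstraps from 4.0
print(double_dqn_target(1.0, q_online_next, q_target_next, done=0.0))  # bootstraps from 1.1
```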

In the dueling network architecture [45] the state value function and an associated advantage function are estimated separately, and then combined to estimate the action value function. The advantage of the dueling architecture lies partly in its ability to learn the state-value function efficiently. In a single-stream architecture only the value for one of the actions is updated; in the dueling architecture, the value stream is updated with every update, allowing for a better approximation of the state values, which in turn need to be accurate for temporal difference methods like Q-learning.

DRQN [46] applies a modification to the DQN by combining a Long Short Term Memory (LSTM) with a Deep Q-Network. Accordingly, DRQN is capable of integrating information across frames to detect information such as the velocity of objects. DRQN was shown to generalise its policies to the case of complete observations, and when trained on Atari games and evaluated against flickering games, DRQN was shown to generalise better than DQN.

IV. EXTENSIONS TO REINFORCEMENT LEARNING

This section introduces and discusses some of the main extensions to the basic single-agent RL paradigms which have been introduced over the years. As well as broadening the applicability of RL algorithms, many of the extensions discussed here have been demonstrated to improve scalability, learning speed and/or converged performance in complex problem domains.

A. Reward shaping

As noted in Section III, the design of the reward function is crucial: RL agents seek to maximise the return from the reward function, therefore the optimal policy for a domain is defined with respect to the reward function. In many real-world application domains, learning may be difficult due to sparse and/or delayed rewards. RL agents typically learn how to act in their environment guided merely by the reward signal. Additional knowledge can be provided to a learner by the addition of a shaping reward to the reward naturally received from the environment, with the goal of improving learning speed and converged performance. This principle is referred to as reward shaping. The term shaping has its origins in the field of experimental psychology, and describes the idea of rewarding all behaviour that leads to the desired behaviour. Skinner [47] discovered while training a rat to push a lever that any movement in the direction of the lever had to be rewarded to encourage the rat to complete the task. Analogously to the rat, an RL agent may take an unacceptably long time to discover its goal when learning from delayed rewards, and shaping offers an opportunity to speed up the learning process. Reward shaping allows a reward function to be engineered in a way that provides a more frequent feedback signal on appropriate behaviours [48], which is especially useful in domains with sparse rewards. Generally, the return from the reward function is modified as follows: r′ = r + f, where r is the return from the original reward function R, f is the additional reward from a shaping function F, and r′ is the signal given to the agent by the augmented reward function R′. Empirical evidence has shown that reward shaping can be a powerful tool to improve the learning speed of RL agents [49]. However, it can have unintended consequences. The implication of adding a shaping reward is that a policy which is optimal for the augmented reward function R′ may not in fact also be optimal for the original reward function R. A classic example of reward shaping gone wrong for this exact reason is reported in [49], where the bicycle agent in the experiment would ride in circles to stay upright rather than reach its goal. Difference rewards (D) [50] and potential-based reward shaping (PBRS) [51] are two commonly used shaping approaches. Both D and PBRS have been successfully applied to a wide range of application domains and have the added benefit of convenient theoretical guarantees, meaning that they do not suffer from the same issues as the unprincipled reward shaping approaches described above (see e.g. [51]–[55]).
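Potential-based reward shaping has a particularly simple form, F(s, s′) = γΦ(s′) − Φ(s) for a potential function Φ [51]. The snippet below illustrates it with an assumed distance-to-goal potential.

```python
import numpy as np

gamma = 0.99

def potential(state, goal):
    """Assumed potential: negative Euclidean distance to the goal (higher is better)."""
    return -float(np.linalg.norm(np.asarray(state) - np.asarray(goal)))

def shaped_reward(r, state, next_state, goal):
    """Potential-based shaping: r' = r + gamma * phi(s') - phi(s), which preserves optimal policies."""
    f = gamma * potential(next_state, goal) - potential(state, goal)
    return r + f

# Toy usage: moving closer to the goal yields a positive shaping bonus even when r = 0.
goal = (5.0, 5.0)
print(round(shaped_reward(0.0, state=(0.0, 0.0), next_state=(1.0, 1.0), goal=goal), 3))
print(round(shaped_reward(0.0, state=(1.0, 1.0), next_state=(0.0, 0.0), goal=goal), 3))
```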

B. Multi-agent reinforcement learning (MARL)

In multi-agent reinforcement learning, multiple RL agents are deployed into a common environment. The single-agent MDP framework becomes inadequate when multiple autonomous agents act simultaneously in the same domain. Instead, the more general stochastic game (SG) may be used in the case of a Multi-Agent System (MAS) [56]. An SG is defined as a tuple < S, A_{1...N}, T, R_{1...N} >, where N is the number of agents, S is the set of system states, A_i is the set of actions for agent i (and A is the joint action set), T is the transition function, and R_i is the reward function for agent i. The SG looks very similar to the MDP framework, apart from the addition of multiple agents. In fact, for the case of N = 1 an SG becomes an MDP. The next system state and the rewards received by each agent depend on the joint action a of all of the agents in an SG, where a is derived from the combination of the individual actions a_i for each agent in the system. Each agent may have its own local state perception s_i, which is different to the system state s (i.e. individual agents are not assumed to have full observability of the system). Note also that each agent may receive a different reward for the same system state transition, as each agent has its own separate reward function R_i. In an SG, the agents may all have the same goal (collaborative SG), totally opposing goals (competitive SG), or there may be elements of collaboration and competition between agents (mixed SG). Whether RL agents in a MAS will learn to act together or at cross-purposes depends on the reward scheme used for a specific application.

C. Multi-objective reinforcement learning

In multi-objective reinforcement learning (MORL) the reward signal is a vector, where each component represents the performance on a different objective. The MORL framework was developed to handle sequential decision making problems where tradeoffs between conflicting objective functions must be considered. Examples of real-world problems with multiple objectives include selecting energy sources (tradeoffs between fuel cost and emissions) [57] and watershed management (tradeoffs between generating electricity, preserving reservoir levels and supplying drinking water) [58]. Solutions to MORL problems are often evaluated using the concept of Pareto dominance [59], and MORL algorithms typically seek to learn or approximate the set of non-dominated solutions. MORL problems may be defined using the MDP or SG framework as appropriate, in a similar manner to single-objective problems. The main difference lies in the definition of the reward function: instead of returning a single scalar value r, the reward function R in multi-objective domains returns a vector r consisting of the rewards for each individual objective c ∈ C. Therefore, a regular MDP or SG can be extended to a Multi-Objective MDP (MOMDP) or Multi-Objective SG (MOSG) by modifying the return of the reward function. For a more complete overview of MORL beyond the brief summary presented in this section, the interested reader is referred to recent surveys [60], [61].
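Pareto dominance between reward vectors, used above to compare MORL solutions, can be checked in a few lines; the candidate reward vectors below are illustrative.

```python
import numpy as np

def dominates(u, v):
    """Return True if reward vector u Pareto-dominates v: u >= v everywhere and > somewhere."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return bool(np.all(u >= v) and np.any(u > v))

# Toy usage: three candidate policies scored on (speed, comfort, fuel economy).
candidates = {"A": (0.9, 0.5, 0.4), "B": (0.8, 0.7, 0.6), "C": (0.7, 0.6, 0.5)}
non_dominated = [name for name, r in candidates.items()
                 if not any(dominates(other, r) for o, other in candidates.items() if o != name)]
print(non_dominated)   # C is dominated by B; A and B remain on the Pareto front
```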

D. State Representation Learning (SRL)

State Representation Learning refers to feature extraction and dimensionality reduction to represent the state space with its history, conditioned by the actions and environment of the agent. A complete review of SRL for control is discussed in [62]. In its simplest form, SRL maps a high dimensional vector o_t into a small dimensional latent space s_t. The inverse operation decodes the state back into an estimate of the original observation o_t. The agent then learns to map from the latent space to the action. Training the SRL chain is unsupervised in the sense that no labels are required. Reducing the dimension of the input effectively simplifies the task as it removes noise and decreases the domain's size, as shown in [63]. SRL could be a simple auto-encoder (AE), though various methods exist for observation reconstruction such as Variational Auto-Encoders (VAE) or Generative Adversarial Networks (GANs), as well as forward models for predicting the next state or inverse models for predicting the action given a transition. A good learned state representation should be Markovian; i.e. it should encode all necessary information to be able to select an action based on the current state only, and not any previous states or actions [62], [64].

E. Learning from Demonstrations

Learning from Demonstrations (LfD) is used by humans to acquire new skills in an expert-to-learner knowledge transmission process. LfD is important for initial exploration where reward signals are too sparse or the input domain is too large to cover. In LfD, an agent learns to perform a task from demonstrations, usually in the form of state-action pairs, provided by an expert without any feedback rewards. However, high quality and diverse demonstrations are hard to collect, leading to learning sub-optimal policies. Accordingly, learning merely from demonstrations can be used to initialize the learning agent with a good or safe policy, and then reinforcement learning can be conducted to enable the discovery of a better policy by interacting with the environment. Combining demonstrations and reinforcement learning has been conducted in recent research. AlphaGo [41], which combines a search tree with deep neural networks, initializes the policy network by supervised learning on state-action pairs provided by recorded games played by human experts. Additionally, a value network is trained to tell how desirable a board state is. By conducting self-play and reinforcement learning, AlphaGo is able to discover new stronger actions and learn from its mistakes, achieving super human performance. More recently, AlphaZero [65], developed by the same team, proposed a general framework for self-play models. AlphaZero is trained entirely using reinforcement learning and self play, starting from completely random play, and requires no prior knowledge of human players. AlphaZero taught itself from scratch how to master the games of chess, shogi, and Go, beating a world-champion program in each case. In [66] it is shown that, given an initial demonstration, no explicit exploration is necessary and near-optimal performance can be attained. Measuring the divergence between the current policy and the expert policy for optimization is proposed in [67]. DQfD [68] pre-trains the agent and uses expert demonstrations by adding them into the replay buffer with additional priority. Moreover, a training framework that combines learning from both demonstrations and reinforcement learning is proposed in [69] for fast learning agents. Two policies close to maximizing the reward function can still have large differences in behaviour. To avoid degenerating into a solution which would fit the reward but not the original behaviour, the authors of [70] proposed a method for enforcing that the optimal policy learnt over the rewards should still match the observed policy in behavior. Behavior Cloning (BC) is applied as supervised learning that maps states to actions based on demonstrations provided by an expert. On the other hand, Inverse Reinforcement Learning (IRL) is about inferring the reward function that justifies the demonstrations of the expert. IRL is the problem of extracting a reward function given observed, optimal behavior [71]. A key motivation is that the reward function provides a succinct and robust definition of a task. Generally, IRL algorithms can be expensive to run, requiring reinforcement learning in an inner loop between cost estimation and policy training and evaluation. Generative Adversarial Imitation Learning (GAIL) [72] introduces a way to avoid this expensive inner loop. In practice, GAIL trains a policy close enough to the expert policy to fool a discriminator. This process is similar to GANs [73], [74]. The resulting policy must travel the same MDP states as the expert, or the discriminator would pick up the differences. The theory behind GAIL is an equation simplification: qualitatively, if IRL is going from demonstrations to a cost function and RL from a cost function to a policy, then we should altogether be able to go from demonstration to policy in a single equation while avoiding the cost function estimation.
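As a hedged illustration of behavior cloning as supervised learning over expert state-action pairs, the sketch below fits a linear steering policy by least squares; the synthetic "expert" data and state features are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic expert demonstrations: states are (lane_offset, heading_error),
# and the assumed expert steers with a simple proportional rule plus noise.
states = rng.uniform(-1.0, 1.0, size=(500, 2))
expert_actions = -0.8 * states[:, 0] - 0.4 * states[:, 1] + 0.01 * rng.normal(size=500)

# Behavior cloning: regress actions on states (least squares = supervised learning).
X = np.hstack([states, np.ones((len(states), 1))])      # add a bias feature
weights, *_ = np.linalg.lstsq(X, expert_actions, rcond=None)

def bc_policy(state):
    """Cloned policy: linear map from state features to a steering command."""
    return float(np.append(state, 1.0) @ weights)

print(np.round(weights, 3))                # should recover roughly (-0.8, -0.4, 0.0)
print(round(bc_policy([0.3, -0.1]), 3))
```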

V. REINFORCEMENT LEARNING FOR AUTONOMOUS DRIVING TASKS

Autonomous driving tasks where RL could be applied include: controller optimization, path planning and trajectory optimization, motion planning and dynamic path planning, development of high-level driving policies for complex navigation tasks, scenario-based policy learning for highways, intersections, merges and splits, reward learning with inverse reinforcement learning from expert data for intent prediction of traffic actors such as pedestrians and vehicles, and finally learning of policies that ensure safety and perform risk estimation. Before discussing the applications of DRL to AD tasks we briefly review the state space, action space and reward schemes in the autonomous driving setting.

A. State Spaces, Action Spaces and Rewards

To successfully apply DRL to autonomous driving tasks, designing appropriate state spaces, action spaces, and reward functions is important. Leurent et al. [75] provided a comprehensive review of the different state and action representations which are used in autonomous driving research. Commonly used state space features for an autonomous vehicle include: position, heading and velocity of the ego-vehicle, as well as other obstacles in the sensor view extent of the ego-vehicle. To avoid variations in the dimension of the state space, a Cartesian or polar occupancy grid around the ego vehicle is frequently employed. This is further augmented with lane information such as lane number (ego-lane or others), path curvature, past and future trajectory of the ego-vehicle, longitudinal information such as Time-to-collision (TTC), and finally scene information such as traffic laws and signal locations.

Using raw sensor data such as camera images, LiDAR, radar, etc. provides the benefit of finer contextual information, while using condensed abstracted data reduces the complexity of the state space. In between, a mid-level representation such as a 2D bird's eye view (BEV) is sensor agnostic but still close to the spatial organization of the scene. Fig. 4 is an illustration of a top down view showing an occupancy grid, past and projected trajectories, and semantic information about the scene such as the position of traffic lights. This intermediary format retains the spatial layout of roads, which graph-based representations would not. Some simulators offer this view, such as CARLA or Flow (see Table V-C).

A vehicle policy must control a number of different actuators. Continuous-valued actuators for vehicle control include steering angle, throttle and brake. Other actuators such as gear changes are discrete. To reduce complexity and allow the application of DRL algorithms which work with discrete action spaces only (e.g. DQN), an action space may be discretised uniformly by dividing the range of continuous actuators such as steering angle, throttle and brake into equal-sized bins (see Section VI-C). Discretisation in log-space has also been suggested, as many steering angles which are selected in practice are close to the centre [76]. Discretisation does have disadvantages however; it can lead to jerky or unstable trajectories if the step values between actions are too large. Furthermore, when selecting the number of bins for an actuator there is a trade-off between having enough discrete steps to allow for smooth control, and not having so many steps that action selections become prohibitively expensive to evaluate. As an alternative to discretisation, continuous values for actuators may also be handled by DRL algorithms which learn a policy directly (e.g. DDPG). Temporal abstractions (the options framework [77]) may also be employed to simplify the process of selecting actions, where agents select options instead of low-level actions. These options represent a sub-policy that could extend a primitive action over multiple time steps.
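The uniform and log-space steering discretisations mentioned above can be generated as in the sketch below; the bin counts and steering range are assumptions for illustration.

```python
import numpy as np

max_steer = 0.5                       # assumed steering range in radians: [-0.5, 0.5]

# Uniform discretisation: equal-sized bins across the full range.
uniform_bins = np.linspace(-max_steer, max_steer, num=9)

# Log-space discretisation: more bins near the centre, where most steering happens.
half = np.logspace(-2, 0, num=4) * max_steer       # magnitudes from ~0.005 to 0.5
log_bins = np.concatenate([-half[::-1], [0.0], half])

def nearest_bin(value, bins):
    """Map a continuous command to the closest discrete action."""
    return bins[np.argmin(np.abs(bins - value))]

print(np.round(uniform_bins, 3))
print(np.round(log_bins, 3))
print(nearest_bin(0.07, uniform_bins), nearest_bin(0.07, log_bins))
```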

Designing reward functions for DRL agents for autonomous driving is still very much an open question. Examples of criteria for AD tasks include: distance travelled towards a destination [78], speed of the ego vehicle [78]–[80], keeping the ego vehicle at a standstill [81], collisions with other road users or scene objects [78], [79], infractions on sidewalks [78], keeping in lane, maintaining comfort and stability while avoiding extreme acceleration, braking or steering [80], [81], and following traffic rules [79]. A minimal weighted-sum sketch combining such terms is given below.
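As an illustration only (the weights, term names and the `info` interface are hypothetical, not taken from the cited works), such criteria are often combined into a weighted scalar reward:

```python
def driving_reward(info, w_progress=1.0, w_speed=0.1, w_collision=100.0,
                   w_lane=1.0, w_jerk=0.5):
    """Hypothetical weighted combination of the reward criteria listed above.

    `info` is assumed to expose per-step quantities from the simulator:
    distance progressed (m), current speed (m/s), a collision flag,
    lateral offset from the lane centre (m) and longitudinal jerk (m/s^3).
    The weights are illustrative and would need tuning per task.
    """
    r = 0.0
    r += w_progress * info["progress"]            # distance towards destination
    r += w_speed * info["speed"]                  # encourage maintaining speed
    r -= w_collision * float(info["collision"])   # large penalty on collision
    r -= w_lane * abs(info["lane_offset"])        # keep in lane
    r -= w_jerk * abs(info["jerk"])               # comfort: avoid harsh actuation
    return r
```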

B. Motion Planning & Trajectory optimization

Motion planning is the task of ensuring the existence of a path between target and destination points. This is necessary to plan trajectories for vehicles over prior maps, usually augmented with semantic information. Path planning in dynamic environments and under varying vehicle dynamics is a key problem in autonomous driving, for example negotiating the right to pass through an intersection [87] or merging onto highways. Recent work [89] contains real-world motions by various traffic actors, observed in diverse interactive driving scenarios. Recently, the authors of [90] demonstrated an application of DRL (DDPG) for AD using a full-sized autonomous vehicle. The system was first trained in simulation, before being trained in real time using on-board computers, and was able to learn to follow a lane, successfully completing a real-world trial on a 250 metre section of road. Model-based deep RL algorithms have been proposed for learning models and policies directly from raw pixel inputs [91], [92]. In [93], deep neural networks have been used to generate predictions in simulated environments over hundreds of time steps. RL is also suitable for control: classical optimal control methods such as the Linear Quadratic Regulator (LQR) in linear regimes and iterative LQR (iLQR) in non-linear regimes are compared with RL methods in [94]. A recent study [95] demonstrates that random search over the parameters of a policy network can perform as well as LQR. A minimal LQR sketch is given below for reference.
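For reference, the sketch below shows the discrete-time LQR baseline that RL controllers are often compared against, applied to a toy double-integrator model standing in for simplified longitudinal dynamics; the dynamics and cost matrices are illustrative, not taken from [94] or [95]:

```python
import numpy as np

def lqr_gain(A, B, Q, R, horizon=100):
    """Finite-horizon discrete-time LQR via the backward Riccati recursion.

    Returns the feedback gain K such that u_t = -K x_t minimises
    sum_t (x_t^T Q x_t + u_t^T R u_t); for a long enough horizon this
    converges to the stationary gain.
    """
    P = Q.copy()
    for _ in range(horizon):
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        P = Q + A.T @ P @ (A - B @ K)
    return K

# Toy double integrator (position, velocity) with acceleration input,
# a stand-in for simplified longitudinal vehicle dynamics.
dt = 0.1
A = np.array([[1.0, dt], [0.0, 1.0]])
B = np.array([[0.0], [dt]])
Q = np.diag([1.0, 0.1])
R = np.array([[0.01]])
K = lqr_gain(A, B, Q, R)

x = np.array([5.0, 0.0])   # 5 m from the set-point, at rest
for _ in range(50):
    u = -K @ x             # linear state-feedback control
    x = A @ x + B @ u
print(x)                   # state driven towards the origin
```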

C. Simulator & Scenario generation tools

Autonomous driving datasets address the supervised learning setup with training sets containing image-label pairs for various modalities. Reinforcement learning, in contrast, requires an environment where state-action pairs can be recovered while modelling the dynamics of the vehicle state and environment, as well as the stochasticity in the movement and actions of the environment and agent respectively. Various simulators are actively used for training and validating reinforcement learning algorithms. Table II summarises various high fidelity perception simulators capable of simulating cameras, LiDARs and radar. Some simulators are also capable of providing the vehicle state and dynamics. A complete review of sensors and simulators utilised within the autonomous driving community is available in [105]. Learned driving policies are stress tested in simulated environments before moving on to costly evaluations in the real world. A multi-fidelity reinforcement learning (MFRL) framework is proposed in [106], where a cascade of simulators with increasing fidelity in representing state dynamics (and thus increasing computational cost) enables the training and validation of RL algorithms, while finding near-optimal policies for the real world with fewer expensive real-world samples, demonstrated using a remote-controlled car. The CARLA Challenge [107] is a CARLA simulator based AD competition with pre-crash scenarios characterized in a National Highway Traffic Safety Administration report [108]. The systems are evaluated in critical scenarios such as: ego-vehicle loses control, ego-vehicle reacts to an unseen obstacle, lane change to evade a slow leading vehicle, among others. Agents are scored as a function of the aggregated distance travelled in different circuits, with points deducted for infractions. A minimal simulator interaction loop is sketched below.
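The sketch below shows a minimal rollout loop for one of the gym-style simulators from Table II; it assumes the highway-env package and the classic gym reset/step interface, and exact signatures may differ between gym and gymnasium versions:

```python
# Minimal rollout loop for a gym-style driving environment (API may vary
# between gym and gymnasium releases; this follows the classic 4-tuple step).
import gym
import highway_env  # registers "highway-v0" and related environments

env = gym.make("highway-v0")
obs = env.reset()
done, total_reward = False, 0.0
while not done:
    action = env.action_space.sample()       # placeholder for a learned policy
    obs, reward, done, info = env.step(action)
    total_reward += reward
print("episode return:", total_reward)
```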

D. LfD and IRL for AD applications

Early work on Behavior Cloning (BC) for driving cars in [109], [110] presented agents that learn from demonstrations (LfD) by trying to mimic the behavior of an expert. BC is typically implemented as supervised learning, and accordingly it is hard for BC to adapt to new, unseen situations. An architecture for learning a convolutional neural network end to end for the self-driving car domain was proposed in [111], [112]. The CNN is trained to map raw pixels from a single front-facing camera directly to steering commands. Using a relatively small training dataset from humans/experts, the system learns to drive in traffic on local roads with or without lane markings and on highways. The network learns image representations that detect the road successfully, without being explicitly trained to do so. The authors of [113] proposed to learn comfortable driving trajectory optimization using expert demonstrations from human drivers with Maximum Entropy Inverse RL. The authors of [114] used DQN as the refinement step in IRL to extract the rewards, in an effort to learn human-like lane change behavior. A behavior-cloning sketch in this spirit is given below.
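The following is a minimal behavior-cloning sketch in the spirit of the end-to-end steering approach described above; the architecture, layer sizes and the `expert_loader` interface are illustrative and do not reproduce the network of [111]:

```python
# Minimal behavior-cloning sketch: a small CNN maps a front-camera image to a
# steering command and is fit by regression on expert (image, steering) pairs.
import torch
import torch.nn as nn

class SteeringCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 24, 5, stride=2), nn.ReLU(),
            nn.Conv2d(24, 36, 5, stride=2), nn.ReLU(),
            nn.Conv2d(36, 48, 5, stride=2), nn.ReLU(),
            nn.Conv2d(48, 64, 3), nn.ReLU(),
            nn.Flatten(),
        )
        self.head = nn.Sequential(
            nn.LazyLinear(100), nn.ReLU(),
            nn.Linear(100, 1),          # predicted steering angle
        )

    def forward(self, img):
        return self.head(self.features(img))

model = SteeringCNN()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()

def train_epoch(expert_loader):
    """`expert_loader` is assumed to yield (image, steering) batches from logs."""
    for img, steer in expert_loader:
        pred = model(img)
        loss = loss_fn(pred.squeeze(-1), steer)
        opt.zero_grad()
        loss.backward()
        opt.step()
```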

VI. REAL WORLD CHALLENGES AND FUTURE PERSPECTIVES

In this section, challenges for conducting reinforcement learning for real-world autonomous driving are presented and discussed, along with related research approaches for solving them.

A. Validating RL systems

Henderson et al. [115] described challenges in validating reinforcement learning methods, focusing on policy gradient methods for continuous control such as PPO, DDPG and TRPO, as well as in reproducing benchmarks. They demonstrate with real examples that implementations often have varying code-bases and different hyper-parameter values, and that unprincipled ways of estimating the top-k rollouts can lead to incoherent interpretations of the performance of reinforcement learning algorithms, and furthermore of how well they generalize. The authors concluded that evaluation should be performed either on a well-defined common setup or on real-world tasks. The authors of [116] proposed automated generation of challenging and rare driving scenarios in high-fidelity photo-realistic simulators. These adversarial scenarios are automatically discovered by parameterising the behavior of pedestrians and other vehicles on the road. Moreover, it is shown that adding these scenarios to the training data of imitation learning increases safety. A sketch of such a scenario search is shown below.
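A rough sketch of such a parameterised scenario search is given below; the scenario parameters, ranges and the surrogate safety score are all hypothetical, standing in for an actual simulator rollout as in [116]:

```python
# Hedged sketch of adversarial scenario search: scenarios are parameterised
# (here by pedestrian and lead-vehicle behavior), evaluated against a fixed
# driving policy, and the worst performers are kept for retraining.
import random

def sample_scenario():
    return {"ped_cross_time": random.uniform(0.5, 5.0),     # s after episode start
            "ped_speed": random.uniform(0.5, 3.0),           # m/s
            "lead_vehicle_decel": random.uniform(1.0, 8.0)}  # m/s^2

def evaluate(policy, scenario):
    """Placeholder: in practice, roll out `policy` in the simulator under
    `scenario` and return a safety score such as the minimum TTC observed.
    A toy surrogate is used here so the sketch runs standalone."""
    return scenario["ped_cross_time"] / scenario["ped_speed"]

def find_adversarial_scenarios(policy, n_samples=1000, k=20):
    scenarios = [sample_scenario() for _ in range(n_samples)]
    scored = [(evaluate(policy, s), s) for s in scenarios]
    scored.sort(key=lambda t: t[0])        # lowest safety score = most adversarial
    return [s for _, s in scored[:k]]
```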


Fig. 4. Bird's Eye View (BEV) 2D representation of a driving scene. Left: an occupancy grid. Right: combination of semantic information (traffic lights) with past (red) and projected (green) trajectories. The ego car is represented by a green rectangle in both images.

AD Task: Lane Keep
(D)RL method & description: 1. Authors [82] propose a DRL system for discrete actions (DQN) and continuous actions (DDAC) using the TORCS simulator (see Table II). 2. Authors [83] learn discretised and continuous policies using DQNs and Deep Deterministic Actor Critic (DDAC) to follow the lane and maximize average velocity.
Improvements & Tradeoffs: 1. This study concludes that using continuous actions provides smoother trajectories, though on the negative side they lead to more restricted termination conditions and slower convergence time to learn. 2. Removing memory replay in DQNs helps faster convergence and better performance. The one-hot encoding of the action space resulted in abrupt steering control, while DDAC's continuous policy helps smooth the actions and provides better performance.

AD Task: Lane Change
(D)RL method & description: Authors [84] use Q-learning to learn a policy for the ego-vehicle to perform no operation, lane change to left/right, accelerate or decelerate.
Improvements & Tradeoffs: This approach is more robust compared to traditional approaches, which consist in defining fixed way points, velocity profiles and the curvature of the path to be followed by the ego-vehicle.

AD Task: Ramp Merging
(D)RL method & description: Authors [85] propose recurrent architectures, namely LSTMs, to model long-term dependencies for ego vehicles merging onto a highway.
Improvements & Tradeoffs: Past history of the state information is used to perform the merge more robustly.

AD Task: Overtaking
(D)RL method & description: Authors [86] propose a multi-goal RL policy, learnt by Q-Learning or Double Action Q-Learning (DAQL), which determines individual action decisions based on whether the other vehicle interacts with the agent for that particular goal.
Improvements & Tradeoffs: Improved speed for lane keeping and overtaking with collision avoidance.

AD Task: Intersections
(D)RL method & description: Authors [87] use DQN to evaluate the Q-value for state-action pairs to negotiate intersections.
Improvements & Tradeoffs: The Creep-Go actions defined by the authors enable the vehicle to maneuver intersections with restricted space and visibility more safely.

AD Task: Motion Planning
(D)RL method & description: Authors [88] propose an improved A* algorithm to learn a heuristic function using deep neural networks over an image-based input obstacle map.
Improvements & Tradeoffs: Smooth control behavior of the vehicle and better performance compared to multi-step DQN.

TABLE I: List of AD tasks that require (D)RL to learn a policy or behavior.

B. Bridging the simulation-reality gap

Simulation-to-real-world transfer learning is an active domain, since simulations are a source of large amounts of cheap data with perfect annotations. The authors of [117] train a robot arm to grasp objects in the real world by performing domain adaptation from simulation to reality, at both feature level and pixel level. Their vision-based grasping system achieved comparable performance with 50 times fewer real-world samples. The authors of [118] randomized the dynamics of the simulator during training. The resulting policies were capable of generalising to different dynamics without requiring retraining on the real system (see the sketch below). In the domain of autonomous driving, the authors of [119] train an A3C agent using simulation-to-real translated images of the driving environment, after which the trained policy was evaluated on a real-world driving dataset.
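A minimal sketch of dynamics randomization in the spirit of [118] is shown below; the simulator interface (`sim.set_param`, `sim.reset`, `sim.step`), the agent interface and the parameter ranges are hypothetical:

```python
# At the start of every training episode the simulator's physical parameters
# are resampled, so the learned policy cannot overfit to one specific
# dynamics model. Parameter names and ranges are illustrative only.
import random

def randomize_dynamics(sim):
    sim.set_param("vehicle_mass",     random.uniform(1200.0, 1800.0))  # kg
    sim.set_param("tyre_friction",    random.uniform(0.6, 1.0))
    sim.set_param("steering_delay",   random.uniform(0.0, 0.2))        # s
    sim.set_param("sensor_noise_std", random.uniform(0.0, 0.05))

def train(agent, sim, episodes=1000):
    for _ in range(episodes):
        randomize_dynamics(sim)            # new dynamics every episode
        obs = sim.reset()
        done = False
        while not done:
            action = agent.act(obs)
            obs, reward, done, _ = sim.step(action)
            agent.observe(obs, reward, done)
```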

The authors of [120] addressed the issue of performing imitation learning in simulation such that it transfers well to images from the real world. They achieved this by unsupervised domain translation between simulated and real-world images, which enables learning the prediction of steering in the real-world domain with only ground truth from the simulated domain. The authors remark that there were no pairwise correspondences between images in the simulated training set and the unlabelled real-world image set. Similarly, [121] performs domain adaptation to map real-world images to simulated images. In contrast to sim-to-real methods, they handle the reality gap during deployment of agents in real scenarios, by adapting the real camera streams to the synthetic modality, so as to map the unfamiliar or unseen features of real images back into the simulated environment and states on which the agents have already learnt a policy in simulation.

C. Sample efficiency

Animals are usually able to learn new tasks in just a few trials, benefiting from their prior knowledge about the environment. However, one of the key challenges for reinforcement learning is sample efficiency: the learning process requires too many samples to learn a reasonable policy. This issue becomes more noticeable when the collection of valuable experience is expensive or even risky to acquire. In the case of robot control and autonomous driving, sample efficiency is a difficult issue due to the delayed and sparse rewards found in typical settings, along with an unbalanced distribution of observations in a large state space.

Simulator: Description
CARLA [78]: Urban simulator with camera & LIDAR streams, depth & semantic segmentation, and location information.
TORCS [96]: Racing simulator with camera stream and agent positions, used for testing control policies for vehicles.
AIRSIM [97]: Camera stream with depth and semantic segmentation; support for drones.
GAZEBO (ROS) [98]: Multi-robot physics simulator employed for path planning & vehicle control in complex 2D & 3D maps.
SUMO [99]: Macro-scale modelling of traffic in cities, used with motion planning simulators.
DeepDrive [100]: Driving simulator based on Unreal, providing a multi-camera (eight) stream with depth.
Constellation [101]: NVIDIA DRIVE Constellation simulates camera, LIDAR & radar for AD (proprietary).
MADRaS [102]: Multi-Agent Autonomous Driving Simulator built on top of TORCS.
Flow [103]: Multi-agent traffic control simulator built on top of SUMO.
Highway-env [104]: A gym-based environment that provides a simulator for highway-based road topologies.
Carcraft: Waymo's simulation environment (proprietary).

TABLE II: Simulators for RL applications in advanced driving assistance systems (ADAS) and autonomous driving.

Reward shaping enables the agent to learn intermediate goals by designing a more frequent reward function, encouraging the agent to learn faster from fewer samples. The authors of [122] design a second "trauma" replay memory that contains only collision situations, in order to pool positive and negative experiences at each training step; a minimal sketch of such a dual replay memory is given below.
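Below is a minimal sketch of such a dual replay memory; buffer capacities and the mixing ratio are illustrative and not the values used in [122]:

```python
# A standard buffer plus a separate "trauma" buffer containing only collision
# transitions; every training batch draws a fixed share from each.
import random
from collections import deque

class DualReplay:
    def __init__(self, capacity=100_000, trauma_capacity=10_000, trauma_frac=0.25):
        self.main = deque(maxlen=capacity)           # ordinary transitions
        self.trauma = deque(maxlen=trauma_capacity)  # collision transitions only
        self.trauma_frac = trauma_frac

    def add(self, transition, collided):
        (self.trauma if collided else self.main).append(transition)

    def sample(self, batch_size):
        # Draw a fixed share of the batch from the trauma memory, clamped to
        # whatever is currently available in either buffer.
        n_trauma = min(int(batch_size * self.trauma_frac), len(self.trauma))
        n_main = min(batch_size - n_trauma, len(self.main))
        return (random.sample(list(self.trauma), n_trauma) +
                random.sample(list(self.main), n_main))
```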

IL-bootstrapped RL: Further efficiency can be achieved when the agent first learns an initial policy offline by performing imitation learning from roll-outs provided by an expert. Following this, the agent can self-improve by applying RL while interacting with the environment.

Actor Critic with Experience Replay (ACER) [123] is a sample-efficient policy gradient algorithm that makes use of a replay buffer, enabling it to perform more than one gradient update using each piece of sampled experience, combined with a trust region policy optimization method.

Transfer learning is another approach to sample efficiency, which enables the reuse of a previously trained policy for a source task to initialize the learning of a target task. Policy composition, presented in [124], proposes composing previously learned basis policies so that they can be reused for a novel task, which leads to faster learning of new policies. A survey on transfer learning in RL is presented in [125]. The multi-fidelity reinforcement learning (MFRL) framework [106] was shown to transfer heuristics to guide exploration in high-fidelity simulators and find near-optimal policies for the real world with fewer real-world samples. The authors of [126] transferred policies learnt to handle simulated intersections to real-world examples between DQN agents.

Meta-learning algorithms enable agents to adapt to new tasks and learn new skills rapidly from small amounts of experience, benefiting from their prior knowledge about the world. The authors of [127] addressed this issue by training a recurrent neural network on a training set of interrelated tasks, where the network input includes the action selected as well as the reward received in the previous time step. Accordingly, the agent is trained to learn to exploit the structure of the problem dynamically and to solve new problems by adjusting its hidden state. A similar approach for designing RL algorithms is presented in [128]: rather than hand-designing a "fast" reinforcement learning algorithm, it is represented as a recurrent neural network and learned from data. In Model-Agnostic Meta-Learning (MAML), proposed in [129], the meta-learner seeks an initialisation for the parameters of a neural network that can be adapted quickly to a new task using only a few examples. Reptile [130] is a similar, first-order method (sketched below). The authors of [131] present a simple gradient-based meta-learning algorithm.
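The sketch below illustrates the first-order meta-learning update in the style of Reptile [130]; it is a simplified illustration (the task-sampling interface is hypothetical), not the exact algorithm of the paper:

```python
# For each meta-iteration, adapt a copy of the parameters to one sampled task
# with a few SGD steps, then move the meta-parameters a small step towards the
# adapted parameters.
import copy
import torch

def reptile_step(model, sample_task, inner_steps=5, inner_lr=1e-2, meta_lr=0.1):
    task_loss_fn = sample_task()          # assumed to return a callable loss(model)
    fast = copy.deepcopy(model)           # task-specific copy of the meta-parameters
    opt = torch.optim.SGD(fast.parameters(), lr=inner_lr)
    for _ in range(inner_steps):          # inner-loop adaptation on the task
        loss = task_loss_fn(fast)
        opt.zero_grad()
        loss.backward()
        opt.step()
    with torch.no_grad():                 # outer (meta) update towards adapted weights
        for p, q in zip(model.parameters(), fast.parameters()):
            p += meta_lr * (q - p)
```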

Efficient state representations: World models proposed in [132] learn a compressed spatial and temporal representation of the environment using VAEs; a compact and simple policy is then trained directly on the compressed state representation.

D. Exploration issues with Imitation

In imitation learning, the agent makes use of trajectories provided by an expert. However, the distribution of states the expert encounters usually does not cover all the states the trained agent may encounter during testing. Furthermore, imitation assumes that the actions are independent and identically distributed (i.i.d.). One solution consists in using the Data Aggregation (DAgger) method [133], where the end-to-end learned policy is executed, the extracted observation-action pairs are labelled again by the expert, and they are aggregated into the original expert observation-action dataset. Thus, iteratively collecting training examples from both the reference and trained policies explores more valuable states and addresses this lack of exploration (a sketch is given below). Following work on Search-based Structured Prediction (SEARN) [133], Stochastic Mixing Iterative Learning (SMILe) trains a stochastic stationary policy over several iterations and then makes use of a geometric stochastic mixing of the trained policies. In a standard imitation learning scenario, the demonstrator is required to cover sufficient states so as to avoid unseen states during testing. This constraint is costly and requires frequent human intervention. More recently, ChauffeurNet [134] demonstrated the limits of imitation learning, where even 30 million state-action samples were insufficient to learn an optimal policy mapping bird's-eye-view images (states) to controls (actions). The authors propose the use of simulated examples which introduce perturbations and a higher diversity of scenarios, such as collisions and/or going off the road. The proposed network includes an agent RNN that outputs the waypoint, agent box position and heading at each iteration. The authors of [135] also identify limits of imitation learning, and train a DNN end-to-end on the ego vehicle's raw input image and the 2D and 3D locations of neighbouring vehicles to simultaneously predict the ego-vehicle action as well as neighbouring vehicle trajectories.
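A schematic version of the DAgger loop [133] is sketched below; the expert, rollout and training functions are placeholders for task-specific components:

```python
def dagger(expert, train_policy, rollout, initial_dataset, iterations=10):
    """
    expert(state) -> expert action (e.g. a human driver or planner)
    train_policy(dataset) -> policy fit by supervised learning on (state, action) pairs
    rollout(policy) -> list of states visited when executing the policy
    """
    dataset = list(initial_dataset)       # start from expert demonstrations
    policy = train_policy(dataset)
    for _ in range(iterations):
        states = rollout(policy)          # execute the learned policy
        dataset += [(s, expert(s)) for s in states]   # expert relabels visited states
        policy = train_policy(dataset)    # retrain on the aggregated dataset
    return policy
```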

E. Intrinsic Reward functions

In controlled simulated environments such as games, an explicit reward signal is given to the agent along with its sensor stream. However, in real-world robotics and autonomous driving, designing a good reward function is essential so that the desired behaviour may be learned. The most common solution has been reward shaping [136], which consists in supplying additional well-designed rewards to the agent to encourage optimization in the direction of the optimal policy. As pointed out earlier in the paper, rewards could also be estimated by inverse RL (IRL) [137], which depends on expert demonstrations. In the absence of explicit reward shaping and expert demonstrations, agents can use intrinsic rewards or intrinsic motivation [138] to evaluate whether their actions were good or not. The authors of [139] define curiosity as the error in an agent's ability to predict the consequence of its own actions in a visual feature space learned by a self-supervised inverse dynamics model. In [140], the agent learns a next-state predictor model from its experience, and uses the error of the prediction as an intrinsic reward (see the sketch below). This enables the agent to determine what could be useful behavior even without extrinsic rewards.
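A minimal sketch of a prediction-error intrinsic reward in this spirit is given below; the forward-model architecture and the scaling factor are illustrative, not the designs of [139] or [140]:

```python
# A forward model predicts the next observation (or its embedding), and the
# prediction error is used as a curiosity bonus added to the extrinsic reward.
import torch
import torch.nn as nn

class ForwardModel(nn.Module):
    def __init__(self, obs_dim, act_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, obs_dim),
        )

    def forward(self, obs, act):
        return self.net(torch.cat([obs, act], dim=-1))

def intrinsic_reward(fwd_model, obs, act, next_obs, scale=0.1):
    """Curiosity bonus = scaled squared prediction error of the forward model."""
    with torch.no_grad():
        pred = fwd_model(obs, act)
        return scale * ((pred - next_obs) ** 2).mean(dim=-1)
```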

F. Incorporating safety in DRL

Deploying an autonomous vehicle in real environments directly after training could be dangerous. Different approaches to incorporating safety into DRL algorithms are presented here. For imitation learning based systems, SafeDAgger [141] introduces a safety policy that learns to predict the error made by a primary policy trained initially with the supervised learning approach, without querying a reference policy. The additional safe policy takes both the partial observation of a state and the primary policy as inputs, and returns a binary label indicating whether the primary policy is likely to deviate from a reference policy without querying it. The authors of [142] addressed safety in multi-agent reinforcement learning for autonomous driving, where a balance is maintained between handling the unexpected behavior of other drivers or pedestrians and not being too defensive, so that normal traffic flow is achieved. While hard constraints are maintained to guarantee the safety of driving, the problem is decomposed into a composition of a policy for desires, to enable comfortable driving, and trajectory planning. Deep reinforcement learning algorithms for control such as DDPG are combined with safety-based control in [143], including the artificial potential field method that is widely used for robot path planning. Using the TORCS environment, DDPG is first applied to learn a driving policy in a stable and familiar environment, then the policy network and safety-based control are combined to avoid collisions. It was found that the combination of DRL and safety-based control performs well in most scenarios (a schematic safety-override layer is sketched below). In order to enable DRL to escape local optima, speed up the training process and avoid dangerous conditions or accidents, the Survival-Oriented Reinforcement Learning (SORL) model is proposed in [144], where survival is favored over maximizing total reward, by modeling the autonomous driving problem as a constrained MDP and introducing a Negative-Avoidance Function to learn from previous failures. The SORL model was found not to be sensitive to the reward function and can use different DRL algorithms such as DDPG. Furthermore, a comprehensive survey on safe reinforcement learning can be found in [145] for interested readers.
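As a schematic illustration of combining a learned policy with rule-based safety checks (not the specific method of [141]-[144]), a simple time-to-collision override could look as follows; the state interface and thresholds are hypothetical:

```python
def safe_action(policy, state, ttc_threshold=2.0, max_brake=-1.0):
    """Rule-based safety layer: override the learned policy when a simple
    time-to-collision (TTC) constraint is violated."""
    steer, accel = policy(state)                 # proposed (steer, accel) pair
    ttc = state["distance_to_lead"] / max(state["closing_speed"], 1e-3)
    if ttc < ttc_threshold:
        # Override with a hard braking command; steering is kept unchanged.
        return (steer, max_brake)
    return (steer, accel)
```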

G. Multi-agent reinforcement learning

Autonomous driving is a fundamentally multi-agent task; as well as the ego vehicle being controlled by an agent, there will also be many other actors present in simulated and real-world autonomous driving settings, such as pedestrians, cyclists and other vehicles. Therefore, the continued development of explicitly multi-agent approaches to learning to drive autonomous vehicles is an important future research direction. Several prior methods have already approached the autonomous driving problem from a MARL perspective, e.g. [142], [146]–[149].

One important area where MARL techniques could be very beneficial is in high-level decision making and coordination between groups of autonomous vehicles, in scenarios such as overtaking on highways [149], or negotiating intersections without signalised control. Another area where MARL approaches could be of benefit is in the development of adversarial agents for testing autonomous driving policies before deployment [148], i.e. agents controlling other vehicles in a simulation that learn to expose weaknesses in the behaviour of autonomous driving policies by acting erratically or against the rules of the road. Finally, MARL approaches could potentially have an important role to play in developing safe policies for autonomous driving [142], as discussed earlier.

VII. CONCLUSION

Reinforcement learning is still an active and emerging area in real-world autonomous driving applications. Although there are a few successful commercial applications, there is very little literature or large-scale public datasets available. Thus we were motivated to formalize and organize RL applications for autonomous driving. Autonomous driving scenarios involve interacting agents and require negotiation and dynamic decision making, which suits RL. However, there are many challenges to be resolved in order to have mature solutions, which we discuss in detail. In this work, a detailed theoretical overview of reinforcement learning is presented, along with a comprehensive literature survey on applying RL to autonomous driving tasks.

Challenges, future research directions and opportunities are discussed in Section VI. These include: validating the performance of RL-based systems, the simulation-reality gap, sample efficiency, designing good reward functions, and incorporating safety into decision-making RL systems for autonomous agents.


Framework: Description
OpenAI Baselines [150]: Set of high-quality implementations of different RL and DRL algorithms. The main goal of these baselines is to make it easier for the research community to replicate, refine and create reproducible research.
Unity ML-Agents Toolkit [151]: Implements core RL algorithms, games and simulation environments for training RL- or IL-based agents.
RL Coach [152]: Intel AI Lab's modular implementation of RL algorithms, with simple integration of new environments by extending and reusing existing components.
Tensorflow Agents [153]: RL algorithms package, including bandits, from TensorFlow.
rlpyt [154]: Implements the deep Q-learning and policy gradient algorithm families in a single Python package.
bsuite [155]: DeepMind's Behaviour Suite for Reinforcement Learning, which aims at defining metrics for RL agents and automating evaluation and analysis.

TABLE III: Open-source frameworks and packages for state-of-the-art RL/DRL algorithms and evaluation.

Reinforcement learning results are usually difficult to reproduce and are highly sensitive to hyper-parameter choices, which are often not reported in detail. Both researchers and practitioners need a reliable starting point where the well-known reinforcement learning algorithms are implemented, documented and well tested. Such frameworks are covered in Table III.

The development of explicitly multi-agent reinforcement learning approaches to the autonomous driving problem is also an important future challenge that has not received a lot of attention to date. MARL techniques have the potential to make coordination and high-level decision making between groups of autonomous vehicles easier, as well as providing new opportunities for testing and validating the safety of autonomous driving policies.

Furthermore, the implementation of RL algorithms is a challenging task for researchers and practitioners. This work presents examples of well-known and active open-source RL frameworks that provide well-documented implementations, which enable the use, evaluation and extension of different RL algorithms. Finally, we hope that this overview paper encourages further research and applications.

A2C, A3C: Advantage Actor Critic, Asynchronous A2C
BC: Behavior Cloning
DDPG: Deep DPG
DP: Dynamic Programming
DPG: Deterministic PG
DQN: Deep Q-Network
DRL: Deep RL
IL: Imitation Learning
IRL: Inverse RL
LfD: Learning from Demonstration
MAML: Model-Agnostic Meta-Learning
MARL: Multi-Agent RL
MDP: Markov Decision Process
MOMDP: Multi-Objective MDP
MOSG: Multi-Objective SG
PG: Policy Gradient
POMDP: Partially Observed MDP
PPO: Proximal Policy Optimization
QL: Q-Learning
RRT: Rapidly-exploring Random Trees
SG: Stochastic Game
SMDP: Semi-Markov Decision Process
TDL: Time Difference Learning
TRPO: Trust Region Policy Optimization

TABLE IV: Acronyms related to reinforcement learning (RL).

REFERENCES

[1] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction(Second Edition). MIT Press, 2018. 1, 4, 5, 6

[2] V. Talpaert., I. Sobh., B. R. Kiran., P. Mannion., S. Yogamani., A. El-Sallab., and P. Perez., “Exploring applications of deep reinforcementlearning for real-world autonomous driving systems,” in Proceedings ofthe 14th International Joint Conference on Computer Vision, Imagingand Computer Graphics Theory and Applications - Volume 5 VISAPP:VISAPP,, INSTICC. SciTePress, 2019, pp. 564–572. 1

[3] M. Siam, S. Elkerdawy, M. Jagersand, and S. Yogamani, “Deepsemantic segmentation for automated driving: Taxonomy, roadmap andchallenges,” in 2017 IEEE 20th international conference on intelligenttransportation systems (ITSC). IEEE, 2017, pp. 1–8. 2

[4] K. El Madawi, H. Rashed, A. El Sallab, O. Nasr, H. Kamel, andS. Yogamani, “Rgb and lidar fusion based 3d semantic segmentation forautonomous driving,” in 2019 IEEE Intelligent Transportation SystemsConference (ITSC). IEEE, 2019, pp. 7–12. 2

[5] M. Siam, H. Mahgoub, M. Zahran, S. Yogamani, M. Jagersand, andA. El-Sallab, “Modnet: Motion and appearance based moving objectdetection network for autonomous driving,” in 2018 21st InternationalConference on Intelligent Transportation Systems (ITSC). IEEE, 2018,pp. 2859–2864. 2

[6] V. R. Kumar, S. Milz, C. Witt, M. Simon, K. Amende, J. Petzold,S. Yogamani, and T. Pech, “Monocular fisheye camera depth estimationusing sparse lidar supervision,” in 2018 21st International Conferenceon Intelligent Transportation Systems (ITSC). IEEE, 2018, pp. 2853–2858. 2

[7] M. Uricár, P. Krížek, G. Sistu, and S. Yogamani, “Soilingnet: Soilingdetection on automotive surround-view cameras,” in 2019 IEEE Intel-ligent Transportation Systems Conference (ITSC). IEEE, 2019, pp.67–72. 2

[8] G. Sistu, I. Leang, S. Chennupati, S. Yogamani, C. Hughes, S. Milz, andS. Rawashdeh, “Neurall: Towards a unified visual perception model forautomated driving,” in 2019 IEEE Intelligent Transportation SystemsConference (ITSC). IEEE, 2019, pp. 796–803. 2

[9] S. Yogamani, C. Hughes, J. Horgan, G. Sistu, P. Varley, D. O’Dea,M. Uricár, S. Milz, M. Simon, K. Amende et al., “Woodscape: Amulti-task, multi-camera fisheye dataset for autonomous driving,” inProceedings of the IEEE International Conference on Computer Vision,2019, pp. 9308–9318. 2

[10] S. Milz, G. Arbeiter, C. Witt, B. Abdallah, and S. Yogamani, “Visualslam for automated driving: Exploring the applications of deep learn-ing,” in Proceedings of the IEEE Conference on Computer Vision andPattern Recognition Workshops, 2018, pp. 247–257. 2

[11] S. M. LaValle, Planning Algorithms. New York, NY, USA: CambridgeUniversity Press, 2006. 2

[12] S. M. LaValle and J. James J. Kuffner, “Randomized kinodynamicplanning,” The International Journal of Robotics Research, vol. 20,no. 5, pp. 378–400, 2001. 2

[13] T. D. Team. Dimensions publication trends. [Online]. Available:https://app.dimensions.ai/discover/publication 3

[14] Y. Kuwata, J. Teo, G. Fiore, S. Karaman, E. Frazzoli, and J. P. How,“Real-time motion planning with applications to autonomous urbandriving,” IEEE Transactions on Control Systems Technology, vol. 17,no. 5, pp. 1105–1118, 2009. 3

[15] B. Paden, M. Cáp, S. Z. Yong, D. Yershov, and E. Frazzoli, "A survey of motion planning and control techniques for self-driving urban vehicles," IEEE Transactions on Intelligent Vehicles, vol. 1, no. 1, pp. 33–55, 2016. 3

[16] W. Schwarting, J. Alonso-Mora, and D. Rus, “Planning and decision-making for autonomous vehicles,” Annual Review of Control, Robotics,and Autonomous Systems, no. 0, 2018. 3

[17] S. Kuutti, R. Bowden, Y. Jin, P. Barber, and S. Fallah, “A surveyof deep learning applications to autonomous vehicle control,” IEEETransactions on Intelligent Transportation Systems, 2020. 3

[18] T. M. Mitchell, Machine learning, ser. McGraw-Hill series in computerscience. Boston (Mass.), Burr Ridge (Ill.), Dubuque (Iowa): McGraw-Hill, 1997. 3

[19] S. J. Russell and P. Norvig, Artificial intelligence: a modern approach(3rd edition). Prentice Hall, 2009. 3

[20] Z.-W. Hong, T.-Y. Shann, S.-Y. Su, Y.-H. Chang, T.-J. Fu, and C.-Y. Lee, “Diversity-driven exploration strategy for deep reinforcementlearning,” in Advances in Neural Information Processing Systems 31,S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi,and R. Garnett, Eds., 2018, pp. 10 489–10 500. 3

[21] M. Wiering and M. van Otterlo, Eds., Reinforcement Learning: State-of-the-Art. Springer, 2012. 4

[22] M. L. Puterman, Markov Decision Processes: Discrete StochasticDynamic Programming, 1st ed. New York, NY, USA: John Wiley& Sons, Inc., 1994. 4

[23] C. J. Watkins and P. Dayan, “Technical note: Q-learning,” MachineLearning, vol. 8, no. 3-4, 1992. 4

[24] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G.Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovskiet al., “Human-level control through deep reinforcement learning,”Nature, vol. 518, 2015. 4, 6

[25] C. J. C. H. Watkins, “Learning from delayed rewards,” Ph.D. disserta-tion, King’s College, Cambridge, 1989. 4, 6

[26] D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Ried-miller, “Deterministic policy gradient algorithms,” in ICML, 2014. 5

[27] R. J. Williams, “Simple statistical gradient-following algorithms forconnectionist reinforcement learning,” Machine Learning, vol. 8, pp.229–256, 1992. 5

[28] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz, “Trustregion policy optimization,” in International Conference on MachineLearning, 2015, pp. 1889–1897. 5

[29] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Prox-imal policy optimization algorithms,” arXiv preprint arXiv:1707.06347,2017. 5

[30] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa,D. Silver, and D. Wierstra, “Continuous control with deep reinforce-ment learning.” in 4th International Conference on Learning Represen-tations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, ConferenceTrack Proceedings, Y. Bengio and Y. LeCun, Eds., 2016. 5

[31] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley,D. Silver, and K. Kavukcuoglu, “Asynchronous methods for deep rein-forcement learning,” in International Conference on Machine Learning,2016. 5

[32] T. Haarnoja, H. Tang, P. Abbeel, and S. Levine, “Reinforcementlearning with deep energy-based policies,” in Proceedings of the 34thInternational Conference on Machine Learning - Volume 70, ser.ICML’17. JMLR.org, 2017, pp. 1352–1361. 6

[33] T. Haarnoja, A. Zhou, K. Hartikainen, G. Tucker, S. Ha, J. Tan,V. Kumar, H. Zhu, A. Gupta, P. Abbeel et al., “Soft actor-criticalgorithms & applications,” arXiv:1812.05905, 2018. 6

[34] R. S. Sutton, “Integrated architectures for learning, planning, andreacting based on approximating dynamic programming,” in MachineLearning Proceedings 1990. Elsevier, 1990. 6

[35] R. I. Brafman and M. Tennenholtz, “R-max-a general polynomial timealgorithm for near-optimal reinforcement learning,” Journal of MachineLearning Research, vol. 3, no. Oct, 2002. 6

[36] D. Silver, R. S. Sutton, and M. Müller, “Sample-based learning andsearch with permanent and transient memories,” in Proceedings of the25th international conference on Machine learning. ACM, 2008, pp.968–975. 6

[37] G. A. Rummery and M. Niranjan, “On-line Q-learning using con-nectionist systems,” Cambridge University Engineering Department,Cambridge, England, Tech. Rep. TR 166, 1994. 6

[38] R. S. Sutton and A. G. Barto, “Reinforcement learning an introduction–second edition, in progress (draft),” 2015. 6

[39] R. Bellman, Dynamic Programming. Princeton, NJ, USA: PrincetonUniversity Press, 1957. 6

[40] G. Tesauro, “Td-gammon, a self-teaching backgammon program,achieves master-level play,” Neural Computing, vol. 6, no. 2, Mar. 1994.6

[41] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. VanDen Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam,M. Lanctot et al., “Mastering the game of go with deep neural networksand tree search,” nature, vol. 529, no. 7587, pp. 484–489, 2016. 6, 8

[42] K. Narasimhan, T. Kulkarni, and R. Barzilay, “Language understandingfor text-based games using deep reinforcement learning,” in Pro-ceedings of the 2015 Conference on Empirical Methods in NaturalLanguage Processing. Association for Computational Linguistics,2015, pp. 1–11. 7

[43] T. Schaul, J. Quan, I. Antonoglou, and D. Silver, “Prioritized experiencereplay,” arXiv preprint arXiv:1511.05952, 2015. 7

[44] H. Van Hasselt, A. Guez, and D. Silver, “Deep reinforcement learningwith double q-learning.” in AAAI, vol. 16, 2016, pp. 2094–2100. 7

[45] Z. Wang, T. Schaul, M. Hessel, H. Van Hasselt, M. Lanctot, andN. De Freitas, “Dueling network architectures for deep reinforcementlearning,” arXiv preprint arXiv:1511.06581, 2015. 7

[46] M. Hausknecht and P. Stone, “Deep recurrent q-learning for partiallyobservable mdps,” CoRR, abs/1507.06527, 2015. 7

[47] B. F. Skinner, The behavior of organisms: An experimental analysis.Appleton-Century, 1938. 7

[48] E. Wiewiora, “Reward shaping,” in Encyclopedia of Machine Learningand Data Mining, C. Sammut and G. I. Webb, Eds. Boston, MA:Springer US, 2017, pp. 1104–1106. 7

[49] J. Randløv and P. Alstrøm, “Learning to drive a bicycle using re-inforcement learning and shaping,” in Proceedings of the FifteenthInternational Conference on Machine Learning, ser. ICML ’98. SanFrancisco, CA, USA: Morgan Kaufmann Publishers Inc., 1998, pp.463–471. 7

[50] D. H. Wolpert, K. R. Wheeler, and K. Tumer, “Collective intelligencefor control of distributed dynamical systems,” EPL (Europhysics Let-ters), vol. 49, no. 6, p. 708, 2000. 8

[51] A. Y. Ng, D. Harada, and S. J. Russell, “Policy invariance underreward transformations: Theory and application to reward shaping,”in Proceedings of the Sixteenth International Conference on MachineLearning, ser. ICML ’99, 1999, pp. 278–287. 8

[52] S. Devlin and D. Kudenko, “Theoretical considerations of potential-based reward shaping for multi-agent systems,” in Proceedings of the10th International Conference on Autonomous Agents and MultiagentSystems (AAMAS), 2011. 8

[53] P. Mannion, S. Devlin, K. Mason, J. Duggan, and E. Howley, “Policyinvariance under reward transformations for multi-objective reinforce-ment learning,” Neurocomputing, vol. 263, 2017. 8

[54] M. Colby and K. Tumer, “An evolutionary game theoretic analysis ofdifference evaluation functions,” in Proceedings of the 2015 AnnualConference on Genetic and Evolutionary Computation. ACM, 2015,pp. 1391–1398. 8

[55] P. Mannion, J. Duggan, and E. Howley, “A theoretical and empiricalanalysis of reward transformations in multi-objective stochastic games,”in Proceedings of the 16th International Conference on AutonomousAgents and Multiagent Systems (AAMAS), 2017. 8

[56] L. Busoniu, R. Babuška, and B. Schutter, “Multi-agent reinforcementlearning: An overview,” in Innovations in Multi-Agent Systems and Ap-plications - 1, ser. Studies in Computational Intelligence, D. Srinivasanand L. Jain, Eds. Springer Berlin Heidelberg, 2010, vol. 310. 8

[57] P. Mannion, K. Mason, S. Devlin, J. Duggan, and E. Howley, “Multi-objective dynamic dispatch optimisation using multi-agent reinforce-ment learning,” in Proceedings of the 15th International Conferenceon Autonomous Agents and Multiagent Systems (AAMAS), 2016. 8

[58] K. Mason, P. Mannion, J. Duggan, and E. Howley, “Applying multi-agent reinforcement learning to watershed management,” in Proceed-ings of the Adaptive and Learning Agents workshop (at AAMAS 2016),2016. 8

[59] V. Pareto, Manual of political economy. OUP Oxford, 1906. 8

[60] D. M. Roijers, P. Vamplew, S. Whiteson, and R. Dazeley, "A survey of multi-objective sequential decision-making," Journal of Artificial Intelligence Research, vol. 48, pp. 67–113, 2013. 8

[61] R. Radulescu, P. Mannion, D. M. Roijers, and A. Nowé, “Multi-objective multi-agent decision making: a utility-based analysis andsurvey,” Autonomous Agents and Multi-Agent Systems, vol. 34, no. 1,p. 10, 2020. 8

[62] T. Lesort, N. Diaz-Rodriguez, J.-F. Goudou, and D. Filliat, “Staterepresentation learning for control: An overview,” Neural Networks,vol. 108, pp. 379 – 392, 2018. 8


[63] A. Raffin, A. Hill, K. R. Traoré, T. Lesort, N. D. Rodríguez, andD. Filliat, “Decoupling feature extraction from policy learning: assess-ing benefits of state representation learning in goal based robotics,”CoRR, vol. abs/1901.08651, 2019. 8

[64] W. Böhmer, J. T. Springenberg, J. Boedecker, M. Riedmiller, andK. Obermayer, “Autonomous learning of state representations forcontrol: An emerging field aims to autonomously learn state represen-tations for reinforcement learning agents from their real-world sensorobservations,” KI-Künstliche Intelligenz, vol. 29, no. 4, pp. 353–362,2015. 8

[65] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang,A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton et al., “Mastering thegame of go without human knowledge,” Nature, vol. 550, no. 7676, p.354, 2017. 8

[66] P. Abbeel and A. Y. Ng, “Exploration and apprenticeship learningin reinforcement learning,” in Proceedings of the 22nd internationalconference on Machine learning. ACM, 2005, pp. 1–8. 9

[67] B. Kang, Z. Jie, and J. Feng, “Policy optimization with demonstra-tions,” in International Conference on Machine Learning, 2018, pp.2474–2483. 9

[68] T. Hester, M. Vecerik, O. Pietquin, M. Lanctot, T. Schaul, B. Piot,D. Horgan, J. Quan, A. Sendonaris, I. Osband et al., “Deep q-learningfrom demonstrations,” in Thirty-Second AAAI Conference on ArtificialIntelligence, 2018. 9

[69] S. Ibrahim and D. Nevin, “End-to-end framework for fast learningasynchronous agents,” in the 32nd Conference on Neural InformationProcessing Systems, Imitation Learning and its Challenges in Roboticsworkshop, 2018. 9

[70] P. Abbeel and A. Y. Ng, “Apprenticeship learning via inverse rein-forcement learning,” in Proceedings of the twenty-first internationalconference on Machine learning. ACM, 2004, p. 1. 9

[71] A. Y. Ng, S. J. Russell et al., “Algorithms for inverse reinforcementlearning.” in ICML, 2000. 9

[72] J. Ho and S. Ermon, “Generative adversarial imitation learning,” inAdvances in Neural Information Processing Systems, 2016, pp. 4565–4573. 9

[73] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley,S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,”in Advances in Neural Information Processing Systems 27, 2014, pp.2672–2680. 9

[74] M. Uricár, P. Krížek, D. Hurych, I. Sobh, S. Yogamani, and P. Denny,“Yes, we gan: Applying adversarial techniques for autonomous driv-ing,” Electronic Imaging, vol. 2019, no. 15, pp. 48–1, 2019. 9

[75] E. Leurent, Y. Blanco, D. Efimov, and O.-A. Maillard, “A survey ofstate-action representations for autonomous driving,” HAL archives,2018. 9

[76] H. Xu, Y. Gao, F. Yu, and T. Darrell, “End-to-end learning of drivingmodels from large-scale video datasets,” in Proceedings of the IEEEconference on computer vision and pattern recognition, 2017, pp.2174–2182. 9

[77] R. S. Sutton, D. Precup, and S. Singh, “Between mdps and semi-mdps:A framework for temporal abstraction in reinforcement learning,”Artificial intelligence, vol. 112, no. 1-2, pp. 181–211, 1999. 9

[78] A. Dosovitskiy, G. Ros, F. Codevilla, A. Lopez, and V. Koltun,“CARLA: An open urban driving simulator,” in Proceedings of the1st Annual Conference on Robot Learning, 2017, pp. 1–16. 9, 12

[79] C. Li and K. Czarnecki, “Urban driving with multi-objective deepreinforcement learning,” in Proceedings of the 18th International Con-ference on Autonomous Agents and MultiAgent Systems. InternationalFoundation for Autonomous Agents and Multiagent Systems, 2019, pp.359–367. 9, 10

[80] S. Kardell and M. Kuosku, “Autonomous vehicle control via deepreinforcement learning,” Master’s thesis, Chalmers University of Tech-nology, 2017. 9

[81] J. Chen, B. Yuan, and M. Tomizuka, “Model-free deep reinforcementlearning for urban autonomous driving,” in 2019 IEEE IntelligentTransportation Systems Conference (ITSC). IEEE, 2019, pp. 2765–2771. 9

[82] A. E. Sallab, M. Abdou, E. Perot, and S. Yogamani, “End-to-enddeep reinforcement learning for lane keeping assist,” in MLITS, NIPSWorkshop, vol. 2, 2016. 11

[83] A.-E. Sallab, M. Abdou, E. Perot, and S. Yogamani, “Deep reinforce-ment learning framework for autonomous driving,” Electronic Imaging,vol. 2017, no. 19, pp. 70–76, 2017. 11

[84] P. Wang, C.-Y. Chan, and A. de La Fortelle, “A reinforcement learningbased approach for automated lane change maneuvers,” in 2018 IEEEIntelligent Vehicles Symposium (IV). IEEE, 2018, pp. 1379–1384. 11

[85] P. Wang and C.-Y. Chan, “Formulation of deep reinforcement learningarchitecture toward autonomous driving for on-ramp merge,” in Intel-ligent Transportation Systems (ITSC), 2017 IEEE 20th InternationalConference on. IEEE, 2017, pp. 1–6. 11

[86] D. C. K. Ngai and N. H. C. Yung, “A multiple-goal reinforcementlearning method for complex vehicle overtaking maneuvers,” IEEETransactions on Intelligent Transportation Systems, vol. 12, no. 2, pp.509–522, 2011. 11

[87] D. Isele, R. Rahimi, A. Cosgun, K. Subramanian, and K. Fujimura,“Navigating occluded intersections with autonomous vehicles usingdeep reinforcement learning,” in 2018 IEEE International Conferenceon Robotics and Automation (ICRA). IEEE, 2018, pp. 2034–2039.10, 11

[88] A. Keselman, S. Ten, A. Ghazali, and M. Jubeh, “Reinforcement learn-ing with a* and a deep heuristic,” arXiv preprint arXiv:1811.07745,2018. 11

[89] W. Zhan, L. Sun, D. Wang, H. Shi, A. Clausse, M. Naumann, J. Küm-merle, H. Königshof, C. Stiller, A. de La Fortelle, and M. Tomizuka,“INTERACTION Dataset: An INTERnational, Adversarial and Coop-erative moTION Dataset in Interactive Driving Scenarios with SemanticMaps,” arXiv:1910.03088 [cs, eess], 2019. 10

[90] A. Kendall, J. Hawke, D. Janz, P. Mazur, D. Reda, J.-M. Allen, V.-D.Lam, A. Bewley, and A. Shah, “Learning to drive in a day,” in 2019International Conference on Robotics and Automation (ICRA). IEEE,2019, pp. 8248–8254. 10

[91] M. Watter, J. Springenberg, J. Boedecker, and M. Riedmiller, “Embedto control: A locally linear latent dynamics model for control from rawimages,” in Advances in neural information processing systems, 2015.10

[92] N. Wahlström, T. B. Schön, and M. P. Deisenroth, “Learning deepdynamical models from image pixels,” IFAC-PapersOnLine, vol. 48,no. 28, pp. 1059–1064, 2015. 10

[93] S. Chiappa, S. Racanière, D. Wierstra, and S. Mohamed, “Recurrentenvironment simulators,” in 5th International Conference on Learn-ing Representations, ICLR 2017, Toulon, France, April 24-26, 2017,Conference Track Proceedings. OpenReview.net, 2017. 10

[94] B. Recht, “A tour of reinforcement learning: The view from contin-uous control,” Annual Review of Control, Robotics, and AutonomousSystems, 2008. 10

[95] H. Mania, A. Guy, and B. Recht, “Simple random search of staticlinear policies is competitive for reinforcement learning,” in Advancesin Neural Information Processing Systems 31, S. Bengio, H. Wallach,H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, Eds.,2018, pp. 1800–1809. 10

[96] B. Wymann, E. Espié, C. Guionneau, C. Dimitrakakis, R. Coulom, andA. Sumner, “Torcs, the open racing car simulator,” Software availableat http://torcs. sourceforge. net, vol. 4, 2000. 12

[97] S. Shah, D. Dey, C. Lovett, and A. Kapoor, “Airsim: High-fidelityvisual and physical simulation for autonomous vehicles,” in Field andService Robotics. Springer, 2018, pp. 621–635. 12

[98] N. Koenig and A. Howard, “Design and use paradigms for gazebo, anopen-source multi-robot simulator,” in 2004 International Conferenceon Intelligent Robots and Systems (IROS), vol. 3. IEEE, 2004, pp.2149–2154. 12

[99] P. A. Lopez, M. Behrisch, L. Bieker-Walz, J. Erdmann, Y.-P. Flötteröd,R. Hilbrich, L. Lücken, J. Rummel, P. Wagner, and E. Wießner, “Micro-scopic traffic simulation using sumo,” in The 21st IEEE InternationalConference on Intelligent Transportation Systems. IEEE, 2018. 12

[100] C. Quiter and M. Ernst, “deepdrive/deepdrive: 2.0,” Mar. 2018.[Online]. Available: https://doi.org/10.5281/zenodo.1248998 12

[101] Nvidia, “Drive Constellation now available,” https://blogs.nvidia.com/blog/2019/03/18/drive-constellation-now-available/, 2019, [accessed14-April-2019]. 12

[102] A. S. et al., “Multi-Agent Autonomous Driving Simulator built ontop of TORCS,” https://github.com/madras-simulator/MADRaS, 2019,[Online; accessed 14-April-2019]. 12

[103] C. Wu, A. Kreidieh, K. Parvate, E. Vinitsky, and A. M. Bayen, “Flow:Architecture and benchmarking for reinforcement learning in trafficcontrol,” CoRR, vol. abs/1710.05465, 2017. 12

[104] E. Leurent, “A collection of environments for autonomous driv-ing and tactical decision-making tasks,” https://github.com/eleurent/highway-env, 2019, [Online; accessed 14-April-2019]. 12

[105] F. Rosique, P. J. Navarro, C. Fernández, and A. Padilla, “A systematicreview of perception system and simulators for autonomous vehiclesresearch,” Sensors, vol. 19, no. 3, p. 648, 2019. 10

[106] M. Cutler, T. J. Walsh, and J. P. How, "Reinforcement learning with multi-fidelity simulators," in 2014 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2014, pp. 3888–3895. 10, 12

[107] F. C. German Ros, Vladlen Koltun and A. M. Lopez, “Carla au-tonomous driving challenge,” https://carlachallenge.org/, 2019, [Online;accessed 14-April-2019]. 10

[108] W. G. Najm, J. D. Smith, M. Yanagisawa et al., “Pre-crash scenario ty-pology for crash avoidance research,” United States. National HighwayTraffic Safety Administration, Tech. Rep., 2007. 10

[109] D. A. Pomerleau, “Alvinn: An autonomous land vehicle in a neuralnetwork,” in Advances in neural information processing systems, 1989.10

[110] D. Pomerleau, “Efficient training of artificial neural networks forautonomous navigation,” Neural Computation, vol. 3, no. 1, 1991. 10

[111] M. Bojarski, D. Del Testa, D. Dworakowski, B. Firner, B. Flepp,P. Goyal, L. D. Jackel, M. Monfort, U. Muller, J. Zhang et al., “Endto end learning for self-driving cars,” in NIPS 2016 Deep LearningSymposium, 2016. 10

[112] M. Bojarski, P. Yeres, A. Choromanska, K. Choromanski, B. Firner,L. Jackel, and U. Muller, “Explaining how a deep neural net-work trained with end-to-end learning steers a car,” arXiv preprintarXiv:1704.07911, 2017. 10

[113] M. Kuderer, S. Gulati, and W. Burgard, “Learning driving styles forautonomous vehicles from demonstration,” in Robotics and Automation(ICRA), 2015 IEEE International Conference on. IEEE, 2015, pp.2641–2646. 10

[114] S. Sharifzadeh, I. Chiotellis, R. Triebel, and D. Cremers, “Learningto drive using inverse reinforcement learning and deep q-networks,” inNIPS Workshops, December 2016. 10

[115] P. Henderson, R. Islam, P. Bachman, J. Pineau, D. Precup, andD. Meger, “Deep reinforcement learning that matters,” in Thirty-SecondAAAI Conference on Artificial Intelligence, 2018. 10

[116] Y. Abeysirigoonawardena, F. Shkurti, and G. Dudek, “Generatingadversarial driving scenarios in high-fidelity simulators,” in 2019 IEEEInternational Conference on Robotics and Automation (ICRA). ICRA,2019. 10

[117] K. Bousmalis, A. Irpan, P. Wohlhart, Y. Bai, M. Kelcey, M. Kalakrish-nan, L. Downs, J. Ibarz, P. Pastor, K. Konolige et al., “Using simulationand domain adaptation to improve efficiency of deep robotic grasping,”in 2018 IEEE International Conference on Robotics and Automation(ICRA). IEEE, 2018, pp. 4243–4250. 11

[118] X. B. Peng, M. Andrychowicz, W. Zaremba, and P. Abbeel, “Sim-to-real transfer of robotic control with dynamics randomization,” in 2018IEEE international conference on robotics and automation (ICRA).IEEE, 2018, pp. 1–8. 11

[119] Z. W. Xinlei Pan, Yurong You and C. Lu, “Virtual to real rein-forcement learning for autonomous driving,” in Proceedings of theBritish Machine Vision Conference (BMVC), G. B. Tae-Kyun Kim,Stefanos Zafeiriou and K. Mikolajczyk, Eds. BMVA Press, September2017. 11

[120] A. Bewley, J. Rigley, Y. Liu, J. Hawke, R. Shen, V.-D. Lam, andA. Kendall, “Learning to drive from simulation without real worldlabels,” in 2019 International Conference on Robotics and Automation(ICRA). IEEE, 2019, pp. 4818–4824. 11

[121] J. Zhang, L. Tai, P. Yun, Y. Xiong, M. Liu, J. Boedecker, andW. Burgard, “Vr-goggles for robots: Real-to-sim domain adaptation forvisual control,” IEEE Robotics and Automation Letters, vol. 4, no. 2,pp. 1148–1155, 2019. 11

[122] H. Chae, C. M. Kang, B. Kim, J. Kim, C. C. Chung, and J. W.Choi, “Autonomous braking system via deep reinforcement learning,”2017 IEEE 20th International Conference on Intelligent TransportationSystems (ITSC), pp. 1–6, 2017. 12

[123] Z. Wang, V. Bapst, N. Heess, V. Mnih, R. Munos, K. Kavukcuoglu, andN. de Freitas, “Sample efficient actor-critic with experience replay,” in5th International Conference on Learning Representations, ICLR 2017,Toulon, France, April 24-26, 2017, Conference Track Proceedings.OpenReview.net, 2017. 12

[124] R. Liaw, S. Krishnan, A. Garg, D. Crankshaw, J. E. Gonzalez,and K. Goldberg, “Composing meta-policies for autonomous driv-ing using hierarchical deep reinforcement learning,” arXiv preprintarXiv:1711.01503, 2017. 12

[125] M. E. Taylor and P. Stone, “Transfer learning for reinforcement learningdomains: A survey,” Journal of Machine Learning Research, vol. 10,no. Jul, pp. 1633–1685, 2009. 12

[126] D. Isele and A. Cosgun, “Transferring autonomous driving knowledgeon simulated and real intersections,” in Lifelong Learning: A Reinforce-ment Learning Approach,ICML WORKSHOP 2017, 2017. 12

[127] J. X. Wang, Z. Kurth-Nelson, D. Tirumala, H. Soyer, J. Z. Leibo,R. Munos, C. Blundell, D. Kumaran, and M. Botvinick, “Learningto reinforcement learn,” Complete CogSci 2017 Proceedings, 2016. 12

[128] Y. Duan, J. Schulman, X. Chen, P. L. Bartlett, I. Sutskever, andP. Abbeel, “Fast reinforcement learning via slow reinforcement learn-ing,” arXiv preprint arXiv:1611.02779, 2016. 12

[129] C. Finn, P. Abbeel, and S. Levine, “Model-agnostic meta-learningfor fast adaptation of deep networks,” in Proceedings of the 34thInternational Conference on Machine Learning - Volume 70, ser.ICML’17. JMLR.org, 2017, p. 1126–1135. 12

[130] A. Nichol, J. Achiam, and J. Schulman, “On first-order meta-learningalgorithms,” CoRR, abs/1803.02999, 2018. 12

[131] M. Al-Shedivat, T. Bansal, Y. Burda, I. Sutskever, I. Mordatch, andP. Abbeel, “Continuous adaptation via meta-learning in nonstationaryand competitive environments,” in 6th International Conference onLearning Representations, ICLR 2018, Vancouver, BC, Canada, April30 - May 3, 2018, Conference Track Proceedings. OpenReview.net,2018. 12

[132] D. Ha and J. Schmidhuber, “Recurrent world models facilitate policyevolution,” in Advances in Neural Information Processing Systems,2018. 12

[133] S. Ross and D. Bagnell, “Efficient reductions for imitation learning,”in Proceedings of the thirteenth international conference on artificialintelligence and statistics, 2010, pp. 661–668. 12

[134] M. Bansal, A. Krizhevsky, and A. Ogale, “Chauffeurnet: Learning todrive by imitating the best and synthesizing the worst,” in Robotics:Science and Systems XV, 2018. 12

[135] T. Buhet, E. Wirbel, and X. Perrotton, “Conditional vehicle trajectoriesprediction in carla urban environment,” in Proceedings of the IEEEInternational Conference on Computer Vision Workshops, 2019. 13

[136] A. Y. Ng, D. Harada, and S. Russell, “Policy invariance under rewardtransformations: Theory and application to reward shaping,” in ICML,vol. 99, 1999, pp. 278–287. 13

[137] P. Abbeel and A. Y. Ng, “Apprenticeship learning via inverse rein-forcement learning,” in Proceedings of the Twenty-first InternationalConference on Machine Learning, ser. ICML ’04. ACM, 2004. 13

[138] N. Chentanez, A. G. Barto, and S. P. Singh, “Intrinsically motivatedreinforcement learning,” in Advances in neural information processingsystems, 2005, pp. 1281–1288. 13

[139] D. Pathak, P. Agrawal, A. A. Efros, and T. Darrell, “Curiosity-drivenexploration by self-supervised prediction,” in International Conferenceon Machine Learning (ICML), vol. 2017, 2017. 13

[140] Y. Burda, H. Edwards, D. Pathak, A. Storkey, T. Darrell, and A. A.Efros, “Large-scale study of curiosity-driven learning,” arXiv preprintarXiv:1808.04355, 2018. 13

[141] J. Zhang and K. Cho, “Query-efficient imitation learning for end-to-endsimulated driving,” in Proceedings of the Thirty-First AAAI Conferenceon Artificial Intelligence, San Francisco, California, USA., 2017, pp.2891–2897. 13

[142] S. Shalev-Shwartz, S. Shammah, and A. Shashua, “Safe, multi-agent, reinforcement learning for autonomous driving,” arXiv preprintarXiv:1610.03295, 2016. 13

[143] X. Xiong, J. Wang, F. Zhang, and K. Li, “Combining deep reinforce-ment learning and safety based control for autonomous driving,” arXivpreprint arXiv:1612.00147, 2016. 13

[144] C. Ye, H. Ma, X. Zhang, K. Zhang, and S. You, “Survival-oriented re-inforcement learning model: An effcient and robust deep reinforcementlearning algorithm for autonomous driving problem,” in InternationalConference on Image and Graphics. Springer, 2017, pp. 417–429. 13

[145] J. Garcıa and F. Fernández, “A comprehensive survey on safe rein-forcement learning,” Journal of Machine Learning Research, vol. 16,no. 1, pp. 1437–1480, 2015. 13

[146] P. Palanisamy, “Multi-agent connected autonomous driving using deepreinforcement learning,” in 2020 International Joint Conference onNeural Networks (IJCNN). IEEE, 2020, pp. 1–7. 13

[147] S. Bhalla, S. Ganapathi Subramanian, and M. Crowley, “Deep multiagent reinforcement learning for autonomous driving,” in Advances inArtificial Intelligence, C. Goutte and X. Zhu, Eds. Cham: SpringerInternational Publishing, 2020, pp. 67–78. 13

[148] A. Wachi, “Failure-scenario maker for rule-based agent using multi-agent adversarial reinforcement learning and its application to au-tonomous driving,” arXiv preprint arXiv:1903.10654, 2019. 13

[149] C. Yu, X. Wang, X. Xu, M. Zhang, H. Ge, J. Ren, L. Sun, B. Chen, andG. Tan, “Distributed multiagent coordinated learning for autonomousdriving in highways based on dynamic coordination graphs,” IEEETransactions on Intelligent Transportation Systems, vol. 21, no. 2, pp.735–748, 2020. 13


[150] P. Dhariwal, C. Hesse, O. Klimov, A. Nichol, M. Plappert, A. Radford,J. Schulman, S. Sidor, Y. Wu, and P. Zhokhov, “Openai baselines,”https://github.com/openai/baselines, 2017. 14

[151] A. Juliani, V.-P. Berges, E. Vckay, Y. Gao, H. Henry, M. Mattar, andD. Lange, “Unity: A general platform for intelligent agents,” arXivpreprint arXiv:1809.02627, 2018. 14

[152] I. Caspi, G. Leibovich, G. Novik, and S. Endrawis, “Reinforcement learning coach,” Dec. 2017. [Online]. Available: https://doi.org/10.5281/zenodo.1134899

[153] S. Guadarrama, A. Korattikara, O. Ramirez, P. Castro, E. Holly, S. Fishman, K. Wang, E. Gonina, N. Wu, C. Harris, V. Vanhoucke, and E. Brevdo, “TF-Agents: A library for reinforcement learning in TensorFlow,” 2018, [Online; accessed 25-June-2019]. [Online]. Available: https://github.com/tensorflow/agents

[154] A. Stooke and P. Abbeel, “rlpyt: A research code base for deep reinforcement learning in PyTorch,” arXiv preprint arXiv:1909.01500, 2019.

[155] I. Osband, Y. Doron, M. Hessel, J. Aslanides, E. Sezener, A. Saraiva, K. McKinney, T. Lattimore, C. Szepesvari, S. Singh et al., “Behaviour suite for reinforcement learning,” arXiv preprint arXiv:1908.03568, 2019.

B Ravi Kiran is the technical lead of the machine learning team at Navya, designing and deploying real-time deep learning architectures for perception tasks on autonomous shuttles. During his 6 years in academic research, he worked on DNNs for video anomaly detection, online time series anomaly detection, and hyperspectral image processing for tumor detection. He finished his PhD at Paris-Est in 2014, entitled Energetic lattice based optimization, which was awarded the MSTIC prize. He has worked for over 4 years in embedded programming and in autonomous driving. During his career he has published over 40 conference and journal articles.

Ibrahim Sobh has more than 20 years of experience in Machine Learning and Software Development. Dr. Sobh received his PhD in Deep Reinforcement Learning for fast learning agents acting in 3D environments. He received his B.Sc. and M.Sc. degrees in Computer Engineering from Cairo University Faculty of Engineering. His M.Sc. thesis is in the field of Machine Learning applied to automatic document summarization. Ibrahim has participated in several related national and international mega projects, conferences and summits. He delivers training and lectures for academic and industrial entities. Ibrahim’s publications, including international journal and conference papers, are mainly in the machine learning and deep learning fields. His research focuses on computer vision, natural language processing and speech processing. Currently, Dr. Sobh is a Senior Expert of AI at Valeo.

Victor Talpaert is a PhD student at U2IS, ENSTA Paris, Institut Polytechnique de Paris, 91120 Palaiseau, France. His PhD has been directed by Bruno Monsuez since 2017, in a lab specialising in robotics and complex systems. His PhD is co-sponsored by AKKA Technologies in Guyancourt, France, under the guidance of AKKA’s Autonomous Systems Team. This team has a strong focus on Autonomous Driving (AD) and the automotive industry in general. His PhD subject is learning decision making for AD, under assumptions such as a modular AD pipeline, learned features compatible with classic robotic approaches, and ontology-based hierarchical abstractions.

Patrick Mannion is a permanent member of academic staff at National University of Ireland Galway, where he lectures in Computer Science. He is also Deputy Editor of The Knowledge Engineering Review journal. He received a BEng in Civil Engineering, a HDip in Software Development and a PhD in Machine Learning from National University of Ireland Galway, a PgCert in Teaching & Learning from Galway-Mayo IT, and a PgCert in Sensors for Autonomous Vehicles from IT Sligo. He is a former Irish Research Council Scholar and a former Fulbright Scholar. His main research interests include machine learning, multi-agent systems, multi-objective optimisation, game theory and metaheuristic algorithms, with applications to domains such as transportation, autonomous vehicles, energy systems and smart grids.

Ahmad El Sallab is the Senior Chief Engineer of Deep Learning at Valeo Egypt, and a Senior Expert at Valeo Group. Ahmad has 15 years of experience in Machine Learning and Deep Learning, and acquired his M.Sc. and Ph.D. in the field in 2009 and 2013, respectively. He has worked for reputable multinational organizations in the industry since 2005, including Intel and Valeo. He has over 35 publications and book chapters on Deep Learning in top IEEE and ACM journals and conferences, in addition to 30 patents, with applications in Speech, NLP, Computer Vision and Robotics.

Senthil Yogamani is an Artificial Intelligence architect and technical leader at Valeo Ireland. He leads the research and design of AI algorithms for various modules of autonomous driving systems. He has over 14 years of experience in computer vision and machine learning, including 12 years of experience in industrial automotive systems. He is an author of over 90 publications and 60 patents with 1300+ citations. He serves on the editorial boards of various leading IEEE automotive conferences, including ITSC, IV and ICVES, and on the advisory boards of various industry consortia, including Khronos, Cognitive Vehicles and IS Auto. He is a recipient of the best associate editor award at ITSC 2015 and the best paper award at ITST 2012.

Patrick Pérez is Scientific Director of Valeo.ai, a Valeo research lab on artificial intelligence for automotive applications. He is currently on the Editorial Board of the International Journal of Computer Vision. Before joining Valeo, Patrick Pérez was a Distinguished Scientist at Technicolor (2009-2018) and a researcher at Inria (1993-2000, 2004-2009) and at Microsoft Research Cambridge (2000-2004). His research interests include multimodal scene understanding and computational imaging.