Reinforcement Learning Based Mobility Adaptive Routing for Vehicular Ad-Hoc Networks
Jinqiao Wu1 • Min Fang1 • Xiao Li1
Published online: 7 May 2018. © Springer Science+Business Media, LLC, part of Springer Nature 2018
Abstract Vehicular ad-hoc networks (VANETs) are drawing more and more attention in intelligent transportation systems as a means to reduce road accidents and assist safe driving. However, due to the high mobility and uneven distribution of vehicles in VANETs, multi-hop communication between vehicles remains particularly challenging. Considering the distinctive characteristics of VANETs, this paper proposes an adaptive routing protocol based on reinforcement learning (ARPRL). Through a distributed Q-Learning algorithm, ARPRL proactively and continually learns fresh network link status from periodic HELLO packets in the form of Q-table updates, which improves its dynamic adaptability to network changes. Novel Q-value update functions that take vehicle-mobility-related information into account are designed to reinforce the Q values of wireless links through the exchange of HELLO packets between neighboring vehicles. To avoid routing loops in the Q-learning process, the HELLO packet structure is redesigned. In addition, a reactive route probe strategy is applied during learning to speed up the convergence of Q-learning. Finally, feedback from the MAC layer is used to further improve the adaptation of Q-learning to the VANET environment. Simulation results show that ARPRL outperforms existing protocols in terms of average packet delivery ratio, end-to-end delay and route hop count, while network overhead remains within an acceptable range.
Keywords VANET · Adaptive routing · Reinforcement learning · Q-Learning
Min Fang (corresponding author): [email protected]
Jinqiao Wu: [email protected]
Xiao Li: [email protected]
1 School of Computer Science and Technology, Xidian University, No. 2, South Taibai Street, Xi'an 710071, Shaanxi, People's Republic of China
Wireless Pers Commun (2018) 101:2143–2171. https://doi.org/10.1007/s11277-018-5809-z
1 Introduction
Vehicular ad-hoc networks (VANETs) [1] are a specific type of mobile ad hoc network (MANET) that aims to provide vehicle-to-vehicle (V2V) and vehicle-to-infrastructure (V2I) communications in order to reduce traffic congestion and avoid road accidents. For V2V communications, vehicles usually communicate with each other in the absence of Road-Side Units (RSUs); hence multi-hop data transmission in VANETs is still a quite challenging issue, and an efficient routing algorithm adapted to the VANET environment is necessary.
Routing in VANETs [2–6] is considered one of the most important processes, since it allows applications designed for VANET users to provide them with specific services. Routing in VANETs selects the optimal path from the source vehicle to the destination vehicle through a set of intermediate vehicles. To support reliable and real-time transmission of messages in emergencies, the routing protocol should forward data packets with high reliability and low delay. However, due to the frequently changing topology caused by the high mobility of vehicles, existing traditional routing protocols become increasingly unreliable or even fail, which results in delayed arrival or even loss of data packets. Therefore, it is imperative to design a robust routing protocol for VANETs that guarantees highly efficient communication between vehicle nodes.
Considering the distinct characteristics of VANETs, such as high-speed movement of vehicle nodes, fast network topology changes, uneven distribution of vehicles and short link connection durations, the traditional dynamic adaptive routing protocols designed for MANETs are no longer suitable for the VANET network environment. Without significantly increasing the routing control information exchanged between vehicles, it is difficult for the vehicle nodes in VANETs to perceive the continuous change of the network topology in time and to make reasonable, optimal routing decisions and self-configuration via existing routing strategies. Reinforcement learning is one way to solve this problem. Designing a new, efficient and effective routing protocol for VANETs based on reinforcement learning, one that takes the vehicles' mobility-related information into account, can enhance the adaptability of the routing protocol to VANETs.
Reinforcement learning [7] is increasingly applied to dynamic routing problems. The Q-Learning algorithm [8] is one of the most commonly used forms of reinforcement learning; it can reach optimal decisions through continuous interaction with the environment without having to know the environment model in advance. By periodic exploration of the environment, the agents eventually obtain the optimal mapping from environment states to the actions available in those states. For the dynamic routing problem in VANETs, the whole VANET can be regarded as the environment, and each vehicle can be modeled as an agent. The packet forwarding in which each vehicle participates can be considered the interaction between the vehicle nodes and the network. Each packet forwarding, whether of a control packet or a user data packet, yields the latest state of the network.
In this paper, in order to adapt to the rapid mobility of vehicle nodes in VANETs, we propose ARPRL, an adaptive routing protocol based on reinforcement learning. ARPRL takes position-related information into account, such as vehicle position, relative velocity and direction, to learn the optimal path between the source and destination vehicles. To respond quickly to rapid topology changes, each vehicle continuously
updates its Q table by periodically sending and receiving control messages. This fundamental aspect is referred to as learning through control messages. The most common form of such control packets in the existing literature is the HELLO message used for neighbor maintenance in most proactive routing protocols, such as optimized link-state routing (OLSR) [9]. To accurately detect path breaks during ongoing data transmission, another auxiliary learning aspect, referred to as learning through DATA packets, is crucial for efficient data packet routing in highly dynamic environments such as VANETs. The last aspect, known as learning through feedback signals, is also considered for reliable data dissemination in VANETs. The feedback signal is mainly provided by the link layer, such as the IEEE 802.11 MAC.
The main contributions of this paper are as follows:
1. A novel mobility-adaptive routing protocol suitable for the VANET environment, based on a distributed Q-Learning algorithm, is proposed, in which each vehicle proactively learns the network status through Q-Learning to further improve the dynamic adaptability of the protocol. To enhance the efficiency of Q-Learning, the periodic broadcast of the redesigned HELLO packet, the forwarding of user DATA and the notification of MAC-layer packet loss are all treated as trigger sources for Q-table updates.
2. A route learning probe approach is adopted to speed up the convergence of Q-learning; accordingly, the routing delay is reduced. The packet forwarding process also contributes to the update of the Q table, which further improves the dynamic adaptability of the proposed protocol.
3. A new HELLO packet structure is designed to avoid the generation of routing loops in the learning process. Consequently, route hop counts are optimized and overall performance is improved.
The remainder of the paper is organized as follows. Section 2 reviews the related state of the art. Reinforcement learning and the Q-Learning model for the routing problem are introduced in Sect. 3. Section 4 gives an elaborated description of the proposed protocol ARPRL. Simulation results are presented in Sect. 5. Section 6 analyzes the complexity of the protocol, which is followed by conclusions in Sect. 7.
2 Related Work
In VANETs, each vehicle moves along the roads. As a result, V2V communications are highly susceptible to frequent link breaks. To solve this problem, various routing protocols have been proposed in recent decades.
The most intuitive way to solve the routing problem in VANETs is to apply the existing routing protocols designed for mobile ad hoc networks (MANETs) [10]. In MANETs, routing protocols can be classified into two main categories according to the route discovery criterion. The first is topology-based routing, which can be further subdivided into proactive and reactive routing; this type is represented by OLSR and the ad hoc on-demand distance vector (AODV) [11]. In OLSR, HELLO messages must be sent periodically to detect the joining and leaving of neighboring nodes, and routing information also needs to be exchanged periodically between neighbor nodes to obtain the global network topology. More importantly, regardless of whether a node needs to send data, each node maintains routing path information for each
other node in the network. In AODV, routing information is updated on demand. However, AODV needs to flood RREQ messages throughout the entire network to search for a new routing path once the old one is interrupted, which causes considerable routing overhead in VANETs. In addition, AODV will not switch to a suboptimal path or preemptively re-establish a new one until the current active route becomes unavailable. The second category is hybrid ad-hoc routing, represented by the Zone Routing Protocol (ZRP) [12]. Unfortunately, for the same reasons as AODV and OLSR above, ZRP is also not suitable for routing packets in VANETs.
Some other routing protocols, often referred to as geographic routing [13–17], rely on the location information of neighbor nodes for packet forwarding, e.g. Greedy Perimeter Stateless Routing (GPSR) [18], which does not need to establish a routing table for packet forwarding. Unlike topology-based routing, GPSR always forwards the packet in the direction closest to the destination node and does not need to send any routing control packets. However, GPSR relies on accurate node location information and is also prone to routing loops caused by high node mobility.
Some cluster-based routing protocols have been proposed to address VANET routing problems [19–22]. Most clustering routing algorithms designed for VANETs originate from MANETs. However, the clustering-based routing protocols [23] applicable to MANETs may not satisfy the dynamic characteristics of VANETs. LID (Lowest ID clustering algorithm) [24] is a simple clustering algorithm proposed by Gerla and Tsai. Each node is assigned an identifier (ID) that is unique across the entire network, and the node with the smallest ID is preferred as the cluster head. The disadvantage of LID is that the cluster head node may become the system performance bottleneck if it serves in the cluster head role for too long. The Distributed Clustering Algorithm (DCA) [25] selects the cluster head based on node weights; the weight may be a function of the node's transmission range or its mobility factor. A hybrid clustering routing approach [26] for VANETs has been proposed to achieve dynamic routing on the basis of vehicle ID, vehicle location ID and vehicle lifetime. Another dynamic clustering routing scheme [27] for VANETs is based on vehicle connectivity degree and mobility metrics; it considers the vehicles on a specific lane between two junctions to form dynamic and stable clusters.
In recent years, reinforcement learning has increasingly been applied to dynamic routing. Boyan and Littman proposed the Q-Routing algorithm [28] for an irregularly connected wired network of 36 nodes. Dowling et al. [29] proposed SAMPLE, a routing protocol for MANETs based on collaborative reinforcement learning. Unfortunately, SAMPLE does not take the frequent link breaks in MANETs into account. In SAMPLE, similarly to DSR, the routing information to be broadcast is added to the data packet header, so it is not suitable for applications with heavy data traffic. Building on AODV, Celimuge Wu et al. proposed an improved Q-learning routing protocol, QLAODV [30] (Q-Learning AODV), to efficiently deal with routing in highly dynamic networks such as MANETs. To address the slow convergence of Q-learning algorithms, Plate et al. presented QKS [31] (Q-learning utilizing Kinematics and Sweeping), a Q-learning-based routing approach with kinematic and sweeping features for underwater networks. Santhi et al. proposed a MANET multicast routing protocol, QLMAODV [32] (Q-Learning MAODV [33]), by applying Q-learning to the existing MAODV protocol. Based on distributed Q-learning, QLMAODV learns network status information and improves the performance of MAODV by preemptively choosing a sub-optimal route before the current active route becomes invalid. However, QLMAODV is designed for MANETs and solves the multicast routing problem.
3 Reinforcement Learning
3.1 Markov Decision Process
Reinforcement learning is an efficient approach to sequential decision tasks, which can be represented as a Markov Decision Process (MDP) [34]. Generally, an MDP contains the following: (a) a set of discrete environment states $S$; (b) a set of discrete actions $A$ available to the agent in a specific state $s$; (c) an environment model $T(s,a,s')$ ($s, s' \in S$ and $a \in A$); and (d) a reward function $R(s,a,s')$. A policy $\pi(s)$ gives the selection of an action in state $s$. An MDP searches for the optimal policy $\pi^*(s)$, which maximizes the expected sum of rewards, i.e. the accumulated discounted rewards from the initial state. Let $V^{\pi}(s)$ denote the value function of the state $s \in S$, which can be formulated as:
$$V^{\pi}(s) = E_{\pi}\{R_t \mid s_t = s\} = E_{\pi}\Big\{\sum_{k=0}^{\infty} \gamma^k r_{t+k} \,\Big|\, s_t = s\Big\} = E_{\pi}\{r_t + \gamma r_{t+1} + \cdots\} \quad (1)$$
where $r_t$, $E_{\pi}\{r_t\}$ and $E_{\pi}\{r_{t+1}\}$ [see (4)] are defined as:

$$r_t = r(s,a)\big|_{s=s_t,\,a=a_t} = \sum_{s_{t+1}\in S} P^{a_t}_{s_t s_{t+1}} R^{a_t}_{s_t s_{t+1}} \quad (2)$$

$$E_{\pi}\{r_t\} = E_{\pi}\{r(s_t,a_t)\} = \sum_{a_t\in A} \pi(s_t,a_t)\, r_t \quad (3)$$

$$E_{\pi}\{r_{t+1}\} = E_{\pi}\{r(s_{t+1},a_{t+1})\} = \sum_{a_t\in A} \pi(s_t,a_t) \Bigg[\sum_{s_{t+1}\in S} P^{a_t}_{s_t s_{t+1}} \sum_{a_{t+1}\in A} \pi(s_{t+1},a_{t+1})\, r_{t+1}\Bigg]\Bigg|_{s_t=s} \quad (4)$$
here $P^{a_t}_{s_t s_{t+1}}$ satisfies

$$\sum_{s'\in S} P^{a}_{s s'} = 1$$

in which

$$P^{a}_{s s'} = \Pr\{s_{t+1} = s' \mid s_t = s,\, a_t = a\}$$
In (4), $a_t$ denotes the action executed in state $s_t$ at time $t$; $P^{a_t}_{s_t s_{t+1}}$ denotes the probability of transitioning from the current state $s_t$ to the next state $s_{t+1}$ when taking action $a_t$; and $R^{a_t}_{s_t s_{t+1}}$ is the immediate reward for taking action $a_t$ in state $s_t$ and arriving at state $s_{t+1}$.
From Bellman’s optimality rule in Dynamic Programming, once the state transition and
value function are known, the optimal solution can be obtained. Hence, substituting (4) into
(1), we can obtain the optimal value function in the form of Bellman equations as follows:
$$\begin{aligned} V^{*}(s) &= \max_{\pi} V^{\pi}(s) \\ &= \max_{\pi} E_{\pi}\Big\{\sum_{k=0}^{\infty}\gamma^k r_{t+k} \,\Big|\, s_t = s\Big\} \\ &= \max_{\pi} E_{\pi}\Big\{r_t + \sum_{k=1}^{\infty}\gamma^k r_{t+k} \,\Big|\, s_t = s\Big\} \\ &= \max_{a}\Big[r_t + \gamma \sum_{s_{t+1}\in S} P^{a}_{s_t s_{t+1}} V^{\pi}(s_{t+1})\Big]\Big|_{s_t = s} \end{aligned} \quad (5)$$
where $V^{*}(s)$ denotes the optimal value in state $s$ and $\pi$ is a mapping from the state set $S$ to the action set $A$.
3.2 Q-Learning
Based on (5), the value function can also be defined over state–action pairs, estimating the quality of performing an action $a$ in a state $s$; this is named the Q function and denoted $Q(s,a)$. Then (5) can be rewritten as below when applying a policy $\pi$:
$$\begin{aligned} Q^{\pi}(s,a) &= E_{\pi}\{R_t \mid s_t = s,\, a_t = a\} \\ &= E_{\pi}\Big\{\sum_{k=0}^{\infty}\gamma^k r_{t+k} \,\Big|\, s_t = s,\, a_t = a\Big\} \\ &= E_{\pi}\Big\{r_t + \sum_{k=1}^{\infty}\gamma^k r_{t+k} \,\Big|\, s_t = s,\, a_t = a\Big\} \\ &= \Big(r_t + \gamma \sum_{s_{t+1}\in S} P^{a}_{s_t s_{t+1}} V^{\pi}(s_{t+1})\Big)\Big|_{s_t = s} \end{aligned} \quad (6)$$
To make $V^{*}(s)$ independent of any specific policy $\pi$, we define:

$$V^{*}(s) = \max_{a} Q^{*}(s,a) \quad (7)$$

Then, substituting (7) into (6), we get:

$$Q^{*}(s_t,a_t) = r_t + \gamma \sum_{s_{t+1}\in S} P^{a}_{s_t s_{t+1}} \max_{a} Q^{*}(s_{t+1},a) \quad (8)$$
However, in practice the environment model is not known a priori. In such circumstances, the optimal value functions can be obtained through TD (temporal difference) and MC (Monte Carlo) RL algorithms [35], which are better suited to finding an optimal policy. Q-Learning is a well-known TD algorithm, which works by evaluating the state–action pair values through interactions with the environment without knowing its model in advance. Based on the above analysis, Eq. (8) can be written as:
$$Q(s_t,a_t) \leftarrow (1-\alpha)\,Q(s_t,a_t) + \alpha\Big[r_t + \gamma \max_{a} Q(s_{t+1},a)\Big] \quad (9)$$

where $\alpha$ ($0 \le \alpha \le 1$) is the learning rate, which limits how rapidly the learning process can proceed; $r_t$ is the immediate feedback from the environment; and the discount factor $\gamma$ ($0 \le \gamma \le 1$) determines how important the future Q-values are.
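As a concrete illustration of the update rule (9), the following minimal sketch applies it to a tiny invented scenario; the two-state chain, its reward of 10 and the hyperparameters α = 0.5, γ = 0.9 are assumptions for the example, not values from this paper.

```python
# Minimal sketch of the tabular Q-learning update in Eq. (9).
# The two-state chain s0 -> s1 -> goal, its rewards and the
# hyperparameters are invented purely for illustration.
ALPHA, GAMMA = 0.5, 0.9  # learning rate and discount factor

def q_update(Q, s, a, r, s_next):
    """Apply Eq. (9): Q(s,a) <- (1-alpha)*Q(s,a) + alpha*[r + gamma*max_a' Q(s',a')]."""
    best_next = max(Q[s_next].values()) if Q.get(s_next) else 0.0
    Q[s][a] = (1 - ALPHA) * Q[s][a] + ALPHA * (r + GAMMA * best_next)
    return Q[s][a]

# Q table: state -> {action: value}; "goal" is terminal, so it has no entry.
Q = {"s0": {"right": 0.0}, "s1": {"right": 0.0}}

q_update(Q, "s1", "right", 10.0, "goal")  # reaching the goal pays 10 -> 5.0
q_update(Q, "s0", "right", 0.0, "s1")     # bootstraps from Q(s1)     -> 2.25
print(Q)
```

Repeating the two updates drives both entries toward the true discounted returns, which is the convergence behavior the routing protocol later relies on.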
3.3 Routing Based on Q-Learning
Based on Q-Learning, the routing problem can be modeled as follows. The entire network is the environment. Packet forwarding by the intermediate nodes corresponds to the interaction process in Q-Learning. Each packet denotes an agent. Each node can be considered one state for a packet being forwarded through the network, so all the network nodes together compose the state space for all forwarded packets. The selection of a next hop from the one-hop neighbors can be regarded as an action; therefore the set of actions available at a node for a packet is its set of one-hop neighbors. Obviously, state transitions map to the forwarding of a packet from one node to one of its neighbors. However, the learning task must be done in a distributed way at each node, since a global view of the network state is impossible. Once the optimal next hop for an outgoing packet is determined, the node gets back the reward.
According to (9), combined with the routing problem, we can intuitively revise it as:

$$Q_s(d,x) \leftarrow (1-\alpha)\,Q_s(d,x) + \alpha\Big[R + \gamma \cdot \max_{y\in N(x)} Q_x(d,y)\Big] \quad (10)$$
where $Q_s(d,x)$ denotes the Q value at source node $s$ for destination node $d$ through its neighbor $x$; $N(x)$ denotes the one-hop neighbors of $x$; and $R$ is the instant reward for $s$ forwarding the packet toward $d$ through $x$. $\alpha$ is the learning rate and controls the speed of the learning task: the higher the value of $\alpha$, the faster the Q value is updated and the better the adaptability to the dynamic characteristics of the network. However, if $\alpha$ is too large, the rewards can mislead packet forwarding, because agents may learn incorrect immediate rewards in some cases. $\gamma$ is the discount factor and denotes the importance of future rewards: if $\gamma$ is too low, immediate rewards dominate, while higher values of $\gamma$ lead to over-reliance on future rewards.
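A per-node Q table implementing (10) can be sketched as follows; the node names, the advertised values and the choice α = 0.9, γ = 1.0 are illustrative assumptions, not values mandated by the protocol.

```python
# Sketch of the per-node routing update of Eq. (10): node s stores
# Q_s(d, x) for every neighbor x, and learns from the best Q value
# advertised by x, i.e. max over y in N(x) of Q_x(d, y).
ALPHA, GAMMA = 0.9, 1.0  # illustrative choices

def update(Q_s, d, x, reward, advertised_max):
    """Eq. (10), with the neighbor's advertised maximum folded in."""
    Q_s[(d, x)] = (1 - ALPHA) * Q_s[(d, x)] + ALPHA * (reward + GAMMA * advertised_max)

def best_next_hop(Q_s, d, neighbors):
    """Forward toward d via the neighbor holding the maximal Q value."""
    return max(neighbors, key=lambda x: Q_s[(d, x)])

# Node s with two neighbors toward destination D; values start at 0.
Q_s = {("D", "x1"): 0.0, ("D", "x2"): 0.0}
update(Q_s, "D", "x1", 0.0, 90.0)  # x1 advertises a learned value of 90
update(Q_s, "D", "x2", 0.0, 40.0)  # x2 advertises 40
print(best_next_hop(Q_s, "D", ["x1", "x2"]))  # -> x1
```

Keying the table by (destination, neighbor) pairs mirrors the row/column layout of the Q tables described in Sect. 4.3.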
4 The Proposed Protocol
In this section, we provide an elaborated description of ARPRL. ARPRL is designed for VANETs and aims to find an optimal route between the source and destination. ARPRL employs a distributed Q-learning algorithm to learn the best multi-hop route, considering the distinguishing characteristics of VANETs.
4.1 Assumptions
In the description of the ARPRL algorithm, we make the following network operational assumptions. Each vehicle has a unique ID ranging from 1 to N, where N denotes
the number of vehicles. Each vehicle knows its own location and velocity through GPS, and those of its one-hop neighbors via periodic HELLO messages. MAC-layer packet losses are fed back to the network layer. Each node maintains two tables: a neighbor table and a Q table. The neighbor table stores dynamic neighbor information and is updated through HELLO messages. The Q table is used to route packets and is updated through HELLO messages, DATA forwarding and the MAC layer's feedback signal. Each vehicle periodically exchanges the optimal part of its Q-table information with its neighbor nodes. Every vehicle becomes aware of the joining and leaving of its neighbors via the HELLO timer.
4.2 ARPRL Protocol Overview
In ARPRL, each vehicle periodically exchanges HELLO packets with its neighbors to learn real-time changes in the network topology. This is one of the most important forms of dynamic learning. On receiving a HELLO message, each vehicle updates its Q table according to the information the message contains. Each HELLO packet includes the sender vehicle's position, its velocity and some Q values extracted from its Q table. When a source vehicle has a packet to send to a destination vehicle, it first searches its Q table for a valid next hop. If none exists, ARPRL initiates a route probe process, which closely resembles that of AODV, to reactively learn an optimal route to the destination. Each vehicle in the network acts as a learning agent and continuously gathers network link state information by interacting with its neighbor vehicles through the exchange of HELLO packets. Data packets are forwarded to the neighbor vehicle with the maximal Q value. Since the next hop is determined by the Q value, and the Q table is updated upon the periodic exchange of HELLO packets, DATA forwarding and the MAC layer's feedback signal, the neighbor selected as the next hop is always the best available.
4.3 Route Probe for Boosting Convergence of Q-Learning
Convergence is a key issue for Q-Learning. Therefore, at the beginning of the learning process, a proactive learning mechanism is adopted as a supplement to speed up Q-Learning convergence. When a route to a destination vehicle is needed, the source vehicle broadcasts a learning probe request (LPREQ) packet. Upon first receiving the LPREQ packet, each intermediate vehicle rebroadcasts it and updates the corresponding Q value, until the packet reaches the destination vehicle. The destination vehicle then unicasts a learning probe reply (LPREP) packet back to the source vehicle by consulting the just-learned Q table. Algorithm 1 presents the pseudocode for the route probe process.
The following example illustrates the route probe process through the LPREQ and LPREP packets. In the road network segment shown in Fig. 1, each vehicle maintains a Q table consisting of Q-values $Q(d,x)$, where $d$ is the destination vehicle and $x$ is the next hop toward $d$. As illustrated in Fig. 1, all Q-table values are initialized to 0, except those where both the destination and the next hop equal the current vehicle. Taking the Q table of source vehicle $S$ in Fig. 1 as an example, $Q_S(S,S)$ is set to 1, marked by the dark green cell in Fig. 1, while the other values are set to 0. When $S$ initially sends packets to $D$, $S$ first broadcasts an LPREQ message. On receiving the LPREQ message, $V_1$'s Q-table value $Q_{V_1}(S,S)$ is updated according to (10). For simplicity, the learning rate $\alpha$ and discount factor $\gamma$ in (10) are set to 0.9 and 1.0, respectively. The constant reward $R$ equals 100 if the LPREQ message was broadcast by $S$ itself; otherwise $R$ is set to 0. Note that each row in the Q table represents a destination node and each column denotes a next hop. In addition, there is a specific row, the one whose destination node equals the vehicle itself, that is used as the route flag row.
Therefore $Q_{V_1}(S,S)$ can be calculated according to (10) as follows:

$$Q_{V_1}(S,S) = (1-0.9)\times 0 + 0.9\times\{100 + 1\times 0\} = 90$$

After $Q_{V_1}(S,S)$ is updated, the corresponding route flag $Q_{V_1}(V_1,S)$ is set to 1. Since $V_1$ is not the destination vehicle, it continues to rebroadcast the LPREQ message to its neighbors $V_2$ and $V_3$. Then $V_2$ and $V_3$ update their respective Q tables as $V_1$ did. However, unlike $V_1$, $V_2$ and $V_3$ update $Q_{V_2}(S,V_1)$ and $Q_{V_3}(S,V_1)$ in addition to $Q_{V_2}(V_1,V_1)$ and $Q_{V_3}(V_1,V_1)$. $Q_{V_2}(S,V_1)$ and $Q_{V_3}(S,V_1)$ can be calculated from (10) as follows:

$$Q_{V_2}(S,V_1) = (1-0.9)\times 0 + 0.9\times\{0 + 1\times 90\} = 81$$

$$Q_{V_3}(S,V_1) = (1-0.9)\times 0 + 0.9\times\{0 + 1\times 90\} = 81$$

Then the corresponding route flags $Q_{V_2}(V_2,S)$, $Q_{V_2}(V_2,V_1)$, $Q_{V_3}(V_3,S)$ and $Q_{V_3}(V_3,V_1)$ are also set to 1. Eventually, the LPREQ message successfully reaches $D$, and the updated Q
Fig. 1 A segment of road network and each vehicle’s initial Q-Table
table of each vehicle is shown in Fig. 2, in which the light blue cells indicate the previous-step Q-table status, and the dark green and yellow cells indicate the newly marked route flags and the Q values learned through the LPREQ message. It is worth noting that, for the update of $Q_D(S,V_3)$, from Fig. 2 we obviously have:

$$\max_{y\in N(V_3)} Q_{V_3}(S,y) = Q_{V_3}(S,V_1)$$

In addition, $Q_{V_3}(S,V_1)$ is contained in the LPREQ message broadcast by $V_3$. Thereafter, upon receiving the LPREQ message sent by $V_3$, $D$ extracts $Q_{V_3}(S,V_1)$ from the message and updates $Q_D(S,V_3)$ as follows:

$$Q_D(S,V_3) = (1-0.9)\times 0 + 0.9\times\{0 + 1\times 81\} = 72.9$$
Upon receiving the LPREQ message, $D$ responds with an LPREP packet back to $S$. The LPREP message is routed backward through the intermediate vehicles according to the learned knowledge. Upon receiving the LPREP message, each intermediate vehicle updates the corresponding Q value according to (10). A path between $S$ and $D$ is thereby discovered through the LPREQ and LPREP messages. The updated Q-table values of each vehicle are shown in Fig. 3. According to the Q table, the newly discovered path from $S$ to $D$ is $S \to V_1 \to V_3 \to D$, as indicated in Fig. 3.
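The Q values in the worked example above can be reproduced with a short script; the chain S → V1 → V3 → D and the constants α = 0.9, γ = 1.0 and R = 100 on the first hop follow the text.

```python
# Reproduces the LPREQ Q-value propagation of the example above using
# Eq. (10): alpha = 0.9, gamma = 1.0, and R = 100 only on the hop that
# hears S directly.
ALPHA, GAMMA = 0.9, 1.0

def lpreq_update(old_q, reward, neighbor_max):
    return (1 - ALPHA) * old_q + ALPHA * (reward + GAMMA * neighbor_max)

q_v1 = lpreq_update(0.0, 100.0, 0.0)  # V1 hears S directly      -> 90.0
q_v3 = lpreq_update(0.0, 0.0, q_v1)   # V3 hears V1's value 90   -> 81.0
q_d  = lpreq_update(0.0, 0.0, q_v3)   # D hears V3's value 81    -> 72.9
print(q_v1, q_v3, q_d)
```

Each rebroadcast multiplies the best upstream value by α·γ, so Q values decay geometrically with hop count, which is why shorter paths end up preferred.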
4.4 Format Design and Handling of HELLO Messages for Route Loop Reduction
The structure of HELLO packets is depicted in Fig. 4. The information contained in a HELLO message includes an ID, position, velocity, creation time and an array of QMax entries. Each QMax entry consists of three fields: destination vehicle, Q value and next hop. The NextHop field is the key to avoiding single-hop route loops. Algorithm 2 presents the pseudocode for HELLO message processing.
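Based on Fig. 4 and the description above, the HELLO packet can be sketched as a simple data structure; the field names, types and the loop-avoidance check below are inferred from the text rather than taken from the paper's exact wire format.

```python
# Sketch of the HELLO packet of Fig. 4; field names and types are
# inferred from the text, not the paper's exact wire format.
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class QMaxEntry:
    destination: int  # destination vehicle ID
    q_value: float    # maximal Q value toward that destination
    next_hop: int     # next hop used to reach it (loop-avoidance key)

@dataclass
class HelloPacket:
    sender_id: int
    position: Tuple[float, float]  # (x, y) from GPS
    velocity: Tuple[float, float]  # (vx, vy)
    create_time_ms: int            # broadcast timestamp in milliseconds
    qmax: List[QMaxEntry] = field(default_factory=list)

def should_skip(entry: QMaxEntry, receiver_id: int) -> bool:
    """A receiver ignores QMax entries whose next hop is itself; this is
    how the NextHop field prevents single-hop route loops."""
    return entry.next_hop == receiver_id

hello = HelloPacket(3, (120.0, 40.0), (12.0, 0.0), 39854,
                    [QMaxEntry(destination=7, q_value=81.0, next_hop=1)])
print(should_skip(hello.qmax[0], receiver_id=1))  # -> True
```

The `should_skip` check captures the intuition: a value that would route back through the receiver itself must not reinforce that receiver's own Q table.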
Fig. 2 Q Table after Broadcast of LPREQ from S
Fig. 3 Q Table after Unicast of LPREP from D to S
Fig. 4 HELLO Construction according to Q table
Each vehicle maintains a HELLO timer that fires once every HELLO_INTERVAL; when the timer expires, the vehicle constructs a new HELLO message according to its Q table and broadcasts it. Taking $V_3$ as an example, the contents of its Q table and the corresponding HELLO message are shown in Fig. 4. The value of the timestamp field, 39,854 ms, is the broadcast time of the HELLO message.
When receiving a HELLO message, each vehicle updates its Q table according to the information the message contains. The Q values for a specific neighbor are reset to 0 if no HELLO message has been received from that neighbor for a certain time. For example, the Q values routed through $V_1$, such as $Q_{V_3}(S,V_1)$, are set to 0 when $V_1$ goes out of the range of $V_3$. Furthermore, the routing flag $Q_{V_3}(V_3,V_1)$ is also set to 0 accordingly.
4.5 Q-Table Maintenance Considering the Characteristics of VANETs
The Q table is updated mainly through the periodic exchange of HELLO packets between neighboring vehicle nodes, as discussed in Sect. 4.4. In addition, the Q table is updated on the reception of LPREQ and LPREP messages to boost the convergence of the Q-Learning algorithm during the route probe process, as discussed in Sect. 4.3. Furthermore, the feedback or acknowledgment mechanism provided by the MAC layer is used to further counteract the Q-table update lag caused by the fast mobility of vehicle nodes. Considering the specific characteristics of VANETs, we design a dynamic Q-table update strategy based on (10). Upon receiving a HELLO message from a neighbor vehicle $n$, the current vehicle $c$ updates its Q table as follows:
$$Q_c(d,n) = (1-\alpha_{c,n})\,Q_c(d,n) + \alpha_{c,n}\Big[R_{c,n} + \gamma_{c,n}\cdot \max_{y\in Nei(n)} Q_n(d,y)\Big] \quad (11)$$
in which $R_{c,n}$ is defined as:

$$R_{c,n} = C + HMRR_{c,n} + LET_{c,n} \quad (12)$$

where $C$ is a constant with a value of 100. $\alpha_{c,n}$ and $\gamma_{c,n}$ are defined respectively as:

$$\alpha_{c,n} = \max\left(0.2,\ \frac{\big|\,|v_c| - |v_n|\,\big|}{v_{max} - v_{min}}\right) \quad (13)$$

$$\gamma_{c,n} = \begin{cases} \dfrac{\sum_{n=1}^{N} R_{c,n}}{N}, & N \neq 0 \\ 0, & N = 0 \end{cases} \quad (14)$$
$HMRR_{c,n}$ (HELLO Message Reception Ratio) is defined as:

$$HMRR(c,n) = \begin{cases} 100 \cdot \dfrac{CNT_r(c,n)}{CNT_s(n)}, & CNT_s(n) \ge 15 \\[2mm] 100 \cdot \dfrac{CNT_r(c,n)}{CNT_s(n)} \cdot \left(1 - \left(\dfrac{1}{2}\right)^{CNT_s(n)}\right), & \text{otherwise} \end{cases}$$

where $CNT_r(c,n)$ and $CNT_s(n)$ denote the number of HELLO messages received at $c$ from the one-hop neighbor $n$ and the number sent by $n$, respectively. Here, we distinguish neighbors according to whether the neighbor duration time is less than 15 s (the case $CNT_s(n) < 15$).
$LET_{c,n}$ (Link Expiration Time) is defined as (15):

$$LET_{c,n} = \begin{cases} 100, & A = 0 \text{ and } C = 0 \\[2mm] \min\left(100,\ \dfrac{-(AB+CD) + \sqrt{(A^2+C^2)R^2 - (AD-BC)^2}}{A^2+C^2}\right), & \text{otherwise} \end{cases} \quad (15)$$

where
$$A = v_c\cos(\theta_{v_c}) - v_n\cos(\theta_{v_n}), \quad B = x_c - x_n, \quad C = v_c\sin(\theta_{v_c}) - v_n\sin(\theta_{v_n}), \quad D = y_c - y_n \quad (16)$$

Here $R$ in (15) denotes the transmission range.
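The LET computation of (15)–(16) can be sketched as follows, under the reading that A and C capture the relative velocity components, B and D the relative positions, and R the radio range; the vehicle parameters in the example are invented.

```python
# Sketch of the link expiration time of Eqs. (15)-(16): vc, vn are the
# two speeds, theta_c/theta_n the heading angles in radians, (xc, yc)
# and (xn, yn) the positions, and rng the transmission range R.
import math

def let(vc, theta_c, xc, yc, vn, theta_n, xn, yn, rng):
    a = vc * math.cos(theta_c) - vn * math.cos(theta_n)
    b = xc - xn
    c = vc * math.sin(theta_c) - vn * math.sin(theta_n)
    d = yc - yn
    if a == 0 and c == 0:  # no relative motion: link never expires
        return 100.0
    num = -(a * b + c * d) + math.sqrt((a * a + c * c) * rng * rng - (a * d - b * c) ** 2)
    return min(100.0, num / (a * a + c * c))

# Two vehicles 50 m apart on the same lane, range 250 m; the front one
# is 5 m/s faster, so the gap reaches 250 m after (250 - 50) / 5 = 40 s.
print(let(15.0, 0.0, 0.0, 0.0, 20.0, 0.0, 50.0, 0.0, 250.0))  # -> 40.0
```

The cap at 100 keeps LET commensurate with the other reward terms in (12), so one very stable link cannot dominate the reward on its own.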
Upon forwarding a DATA packet originated by $s$ via neighbor $n$, $c$ updates the corresponding Q-table value as follows:

$$Q_c(s,n) = (1-\alpha_{c,n})\,Q_c(s,n) + \alpha_{c,n}\Big[R_{c,n} + \gamma_{c,n}\cdot \max_{y\in Nei(n)} Q_n(s,y)\Big] \quad (17)$$

Upon receiving a MAC-layer notification of packet loss toward neighbor $n$, $c$ updates the corresponding Q-table values for each destination $d_i$ as follows:

$$Q_c(d_i,n) = 0.5 \cdot Q_c(d_i,n) \quad (18)$$
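The HELLO-driven update (11) and the MAC-loss penalty (18) can be sketched together; the flat dictionary layout, the helper names and the sample values are illustrative assumptions.

```python
# Sketch of two Q-table maintenance triggers of Sect. 4.5: the
# HELLO-driven update of Eq. (11) and the MAC-loss penalty of Eq. (18).
# The flat dict layout and helper names are illustrative assumptions.
def hello_update(Q, alpha, gamma, reward, d, n, advertised_max):
    """Eq. (11): blend the old value with reward + discounted neighbor max."""
    Q[(d, n)] = (1 - alpha) * Q[(d, n)] + alpha * (reward + gamma * advertised_max)

def mac_loss_penalty(Q, n):
    """Eq. (18): halve every Q value routed through the lost neighbor n."""
    for (d, hop) in list(Q):
        if hop == n:
            Q[(d, hop)] *= 0.5

Q = {("D", "n1"): 80.0, ("D", "n2"): 60.0, ("E", "n1"): 40.0}
hello_update(Q, 0.9, 1.0, 100.0, "D", "n2", 90.0)  # fresh HELLO from n2
mac_loss_penalty(Q, "n1")                          # MAC reports n1's link lost
print(Q[("D", "n1")], Q[("E", "n1")], Q[("D", "n2")])
```

Halving rather than zeroing on a MAC loss lets a neighbor that recovers quickly regain preference after a few HELLO rounds, instead of being dropped outright.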
From (11), it should be noted that the link with the maximum LET is regarded as the most stable. To demonstrate the advantage of LET, Fig. 5 shows an intersection scenario in VANETs. As shown in Fig. 5, communication between vehicles $S$ and $D$ is possible through two optional routes: one via $A$ (Route 1: $S \to A \to B \to D$) and the other via $C$ (Route 2: $S \to C \to E \to D$). Since vehicle $A$ is moving farther and farther away from $S$, while vehicle $C$ continues straight alongside $S$, Route 1 is likely to be disconnected after a certain time due to the neighbor link break ($S' \to A'$). Consequently, the neighbor $C$ of $S$ is more suitable to be selected as the next hop on the path between vehicles $S$ and $D$.
5 Experiment Results
To conduct the performance evaluation of our proposed protocol ARPRL, we implement it
in network simulator QualNet 7.1 [36]. To compare the performance of ARPRL with that
of AODV [11], QLAODV [30], QROUTING [28] and GPSR [18], we also implement
three other routing protocols(QLAODV, QROUTING, GPSR). In the following section,
we give the performance metrics used to evaluate the routing protocols performance, the
simulation parameters and the analysis of the corresponding results.
5.1 Metrics
We assess the protocols' performance by varying the number of vehicles, the maximum velocity and the CBR data generation interval in a predefined fixed Manhattan scenario area. The performance metrics are the following:
Average Packet Delivery Ratio (APDR) This metric is defined as the ratio of the average number of packets successfully received by the destination vehicles to the average number of packets sent out by the source vehicles. It shows the ability to transfer application traffic data between source and destination.
Average End-to-End Delay (AEED) This metric is defined as the average time taken for packets to be successfully transmitted from their source to their destination. It indicates the timeliness with which the routing protocol delivers packets from source to destination.
Average Hops Count (AHC) This metric is defined as the average number of intermediate
nodes traversed by the successfully delivered packets between the source and the
destination. It indicates how heavily each delivered packet loads the network, since
every additional hop repeats the transmission.
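As a concrete illustration of how these three metrics are computed from simulation output (the function and field names below are our own, not QualNet's API):

```python
def compute_metrics(sent, received):
    """Compute the three evaluation metrics from a simulation trace.

    sent: total number of CBR packets sent by the source vehicles.
    received: one (delay_seconds, hop_count) tuple per packet that
    reached its destination. Returns (APDR, AEED, AHC).
    """
    apdr = len(received) / sent if sent else 0.0
    aeed = sum(d for d, _ in received) / len(received) if received else 0.0
    ahc = sum(h for _, h in received) / len(received) if received else 0.0
    return apdr, aeed, ahc

# Toy trace: 4 packets sent, 3 delivered with their delays and hop counts.
apdr, aeed, ahc = compute_metrics(4, [(0.10, 3), (0.20, 4), (0.30, 5)])
# apdr = 0.75, ahc = 4.0, aeed is roughly 0.2 s
```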
5.2 Simulation Setup
In the Manhattan simulation scenario, we use 5 horizontal and 5 vertical streets in a
2000 m × 2000 m field, which forms a layout of multiple 500 m × 500 m grids with 25
intersections. Each street is bidirectional with two lanes. The vehicles move according
to the traffic lights deployed at the intersections, with a 30 s yellow interval of the
signal. We use VanetMobiSim [37], a well-known and validated framework for
vehicle mobility modelling at both the macroscopic and microscopic levels, to generate the
movement of the vehicles. The first 1000 s of the VanetMobiSim output were discarded to obtain
more realistic vehicle movements. Since we focus on routing protocol
performance, and the IEEE 802.11p PHY/MAC modules standardized by IEEE
specifically for vehicular communication are not available in QualNet 7.1, we adopt IEEE
802.11a as the lower-layer protocol. Other parameters are the default settings of QualNet
7.1, except for those shown in Table 1.
Fig. 5 An intersection scenario considering LET in VANET
5.3 Simulation Results
5.3.1 Performance for Varying Number of Vehicles
The density of vehicle nodes in a VANET has a significant effect on protocol performance.
In this part, we first fix the maximum velocity of the vehicle nodes to 15 m/s
and the CBR packet interval to 1 s. The number of vehicles varies from 50 to 350 to
represent different vehicle densities. The results are described below.
Figure 6 shows the average packet delivery ratio (APDR) of each protocol with varying
number of vehicles. From Fig. 6, we can see that the APDR of all five protocols
Table 1 Simulation settings

  Parameter                  Value
  Simulator                  QualNet (v7.1)
  Simulation time            900 s
  Simulation area            2000 m × 2000 m
  Number of vehicles         50, 100, 150, 200, 250, 300, 350
  Minimal vehicle velocity   0 m/s
  Maximal vehicle velocity   1, 5, 10, 15, 20, 25, 30 m/s
  Transmission range         250 m
  Number of CBR flows        20 (randomly selected)
  CBR packet interval        0.1, 0.2, 0.6, 1, 2, 4, 6 s
  CBR packet size            512 bytes
  MAC protocol               IEEE 802.11a
  Channel frequency          5.885 GHz
  Channel data rate          6 Mbps
  Propagation model          Two-Ray
Fig. 6 Average packet delivery ratio versus number of vehicles
increases as the vehicle density increases when the number of nodes is less than 350 (for
QROUTING, less than 300). This is because the connectivity of the network improves as the
node density increases. However, the APDR decreases slightly with vehicle density
when the number of vehicles exceeds 300 (for QROUTING, 250). The
reason is that the higher the node density, the greater the possibility of channel collisions.
In general, ARPRL outperforms all of the other protocols because it considers
link reliability and vehicle mobility in the dynamic Q-learning process. GPSR
relies only on the locations of neighbour vehicles to select the next hop, which easily falls
into local optima; therefore, GPSR has the lowest APDR when the number of
vehicles is less than 300. QROUTING has the lowest APDR when the number of vehicles
is more than 300 because of routing loops and hence excessive collisions. QLAODV performs
better than AODV at low and medium vehicle densities by continuously learning the network
status through the broadcast of HELLO packets; at high vehicle density the result is reversed
because of QLAODV's high overhead there. ARPRL performs better than
QLAODV in all cases, because ARPRL improves on QLAODV through periodic
learning, on-demand routing probes and MAC-layer feedback. On average, ARPRL
improves the APDR by 23.4% and 22.6% compared with QLAODV and AODV, respectively.
Figure 7 shows the Average End-to-End Delay (AEED) of each protocol for the successfully
delivered CBR packets with varying number of vehicles. For AODV and
QLAODV, the AEED decreases as the number of vehicles increases from 50 to 350. This is
because the lower the vehicle density, the higher the probability of network partition, in
which case packets need to be stored for later forwarding and thus the AEED increases.
AODV shows the highest AEED because of the excessive route discoveries incurred by fast
vehicle movement. For QLAODV, slow convergence and the route loops introduced in the
learning process increase the AEED. ARPRL and QROUTING show AEED similar to
GPSR, which has the lowest AEED. This is because the proactive Q-table maintenance in
ARPRL and QROUTING can switch away from a sub-optimal route, while AODV and QLAODV
will not change to a better route until the current active one breaks. GPSR and QROUTING
perform better than ARPRL, with a lower AEED by 6.9 and 2.6 ms,
respectively, as a result of ARPRL's route probe mechanism, which introduces slight
additional delay. However, compared with QLAODV and AODV, ARPRL reduces the
AEED by 162.2 and 384.8 ms on average, respectively.
Figure 8 shows the Average Hops Count (AHC) of each protocol for the successfully
delivered CBR packets with varying number of vehicles. In most cases, the AHC
decreases with increasing vehicle density for all five protocols when the number of
vehicles is more than 50. This is because frequent network partitions result in more
route breaks and loops, and frequent topology changes also contribute to this effect.
For QROUTING and ARPRL, the average hop count increases as the number of vehicles
varies from 50 to 150, since more and more vehicles participate in forwarding packets.
When the number of vehicles varies from 150 to 350,
the average hop count decreases, because more and more vehicles congest at the intersections,
which helps to find shorter routes. In addition, AODV, QLAODV and ARPRL
adopt a route discovery strategy and accordingly have a smaller AHC than QROUTING
and GPSR in most cases. More importantly, ARPRL shows significantly fewer hops
than AODV and QLAODV at high vehicle density, due to the routing probe and the
handling of MAC-layer packet loss notifications. Compared with QLAODV and AODV, ARPRL
reduces the AHC by 3.58 and 4.44 hops on average, respectively.
5.3.2 Performance for Varying Maximum Velocity
In this part, we evaluate the performance of each protocol by varying the vehicle maximum
velocity from 1 to 30 m/s, while the number of vehicles and the CBR packet interval are
fixed to 200 and 1 s, respectively. The results are described below.
Fig. 7 Average end-to-end delay versus number of vehicles
Fig. 8 Average hops count versus number of vehicles
Figure 9 shows the average packet delivery ratio (APDR) of each protocol with varying
maximum allowable velocity. From Fig. 9, it can be seen that the APDR of all
five protocols decreases as the maximum vehicle velocity varies from 1 to 30 m/s. This
is because increasing vehicle velocity causes more frequent topology changes and
network partitions, in which more packets are dropped. As the velocity varies from 25 to 30 m/s, the packet delivery ratio of the five
protocols tends to increase. The reason is that the packet carry time decreases in this range,
so fewer packets are dropped due to timeout. ARPRL not only considers the number of hops, as AODV does, but
also overcomes the slow convergence and routing loops of QROUTING. In addition, the LET
is considered in the learning process, which further enhances route reliability. Thus, ARPRL
performs better than the other four protocols. On average, ARPRL increases the APDR by
20.3% and 24.8% compared with QLAODV and AODV, respectively.
Figure 10 shows the Average End-to-End Delay (AEED) of each protocol for the successfully
delivered CBR packets with varying maximum allowable velocity. Figure 10
indicates that the AEED of all five protocols increases as the maximum vehicle velocity
varies from 5 to 30 m/s. This is because high mobility leads to rapid changes in network
topology, which increases the probability of selecting a sub-optimal routing path
and hence the delay. High mobility also aggravates network partitions,
which incurs packet carrying and increases the delay. When the maximum vehicle velocity
varies from 25 to 30 m/s, the duration of network partitions becomes shorter; thus the
packet carry time introduced by partitions is reduced and the AEED also tends to
decrease for all five protocols. GPSR and QROUTING perform better than ARPRL,
with a lower AEED by 9.4 and 1.8 ms, respectively, due to ARPRL's route probe
mechanism. However, on average, ARPRL reduces the AEED by 112.3 and 284.6 ms
compared with QLAODV and AODV, respectively.
Figure 11 shows the Average Hops Count (AHC) of each protocol for the successfully
delivered CBR packets with varying maximum allowable velocity. The result shows
that the AHC increases as the maximum vehicle velocity varies from 1 to 10 m/s, because
the frequency of route breaks increases with velocity. Since ARPRL
Fig. 9 Average packet delivery ratio versus maximum allowable velocity
considers the number of hops and the link expiration time, it performs better than QLAODV and
AODV. As the maximum vehicle velocity varies from 15 to 30 m/s, the average hop count
decreases slightly for the five protocols. The reason is that high velocity improves network
connectivity and reduces the probability of network partition, resulting in shorter
route paths. On average, ARPRL reduces the AHC by 3.3 and 3.9 hops compared
with QLAODV and AODV, respectively.
Fig. 10 Average end-to-end delay versus maximum allowable velocity
Fig. 11 Average hops count versus maximum allowable velocity
5.3.3 Performance for Varying Data Generation Interval
After analyzing the effect of vehicle velocity on protocol performance, in this part we
evaluate each protocol by varying the data generation interval from 0.1 to 6 s, while the
maximum allowable velocity and the number of vehicles are fixed to 15 m/s and 200,
respectively. The results are described below.
In Fig. 12, we evaluate the Average Packet Delivery Ratio (APDR) of each protocol
with varying data generation interval. As shown in Fig. 12, the APDR of ARPRL,
QLAODV and AODV decreases as the Packet Interval (PI) varies from 0.1 to 6 s. This is
because a larger PI reduces the frequency of route discovery, so more packets are
dropped on invalid route paths. For QROUTING and GPSR, the
APDR remains approximately constant in all configurations, mainly because in these
protocols the routing path is maintained only through periodic HELLO packets.
Figure 12 also shows that ARPRL achieves the highest APDR for all values of PI. This
can be explained by the fact that ARPRL combines the advantages of proactive route
learning through the distributed Q-Learning algorithm with a reactive route probe mechanism.
On average, ARPRL delivers 19.0% and 24.0% more packets than QLAODV and AODV,
respectively.
Figure 13 shows the Average End-to-End Delay (AEED) of each protocol for the successfully
delivered CBR packets with varying data generation interval. As shown in
Fig. 13, ARPRL achieves a much lower AEED than QLAODV and AODV in all configurations
of the Packet Interval (PI). This is because QLAODV and AODV adopt a route discovery
mechanism that introduces longer delays, whereas in ARPRL route discovery is
triggered much less often thanks to the periodic route learning, so the AEED is further
reduced. GPSR and QROUTING, on average, perform better than ARPRL, with a lower
AEED by 1.5 and 7.8 ms, respectively, due to ARPRL's route probe mechanism. However,
ARPRL reduces the AEED by 216.2 and 503.4 ms on average, compared with QLAODV
and AODV, respectively.
Fig. 12 Average packet delivery ratio versus data generation interval
Figure 14 shows the Average Hops Count (AHC) of each protocol for the successfully
delivered CBR packets with varying data generation interval. The AHC of AODV and
QLAODV is much higher than that of the other three protocols as the Packet Interval (PI) varies from
0.1 to 6 s; the higher the CBR data rate, the larger the difference. This is expected,
because in both AODV and QLAODV a better route will not be discovered
until the current active one breaks. For ARPRL, QROUTING and GPSR,
the AHC stays approximately constant in all cases, with GPSR having the lowest AHC among
them. This is because GPSR adopts a greedy forwarding strategy which always progressively
Fig. 13 Average end-to-end delay versus data generation interval
Fig. 14 Average hops count versus data generation interval
forwards packets toward the destination. Although GPSR has the
minimal AHC, this comes at the high cost of packet loss due to the local optima
caused by greedy forwarding, as evidenced by Fig. 12. In addition, ARPRL has a
lower AHC than QROUTING because of its route probe mechanism. On
average, ARPRL reduces the AHC by 0.11, 3.4 and 4.3 hops compared with
QROUTING, QLAODV and AODV, respectively.
6 Analysis of ARPRL
In this part, the Average Routing Overhead (ARO) is first evaluated and compared with
that of some related existing protocols. The ARO is defined as the ratio of the average number of
bytes of non-data packets broadcast by vehicles for routing maintenance to the average
number of bytes of data packets received by the destinations. This metric reflects the extra
communication overhead introduced by the routing protocols. In addition, the complexity
of ARPRL is analyzed.
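The ARO definition reduces to a simple byte ratio; a minimal sketch (the names are ours):

```python
def average_routing_overhead(control_bytes, data_bytes_delivered):
    """ARO: bytes of non-data routing traffic broadcast per byte of
    application data received at the destinations (dimensionless)."""
    if data_bytes_delivered == 0:
        return float("inf")   # nothing delivered: overhead is unbounded
    return control_bytes / data_bytes_delivered

# e.g. 1.2 MB of HELLO/LPREQ control traffic to deliver 4 MB of CBR data:
# average_routing_overhead(1.2e6, 4.0e6) -> 0.3
```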
6.1 Routing Overhead Analysis
In ARPRL, the non-data packets fall into two categories: (1) periodic proactive HELLO
packets; (2) on-demand reactive Learning Probe REQuest/REPly (LPREQ and LPREP)
packets. Periodic HELLO packets are the main source of the Routing Overhead (RO)
introduced by ARPRL; however, they are essential to ARPRL's real-time sensing of
network changes. The significant difference in RO between ARPRL and the other
protocols under consideration (except for QROUTING) is the dynamic, variable-length part
of ARPRL's HELLO packet, which is used to exchange Q-table information between
neighbours.
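To illustrate why this variable-length part matters for overhead, a rough size model can be written as follows; the byte counts are illustrative placeholders, not ARPRL's actual wire format:

```python
def hello_packet_bytes(fixed_header=20, q_entries=0, bytes_per_entry=10):
    """Rough size model for an ARPRL-style HELLO packet: a fixed part
    plus a dynamic part carrying one Q-table entry per known destination.
    All sizes here are illustrative assumptions."""
    return fixed_header + q_entries * bytes_per_entry

# The dynamic part grows with the number of known destinations, which is
# why ARPRL's and QROUTING's overhead rises with the number of vehicles:
sizes = [hello_packet_bytes(q_entries=n) for n in (50, 150, 350)]
# sizes -> [520, 1520, 3520]
```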
Figure 15a presents the ARO of each protocol with varying number of vehicles. As
shown in Fig. 15a, the ARO of AODV, QLAODV and GPSR remains almost constant for all
configurations of the number of vehicles, because the ARO mainly depends on the
average number of bytes of broadcast control packets. For AODV and QLAODV, this is determined
by the number of RREQ packets broadcast in the route discovery process, which is why
the ARO of AODV and QLAODV increases only slightly with the number of
vehicles when the CBR data rate is fixed at one packet per second. For GPSR, fixed-length
periodic HELLO packets for neighbour position maintenance are the main source of RO;
therefore, its ARO stays constant as the number of vehicles increases. For
QROUTING, the ARO is determined by the variable-length periodic broadcast of
HELLO packets for route learning, which grows linearly with the number
of vehicles. For ARPRL, the LPREQ packets also contribute part of the ARO besides the
HELLO packets, which are the same as those of QROUTING. This is why the ARO of
ARPRL and QROUTING increases linearly with the number of vehicles,
and ARPRL has slightly more ARO than QROUTING at high vehicle density.
In Fig. 15b, it can be observed that the ARO of AODV and QLAODV increases with
the maximum allowable vehicle velocity. This is expected, since higher
vehicle velocity causes more frequent topology changes and hence more route
discoveries. In contrast, GPSR's ARO remains constant in all cases and
is the lowest of the five protocols, because the length and number of periodic
HELLO packets in GPSR are independent of topology changes. In general,
ARPRL and QROUTING have approximately the same ARO at low and medium velocities.
At high velocity, ARPRL has slightly more ARO than QROUTING due to the increased
number of LPREQ broadcasts.
Figure 15c shows that the ARO of AODV and QLAODV decreases as the packet interval
increases, since the number of route discoveries is proportional to the number of
packets to transmit. Meanwhile, ARPRL, QROUTING and GPSR have an almost constant
ARO in all cases. This can be explained by the fact that periodic HELLO packets are
independent of the data generation rate. At a high data generation rate, ARPRL has slightly more
ARO than QROUTING due to its learning probe mechanism.
In summary, ARPRL shows higher routing overhead than the other four protocols, as
shown in Fig. 15. This is expected, since the combination of the proactive route learning
algorithm and the reactive route probe strategy inevitably incurs higher overhead but
improves overall performance. Efficiently reducing the ARO of ARPRL further will
be considered in our future work.
6.2 Complexity Analysis
The above experimental results show that ARPRL is more suitable for the
VANET environment, offering a higher data delivery success rate, lower delay and fewer routing
hops, since a variety of optimizing strategies are adopted on top of AODV and QLAODV.
Fig. 15 Average routing overhead versus: a number of vehicles, b maximum velocity, c packet interval
However, it is also necessary to analyze the time and space complexity of ARPRL when applying
it to a VANET. The time complexity of ARPRL depends mainly on the maintenance
of the Q table, which consists of three parts, each requiring O(1) time. For a
network with N vehicle nodes, the time complexity of ARPRL is therefore O(N). The
space complexity of ARPRL depends mainly on the memory required to store the Q
table. In a network with N vehicle nodes, the space complexity of ARPRL is
O(N³) in the worst case, which is higher than that of the other four protocols. Fortunately,
for VANETs this is acceptable, because each vehicle can be equipped with a computing
device with sufficiently high processing capability.
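The space bound can be made concrete with a sketch of the per-vehicle Q-table structure (illustrative only; the paper's actual table also carries mobility-related fields):

```python
from collections import defaultdict

class QTable:
    """Per-vehicle Q table: q[dest][next_hop] -> Q value.

    With up to N possible destinations and up to N neighbours, a single
    node stores O(N^2) entries in the worst case, so the N nodes of the
    network together need O(N^3) space.
    """

    def __init__(self):
        self.q = defaultdict(dict)

    def update(self, dest, next_hop, value):
        self.q[dest][next_hop] = value

    def best_next_hop(self, dest):
        # Greedy action selection used when forwarding a data packet.
        hops = self.q.get(dest)
        return max(hops, key=hops.get) if hops else None

t = QTable()
t.update("D", "A", 0.4)
t.update("D", "C", 0.7)   # C keeps pace with S, so its link lasts longer
# t.best_next_hop("D") -> "C"
```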
7 Conclusion
In this paper, we proposed ARPRL, a reinforcement learning based heuristic routing
protocol for VANETs. ARPRL employs Q-Learning to dynamically learn stable and
reliable routes through a variety of strategies for updating the Q table maintained by
each vehicle node. Periodic exchange of HELLO messages between neighbour vehicles,
the forwarding of DATA packets and a MAC-layer feedback mechanism are used to assist in
updating the Q table. To speed up the convergence of the learning process, LPREQ and
LPREP messages are used at the beginning of learning. We also designed the
structure of the HELLO message for exchanging the optimal part of the Q-table contents
and avoiding route loops to some extent. More importantly, we proposed a novel
Q-value update function that takes the distinguishing features of VANETs into
consideration. ARPRL forwards data packets according to the Q table, which is updated
through the Q-learning algorithm and takes the number of hops, vehicle mobility and link
expiration time into account; thus it performs better and is more suitable for packet-loss-
and delay-sensitive applications.
Acknowledgements We would like to thank the editors and the anonymous reviewers for their helpful comments and suggestions. This work is supported by the National Natural Science Foundation of China (Grant No. 61472305), the Aeronautical Science Foundation of China (Grant No. 20151981009) and the Science Research Program, Xi'an, China (Grant No. 2017073CG/RC036(XDKD003)).
References
1. Campolo, C., Molinaro, A., & Scopigno, R. (2015). Vehicular ad hoc networks: Standards, solutions and research. Berlin: Springer.
2. Li, F., & Wang, Y. (2007). Routing in vehicular ad hoc networks: A survey. IEEE Vehicular Technology Magazine, 2(2), 12–22.
3. Lin, Y.-W., Chen, Y.-S., & Lee, S.-L. (2010). Routing protocols in vehicular ad hoc networks: A survey and future perspectives. Journal of Information Science and Engineering, 26, 913–932.
4. Chen, W., Guha, R. K., Kwon, T. J., Lee, J., & Hsu, Y.-Y. (2011). A survey and challenges in routing and data dissemination in vehicular ad hoc networks. Wireless Communications and Mobile Computing, 11(7), 787–795.
5. Zeadally, S., Hunt, R., Chen, Y.-S., Irwin, A., & Hassan, A. (2012). Vehicular ad hoc networks (VANETs): Status, results and challenges. Telecommunication Systems, 50(4), 217–241.
6. Sharef, B. T., Alsaqour, R. A., & Ismail, M. (2014). Vehicular communication ad hoc routing protocols: A survey. Journal of Network and Computer Applications, 40, 363–396.
7. Sutton, R. S., & Barto, A. G. (2011). Reinforcement learning: An introduction (Vol. 1). Cambridge: Cambridge Univ Press.
8. Kiumarsi, B., Lewis, F. L., Modares, H., Karimpour, A., & Naghibi-Sistani, M.-B. (2014). Reinforcement Q-learning for optimal tracking control of linear discrete-time systems with unknown dynamics. Automatica, 50(4), 1167–1175.
9. Clausen, T., & Jacquet, P. (2003). Optimized link state routing (OLSR). IETF Networking Group, RFC, 3626, 1–75.
10. Alslaim, M. N., Alaqel, H. A., & Zaghloul, S. S. (2014). A comparative study of MANET routing protocols. In 2014 third international conference on technologies and networks for development (ICeND) (pp. 178–182). IEEE.
11. Perkins, C., Belding-Royer, E., Das, S., et al. (2003). Ad-hoc on-demand distance vector (AODV) routing. IETF Networking Group, RFC, 3561, 1–38.
12. Beijar, N. (2002). Zone routing protocol (ZRP). Networking Laboratory, Helsinki University of Technology, Finland, 9, 1–12.
13. Fonseca, A., & Vazao, T. (2013). Applicability of position-based routing for VANET in highways and urban environment. Journal of Network and Computer Applications, 36(3), 961–973.
14. Kumar, S., & Verma, A. K. (2015). Position based routing protocols in VANET: A survey. Wireless Personal Communications, 83(4), 2747–2772.
15. Liu, J., Wan, J., Wang, Q., Deng, P., Zhou, K., & Qiao, Y. (2016). A survey on position-based routing for vehicular ad hoc networks. Telecommunication Systems, 62(1), 15–30.
16. Goel, N., Sharma, G., & Dhyani, I. (2016). A study of position based VANET routing protocols. In 2016 international conference on computing, communication and automation (ICCCA) (pp. 655–660). IEEE.
17. Mase, K. (2016). A survey of geographic routing protocols for vehicular ad hoc networks as a sensing platform. IEICE Transactions on Communications, 99(9), 1938–1948.
18. Karp, B., & Kung, H.-T. (2000). GPSR: Greedy perimeter stateless routing for wireless networks. In Proceedings of the 6th annual international conference on mobile computing and networking (pp. 243–254). ACM.
19. Sood, M., & Kanwar, S. (2014). Clustering in MANET and VANET: A survey. In 2014 international conference on circuits, systems, communication and information technology applications (CSCITA) (pp. 375–380). IEEE.
20. Yang, P., Wang, J., Zhang, Y., Tang, Z., & Song, S. (2015). Clustering algorithm in VANETs: A survey. In 2015 IEEE 9th international conference on anti-counterfeiting, security, and identification (ASID) (pp. 166–170). IEEE.
21. Cooper, C., Franklin, D., Ros, M., Safaei, F., & Abolhasan, M. (2016). A comparative survey of VANET clustering techniques. IEEE Communications Surveys & Tutorials, 19(1), 657–681.
22. Sucasas, V., Radwan, A., Marques, H., Rodriguez, J., Vahid, S., & Tafazolli, R. (2016). A survey on clustering techniques for cooperative wireless networks. Ad Hoc Networks, 47, 53–81.
23. Anupama, M., & Sathyanarayana, B. (2011). Survey of cluster based routing protocols in mobile ad-hoc networks. International Journal of Computer Theory and Engineering, 3(6), 806.
24. Lin, C. R., & Gerla, M. (1997). Adaptive clustering for mobile wireless networks. IEEE Journal on Selected Areas in Communications, 15(7), 1265–1275.
25. Chatterjee, M., Das, S. K., & Turgut, D. (2002). WCA: A weighted clustering algorithm for mobile ad hoc networks. Cluster Computing, 5(2), 193–204.
26. Jaiswal, S., & Adane, D. D. S. (2013). Hybrid approach for routing in vehicular ad-hoc network (VANET) using clustering approach. International Journal of Innovative Research in Computer and Communication Engineering, 1(5), 1211–1219.
27. Kakkasageri, M. S., & Manvi, S. S. (2014). Connectivity and mobility aware dynamic clustering in VANETs. International Journal of Future Computer and Communication, 3(1), 5.
28. Boyan, J. A., & Littman, M. L. (1994). Packet routing in dynamically changing networks: A reinforcement learning approach. In Advances in neural information processing systems (pp. 671–678).
29. Dowling, J., Curran, E., Cunningham, R., & Cahill, V. (2005). Using feedback in collaborative reinforcement learning to adaptively optimize MANET routing. IEEE Transactions on Systems, Man, and Cybernetics-Part A: Systems and Humans, 35(3), 360–372.
30. Wu, C., Kumekawa, K., & Kato, T. (2010). Distributed reinforcement learning approach for vehicular ad hoc networks. IEICE Transactions on Communications, 93(6), 1431–1442.
31. Plate, R., & Wakayama, C. (2015). Utilizing kinematics and selective sweeping in reinforcement learning-based routing algorithms for underwater networks. Ad Hoc Networks, 34, 105–120.
32. Santhi, G., Nachiappan, A., Ibrahime, M. Z., Raghunadhane, R., & Favas, M. K. (2011). Q-learning based adaptive QoS routing protocol for MANETs. In 2011 international conference on recent trends in information technology (ICRTIT) (pp. 1233–1238). IEEE.
33. Royer, E. M., & Perkins, C. E. (2000). Multicast ad hoc on-demand distance vector (MAODV) routing. IETF Draft, 1, 10–25.
34. Puterman, M. L. (2014). Markov decision processes: Discrete stochastic dynamic programming. Hoboken: Wiley.
35. Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning: An introduction (Vol. 1). Cambridge: MIT Press.
36. SNT. (2014). QualNet 7.1. http://web.scalable-networks.com.
37. Harri, J., Fiore, M., Filali, F., & Bonnet, C. (2011). Vehicular mobility simulation with VanetMobiSim. Simulation, 87(4), 275–300.
Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Jinqiao Wu received the MS degree in 2014 from the Xi'an University of Post & Telecommunications, Xi'an, China. He is currently a Ph.D. candidate in computer science at Xidian University, Xi'an, China. His research interests include machine learning, networking architectures, and routing protocols.
Min Fang received her B.S. degree in computer control, M.S. degree in computer software engineering and Ph.D. degree in computer application from Xidian University, Xi'an, China, in 1986, 1991 and 2004, respectively, where she is currently a professor. Her research interests include intelligent information processing, multi-agent systems and network technology.
Xiao Li received the BS degree from Xi'an University of Finance and Economics, Xi'an, China, in 2012. She is currently a Ph.D. candidate in computer science at Xidian University, Xi'an, China. Her research interests include pattern recognition, machine learning and computer vision.