
Reinforcement Learning Based Mobility Adaptive Routing for Vehicular Ad-Hoc Networks

Jinqiao Wu1 · Min Fang1 · Xiao Li1

Published online: 7 May 2018. © Springer Science+Business Media, LLC, part of Springer Nature 2018

Abstract Vehicular ad-hoc networks (VANETs) are drawing more and more attention in intelligent transportation systems as a means to reduce road accidents and assist safe driving. However, due to the high mobility and uneven distribution of vehicles in VANETs, multi-hop communication between vehicles is still particularly challenging. Considering the distinctive characteristics of VANETs, this paper proposes an adaptive routing protocol based on reinforcement learning (ARPRL). Through a distributed Q-Learning algorithm, ARPRL proactively learns fresh network link status from periodic HELLO packets in the form of Q table updates, which improves its dynamic adaptability to network changes. Novel Q value update functions that take vehicle mobility related information into account are designed to reinforce the Q values of wireless links through the exchange of HELLO packets between neighboring vehicles. To avoid routing loops arising in the Q-learning process, the HELLO packet structure is redesigned. In addition, a reactive route probe strategy is applied during learning to speed up the convergence of Q-learning. Finally, feedback from the MAC layer is used to further improve the adaptation of Q-learning to the VANET environment. Simulation results show that ARPRL performs better than existing protocols in terms of average packet delivery ratio, end-to-end delay and number of route hops, while network overhead remains within an acceptable range.

Keywords VANET · Adaptive routing · Reinforcement learning · Q-Learning

Min Fang (corresponding author) [email protected]

Jinqiao Wu [email protected]

Xiao Li [email protected]

1 School of Computer Science and Technology, Xidian University, No. 2, South Taibai Street, Xi'an 710071, Shaanxi, People's Republic of China

Wireless Pers Commun (2018) 101:2143–2171. https://doi.org/10.1007/s11277-018-5809-z

1 Introduction

Vehicular ad-hoc networks (VANETs) [1] are a specific type of mobile ad hoc network (MANET) aimed at providing vehicle-to-vehicle (V2V) and vehicle-to-infrastructure (V2I) communications in order to reduce traffic congestion and avoid road accidents. For V2V communications, vehicles usually communicate with each other in the absence of Road-Side Units (RSUs); hence multi-hop data transmission in VANETs is still a quite challenging issue, and an efficient routing algorithm adapted to the VANET environment is necessary.

Routing in VANETs [2–6] is considered one of the most important processes, as it allows special applications designed for VANET users to provide them with specific services. Routing in VANETs is the selection of optimal paths from the source vehicle to the destination vehicle through a set of intermediate vehicles. In order to support reliable and real-time transmission of messages in cases of emergency, the routing protocol should forward data packets with high reliability and low delay. However, due to the frequently changing topology caused by the high mobility of vehicles, the existing traditional routing protocols become increasingly unreliable or even fail, which results in delayed delivery or even loss of data packets. Therefore, it is imperative to design robust routing protocols for VANETs which guarantee highly efficient communication between vehicle nodes.

Considering the distinct characteristics of VANETs, such as high-speed movement of vehicle nodes, fast network topology changes, uneven distribution of vehicles and short link connection duration, the traditional dynamic adaptive routing protocols designed for MANETs are no longer suitable for the VANET network environment. Without significantly increasing the routing control information exchanged between vehicles, it is difficult for the vehicle nodes in VANETs to perceive the continuous change of the network topology in time and to make reasonable, optimal routing decisions and self-configuration via existing routing strategies. Reinforcement learning is one way to solve this problem. Designing a new, efficient and effective routing protocol for VANETs based on reinforcement learning, which takes the vehicles' mobility related information into account, can enhance the adaptability of the routing protocol to VANETs.

Reinforcement learning [7] is increasingly applied to deal with dynamic routing problems. The Q-Learning algorithm [8] is one of the most commonly used forms of reinforcement learning; it can achieve optimal decisions through continuous interaction with the environment without having to know the environment model in advance. Through periodic exploration of the environment, the agents eventually obtain the optimal mapping from environment states to the actions available in those states. For the dynamic routing problem in VANETs, the whole VANET can be regarded as the environment and each vehicle can be modeled as an agent. The packet forwarding process in which each vehicle participates can be considered as the interaction between the vehicle nodes and their network. Each packet forwarding, whether of a control packet or a user data packet, yields fresh knowledge of the newest network state.

In this paper, in order to adapt to the rapid mobility of vehicle nodes in VANETs, we propose ARPRL, an adaptive routing protocol based on reinforcement learning. ARPRL takes position related information, such as vehicle position, relative velocity and direction, into account to learn the optimal path between the source and destination vehicles. To respond quickly to rapid topology changes, each vehicle continuously

updates its Q table by periodically sending and receiving control messages. This fundamental aspect is referred to as learning through control messages. The most common form of such control packets in the existing literature is the HELLO message used for neighbor maintenance in most proactive routing protocols, such as optimized link-state routing (OLSR) [9]. To accurately detect path breaks during an ongoing data transmission, another auxiliary learning aspect, referred to as learning through DATA packets, is extremely crucial for efficient data packet routing in highly dynamic environments such as VANETs. The last aspect, known as learning through feedback signals, is also considered for reliable data dissemination in VANETs; the feedback signal is mainly provided by the link layer, such as the IEEE 802.11 MAC.

The main contributions of this paper are as follows:

1. A novel mobility adaptive routing protocol suitable for the VANET environment, based on a distributed Q-Learning algorithm, is proposed, in which each vehicle proactively learns the network status through Q-Learning to further improve the dynamic adaptability of the protocol. To enhance the efficiency of Q-Learning, the periodic broadcast of the redesigned HELLO packet, the forwarding of user DATA packets and MAC layer packet loss notifications are used as trigger sources for Q table updates.

2. A route learning probe approach is adopted to speed up the convergence of Q-learning, and accordingly the routing delay is reduced. The packet forwarding process also contributes to the update of the Q table, which further improves the dynamic adaptability of the proposed protocol.

3. A new HELLO packet structure is designed to avoid the generation of routing loops in the learning process. Consequently, route hop counts are optimized and the overall performance is improved.

The remainder of the paper is organized as follows. Section 2 reviews the related state of the art. The reinforcement learning and Q-Learning model for the routing problem is introduced in Sect. 3. Section 4 gives an elaborated description of the proposed protocol, ARPRL. Simulation results are presented in Sect. 5. Section 6 analyzes the overhead and complexity of the protocol, which is followed by conclusions in Sect. 7.

2 Related Work

In VANETs, each vehicle moves along the roads. As a result, the V2V communications are

highly susceptible to frequent link breaks. To solve this problem, various routing protocols

have been proposed in recent decades.

The most intuitive way to resolve the routing problem in VANETs is to apply the existing routing protocols designed for mobile ad hoc networks (MANETs) [10]. In MANETs, routing protocols can be classified into two main categories according to the route discovery criterion. The first is topology-based routing, which can be further subdivided into proactive and reactive routing; this type is represented by OLSR and the ad hoc on-demand distance vector (AODV) protocol [11]. In OLSR, HELLO messages need to be sent periodically to detect the joining and leaving of neighboring nodes, and routing information also needs to be exchanged periodically between neighbor nodes to obtain the global network topology. More importantly, regardless of whether a node needs to send data, each node maintains routing path information for every other node in the network. In AODV, routing information is updated on demand. However, AODV needs to flood RREQ messages throughout the entire network to search for a new routing path once the old one is interrupted, which causes considerable routing overhead in VANETs. In addition, AODV will not switch to a suboptimal path or preemptively reestablish a new one until the current active route becomes unavailable. The second category is hybrid ad-hoc routing, represented by the Zone Routing Protocol (ZRP) [12]. Unfortunately, for the same reasons as AODV and OLSR above, ZRP is also not suitable for routing packets in VANETs.

Some other routing protocols, often referred to as geographic routing [13–17], rely on the location information of neighbor nodes for packet forwarding, e.g. Greedy Perimeter Stateless Routing (GPSR) [18], which does not need to establish a routing table for packet forwarding. Unlike topology-based routing, GPSR always forwards the packet toward the neighbor closest to the destination node and does not need to send any routing control packets. However, GPSR relies on accurate node location information and is also prone to generating route loops caused by high node mobility.

Some cluster-based routing protocols have also been proposed to address VANET routing problems [19–22]. Most clustering routing algorithms designed for VANETs originate from MANETs. However, the clustering-based routing protocols [23] applicable to MANETs may not satisfy the dynamic characteristics of VANETs. LID (Lowest ID clustering algorithm) [24] is a simple clustering algorithm proposed by Gerla and Tsai, in which each node is assigned a unique identifier (ID) across the entire network and the node with the smallest ID is preferred as the cluster head. The disadvantage of LID is that the cluster head node may become a system performance bottleneck if it serves in the cluster head role for too long. The Distributed Clustering Algorithm (DCA) [25] selects the cluster head based on node weights, where the weight may be a function of the node transmission range or a node mobility factor. A hybrid clustering routing approach [26] for VANETs has been proposed to achieve dynamic routing on the basis of vehicle ID, vehicle location ID and vehicle lifetime. Another dynamic clustering routing scheme [27] for VANETs is based on vehicle connectivity degree and mobility metrics; it considers the vehicles on a specific lane between two junctions to form dynamic and stable clusters.

In recent years, reinforcement learning has been increasingly applied to dynamic routing. Boyan and Littman proposed the QRouting algorithm [28] for an irregularly connected wired network composed of 36 nodes. Dowling et al. [29] proposed SAMPLE, a routing protocol for MANETs based on collaborative reinforcement learning. Unfortunately, SAMPLE does not take the frequent link breaks in MANETs into account; moreover, similar to DSR, the routing information to be broadcast is added to the data packet header, so it is not suitable for applications with heavy data traffic. Based on the existing AODV, Celimuge Wu et al. proposed an improved Q-learning routing protocol, QLAODV [30] (Q-Learning AODV), to efficiently deal with routing in highly dynamic networks such as MANETs. To address the slow convergence of Q-learning algorithms, Plate et al. presented a Q-learning based routing approach, QKS [31] (Q-learning utilizing Kinematics and Sweeping), which combines kinematic and sweeping features for underwater networks. Santhi et al. proposed a MANET multicast routing protocol, QLMAODV [32] (Q-Learning MAODV [33]), by applying the Q-learning algorithm to the existing MAODV protocol. Based on distributed Q-learning, QLMAODV learns the network status information and improves the performance of MAODV by preemptively choosing a sub-optimal route before the current active route becomes invalid. However, QLMAODV is designed for MANETs and solves the multicast routing problem.

3 Reinforcement Learning

3.1 Markov Decision Process

Reinforcement learning is an efficient approach to solving sequential decision tasks, which can be represented as a Markov Decision Process (MDP) [34]. Generally, an MDP contains the following: (a) a set of discrete environment states $S$; (b) a set of discrete actions $A$ available to the agent in a specific state $s$; (c) an environment model $T(s,a,s')$ ($s,s'\in S$ and $a\in A$); and (d) a reward function $R(s,a,s')$. A policy $\pi(s)$ gives the selection of an action in state $s$. An MDP searches for the optimal policy $\pi^*(s)$, which maximizes the expected sum of rewards, i.e. the accumulated discounted rewards from the initial state. Let $V^\pi(s)$ denote the value function of state $s\in S$, which can be formulated as:

$$V^\pi(s) = E_\pi\{R_t \mid s_t = s\} = E_\pi\left\{\sum_{k=0}^{\infty}\gamma^k r_{t+k} \,\middle|\, s_t = s\right\} = E_\pi\{r_t + \gamma r_{t+1} + \cdots\} \qquad (1)$$

where $r_t$, $E_\pi\{r_t\}$ and $E_\pi\{r_{t+1}\}$ [see (4)] are defined as:

$$r_t = r(s,a)\big|_{s=s_t,\,a=a_t} = \sum_{s_{t+1}\in S} P^{a_t}_{s_t s_{t+1}}\, R^{a_t}_{s_t s_{t+1}} \qquad (2)$$

$$E_\pi\{r_t\} = E_\pi\{r(s_t,a_t)\} = \sum_{a_t\in A}\pi(s_t,a_t)\, r_t \qquad (3)$$

$$E_\pi\{r_{t+1}\} = E_\pi\{r(s_{t+1},a_{t+1})\} = \left.\left\{\sum_{a_t\in A}\pi(s_t,a_t)\left[\sum_{s_{t+1}\in S} P^{a_t}_{s_t s_{t+1}}\sum_{a_{t+1}\in A}\pi(s_{t+1},a_{t+1})\, r_{t+1}\right]\right\}\right|_{s_t=s} \qquad (4)$$

Here $P^{a_t}_{s_t s_{t+1}}$ satisfies

$$\sum_{s'\in S} P^{a}_{ss'} = 1, \quad \text{in which} \quad P^{a}_{ss'} = \Pr\{s_{t+1}=s' \mid s_t=s,\, a_t=a\}$$

In (4), $a_t$ denotes an action executed in state $s_t$ at time $t$; $P^{a_t}_{s_t s_{t+1}}$ denotes the probability of transitioning to the next state $s_{t+1}$ from the current state $s_t$ when taking action $a_t$; and $R^{a_t}_{s_t s_{t+1}}$ is the immediate reward for taking action $a_t$ in state $s_t$ and arriving at state $s_{t+1}$.

From Bellman’s optimality rule in Dynamic Programming, once the state transition and

value function are known, the optimal solution can be obtained. Hence, substituting (4) into

(1), we can obtain the optimal value function in the form of Bellman equations as follows:

$$\begin{aligned} V^*(s) &= \max_\pi V^\pi(s)\\ &= \max_\pi E_\pi\left\{\sum_{k=0}^{\infty}\gamma^k r_{t+k} \,\middle|\, s_t=s\right\}\\ &= \max_\pi E_\pi\left\{r_t + \sum_{k=1}^{\infty}\gamma^k r_{t+k} \,\middle|\, s_t=s\right\}\\ &= \max_a\left[r_t + \gamma\sum_{s_{t+1}\in S} P^{a}_{s_t s_{t+1}}\, V^\pi(s_{t+1})\right]_{s_t=s}\end{aligned} \qquad (5)$$

where $V^*(s)$ denotes the optimal value in state $s$ and $\pi$ is a mapping from the state set $S$ to the action set $A$.

3.2 Q-Learning

Based on (5), the value function can also be defined as a function of state-action pairs that estimates the quality of performing an action $a$ in a state $s$; it is named the Q function and denoted $Q(s,a)$. Then (5) can be revised as follows when applying a policy $\pi$:

$$\begin{aligned} Q^\pi(s,a) &= E_\pi\{R_t \mid s_t=s,\, a_t=a\}\\ &= E_\pi\left\{\sum_{k=0}^{\infty}\gamma^k r_{t+k} \,\middle|\, s_t=s,\, a_t=a\right\}\\ &= E_\pi\left\{r_t + \sum_{k=1}^{\infty}\gamma^k r_{t+k} \,\middle|\, s_t=s,\, a_t=a\right\}\\ &= \left.\left(r_t + \gamma\sum_{s_{t+1}\in S} P^{a}_{s_t s_{t+1}}\, V^\pi(s_{t+1})\right)\right|_{s_t=s}\end{aligned} \qquad (6)$$

So that $V^*(s)$ does not depend on a specific policy $\pi$, we define:

$$V^*(s) = \max_a Q^*(s,a) \qquad (7)$$

Then, substituting (7) into (6), we get:

$$Q^*(s_t,a_t) = r_t + \gamma\sum_{s_{t+1}\in S} P^{a}_{s_t s_{t+1}}\max_a Q^*(s_{t+1},a) \qquad (8)$$

However, in practice, the environment model is not known a priori. In such circumstances, the optimal value functions can be obtained through TD (temporal difference) and MC (Monte-Carlo) RL algorithms [35], which are more suitable for obtaining the optimal policy. Q-Learning is a well-known TD algorithm, which works by evaluating the state-action pair values through interactions with the environment without knowing its model in advance. Based on the above analysis, Eq. (8) can be written as:

$$Q(s_t,a_t) \leftarrow (1-\alpha)\,Q(s_t,a_t) + \alpha\left[r_t + \gamma\max_a Q(s_{t+1},a)\right] \qquad (9)$$

where $\alpha$ ($0\le\alpha\le 1$) is the learning rate, which limits how rapidly the learning process can proceed, $r_t$ is the immediate feedback from the environment, and the discount factor $\gamma$ ($0\le\gamma\le 1$) determines how important future Q-values are.
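As a concrete illustration of the update rule (9), the following minimal Python sketch performs one tabular Q-Learning step; the variable names and the default values of $\alpha$ and $\gamma$ are illustrative assumptions, not taken from the paper.

```python
# Minimal sketch of the tabular Q-Learning update in Eq. (9).
from collections import defaultdict

Q = defaultdict(float)  # (state, action) -> estimated value

def q_update(s_t, a_t, r_t, s_next, actions_next, alpha=0.5, gamma=0.9):
    """One step: Q(s,a) <- (1-alpha)*Q(s,a) + alpha*(r + gamma*max_a' Q(s',a'))."""
    best_next = max((Q[(s_next, a)] for a in actions_next), default=0.0)
    Q[(s_t, a_t)] = (1 - alpha) * Q[(s_t, a_t)] + alpha * (r_t + gamma * best_next)
```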

3.3 Routing Based on Q-Learning

Based on Q-Learning, the routing problem can be modeled as follows. The entire network is modeled as the environment. Packet forwarding by the intermediate nodes corresponds to the interaction process in Q-Learning. Each packet denotes an agent, and each node can be considered as one state for a packet forwarded in the network, so all the network nodes together compose the state space of the forwarded packets. The next hop selection among the one-hop neighbors can be regarded as an action; therefore the set of actions available at a node for a packet is its set of one-hop neighbors. Obviously, a state transition can be mapped to the forwarding of a packet from one node to one of its neighbors. However, the learning task must be done in a distributed way at each node, since a global view of the network state is impossible. Once the optimal next hop for an outgoing packet is determined, the node gets back the reward.

According to (9), combined with the routing problem, we can intuitively revise it as:

$$Q_s(d,x) \leftarrow (1-\alpha)\,Q_s(d,x) + \alpha\left[R + \gamma\max_{y\in N(x)} Q_x(d,y)\right] \qquad (10)$$

where $Q_s(d,x)$ denotes the Q value at node $s$ for destination node $d$ via its neighbor $x$, $N(x)$ denotes the one-hop neighbors of $x$, and $R$ is the instant reward for $s$ forwarding the packet toward $d$ through $x$. $\alpha$ is the learning rate and controls the speed of the learning task: the higher the value of $\alpha$, the faster the Q value is updated and the better the adaptability to the dynamic characteristics of the network. However, if $\alpha$ is too large, the rewards can mislead packet forwarding, because agents may learn incorrect immediate rewards in some cases. $\gamma$ is the discount factor and denotes the importance of future rewards. If $\gamma$ is too low, immediate rewards dominate, while higher values of $\gamma$ lead to over-reliance on future rewards.
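To make this mapping concrete, the following sketch models a per-node Q table indexed by (destination, neighbor) pairs and applies (10); the class, method names and default parameter values are illustrative assumptions rather than the paper's implementation.

```python
# Sketch of a per-node routing Q table and the update of Eq. (10).
class RoutingAgent:
    def __init__(self, alpha=0.7, gamma=0.9):
        self.q_table = {}       # (destination, neighbor) -> Q value
        self.neighbors = set()  # current one-hop neighbors
        self.alpha, self.gamma = alpha, gamma

    def best_next_hop(self, dest):
        """Forward to the neighbor with the maximal Q value for dest (None if no neighbor)."""
        candidates = [(self.q_table.get((dest, n), 0.0), n) for n in self.neighbors]
        return max(candidates, key=lambda c: c[0])[1] if candidates else None

    def update(self, dest, neighbor, reward, neighbor_best_q):
        """Apply Eq. (10); neighbor_best_q is max_y Q_x(dest, y) advertised by the neighbor."""
        old = self.q_table.get((dest, neighbor), 0.0)
        self.q_table[(dest, neighbor)] = (1 - self.alpha) * old + \
            self.alpha * (reward + self.gamma * neighbor_best_q)
```

In ARPRL, the quantity $\max_y Q_x(d,y)$ is not computed locally but is learned from the control and data packets exchanged with neighbor $x$, as described in Sect. 4.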

4 The Proposed Protocol

In this section, we provide an elaborated description of ARPRL. ARPRL is designed for VANETs and aims to find an optimal route between the source and the destination. ARPRL employs a distributed Q-learning algorithm to learn the best multi-hop route while considering the distinctive characteristics of VANETs.

4.1 Assumptions

In the description of the ARPRL algorithm, we make the following network operational assumptions. Each vehicle has a unique ID ranging from 1 to N, where N denotes the number of vehicles. Each vehicle knows its own location and velocity through GPS, and those of its one-hop neighbors via periodic HELLO messages. MAC layer packet losses are fed back to the network layer. Each node maintains two tables: a neighbor table and a Q table. The neighbor table stores dynamic neighbor information and is updated through HELLO messages. The Q table is used to route packets and is updated through HELLO messages, DATA forwarding and the MAC layer's feedback signal. Each vehicle periodically exchanges the optimal part of its Q table with its neighbor nodes. Every vehicle becomes aware of the joining and leaving of its neighbors through the HELLO timer.

4.2 ARPRL Protocol Overview

In ARPRL, each vehicle periodically exchanges HELLO packets with its neighbors to learn real-time changes in the network topology. This is one of the most important forms of dynamic learning. On receiving a HELLO message, each vehicle updates its Q table according to the information contained in the HELLO packet. Each HELLO packet includes the sender vehicle's position, velocity and some Q values extracted from its Q table. When a source vehicle has a packet to send to a destination vehicle, it first searches its Q table for a valid next hop. If none exists, ARPRL initiates a route probe process, which closely resembles that of AODV, to reactively learn an optimal route to the destination. Each vehicle in the network acts as a learning agent and continuously gathers network link state information by exchanging HELLO packets with its neighbor vehicles. Data packets are forwarded to the neighbor vehicle with the maximal Q value. Since the next hop is determined by the Q value, and the Q table is updated upon the periodic exchange of HELLO packets, DATA forwarding and the MAC layer's feedback signal, the neighbor selected as the next hop is always the best currently known.

4.3 Route Probe for Boosting Convergence of Q-Learning

Convergence is a key issue for Q-Learning. Therefore, at the beginning of the learning process, an on-demand route probe mechanism is adopted as a supplement to speed up Q-Learning convergence. When a route to a destination vehicle is needed, the source vehicle broadcasts a learning probe request (LPREQ) packet. Upon first receiving the LPREQ packet, each intermediate vehicle updates the corresponding Q value and rebroadcasts the LPREQ packet until it reaches the destination vehicle. The destination vehicle then unicasts a learning probe reply (LPREP) packet back to the source vehicle by consulting the just learned Q table. Algorithm 1 presents the pseudocode for the route probe process.
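Since the transcript does not reproduce Algorithm 1, the sketch below illustrates how the LPREQ handling described above could look, reusing the RoutingAgent sketch from Sect. 3.3 extended with an id, a set of already seen probe sequence numbers and send callbacks; all field and helper names are assumptions.

```python
# Rough sketch of LPREQ handling (cf. the route probe description above).
def handle_lpreq(agent, msg, broadcast, unicast_lprep):
    """msg: dict with keys 'seq', 'src', 'dst', 'sender', 'sender_best_q'."""
    if msg['seq'] in agent.seen:           # process each probe only once
        return
    agent.seen.add(msg['seq'])
    # Reinforce the reverse link toward the source via the sending neighbor (Eq. (10));
    # the constant reward applies only when the probe comes directly from the source.
    reward = 100.0 if msg['sender'] == msg['src'] else 0.0
    agent.update(msg['src'], msg['sender'], reward, msg['sender_best_q'])
    if agent.id == msg['dst']:
        unicast_lprep(msg)                 # destination replies along the learned route
    else:                                  # otherwise keep flooding toward the destination
        best_q = max((agent.q_table.get((msg['src'], n), 0.0)
                      for n in agent.neighbors), default=0.0)
        broadcast(dict(msg, sender=agent.id, sender_best_q=best_q))
```

The LPREP is handled analogously, except that it is unicast hop by hop toward the source using the Q table entries just learned for the source.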


The following example illustrates the route probe process through the LPREQ and LPREP packets. In the road network segment shown in Fig. 1, each vehicle maintains a Q table consisting of Q-values $Q(d,x)$, where $d$ is the destination vehicle and $x$ is the next hop toward $d$. As illustrated in Fig. 1, all Q table values are initialized to 0 except those where both the destination and the next hop equal the current vehicle. Taking the Q table of source vehicle $S$ in Fig. 1 as an example, $Q_S(S,S)$ is set to 1 (marked by the dark green cell in Fig. 1), while the other values are set to 0.

Fig. 1 A segment of road network and each vehicle's initial Q-Table

When $S$ initially sends packets to $D$, $S$ first broadcasts an LPREQ message. On receiving the LPREQ message, $V_1$'s Q table value $Q_{V_1}(S,S)$ is updated according to (10). For simplicity, the learning rate $\alpha$ and discount factor $\gamma$ in (10) are set to 0.9 and 1.0, respectively. The constant reward $R$ equals 100 if the LPREQ message is broadcast by $S$; otherwise $R$ is set to 0. Note that each row in the Q table represents a destination node and each column denotes a next hop. In addition, the specific row whose destination node equals the vehicle itself is used as the route flag row. Therefore $Q_{V_1}(S,S)$ is calculated according to (10) as follows:

$$Q_{V_1}(S,S) = (1-0.9)\times 0 + 0.9\times\{100 + 1\times 0\} = 90$$

After $Q_{V_1}(S,S)$ is updated, the corresponding route flag $Q_{V_1}(V_1,S)$ is set to 1. Since $V_1$ is not the destination vehicle, it continues to rebroadcast the LPREQ message to its neighbors $V_2$ and $V_3$. Then $V_2$ and $V_3$ update their respective Q tables as $V_1$ did. However, unlike $V_1$, $V_2$ and $V_3$ update $Q_{V_2}(S,V_1)$ and $Q_{V_3}(S,V_1)$ in addition to $Q_{V_2}(V_1,V_1)$ and $Q_{V_3}(V_1,V_1)$. $Q_{V_2}(S,V_1)$ and $Q_{V_3}(S,V_1)$ are calculated from (10) as follows:

$$Q_{V_2}(S,V_1) = (1-0.9)\times 0 + 0.9\times\{0 + 1\times 90\} = 81$$
$$Q_{V_3}(S,V_1) = (1-0.9)\times 0 + 0.9\times\{0 + 1\times 90\} = 81$$

Then the corresponding route flags $Q_{V_2}(V_2,S)$, $Q_{V_2}(V_2,V_1)$, $Q_{V_3}(V_3,S)$ and $Q_{V_3}(V_3,V_1)$ are also set to 1. Eventually, the LPREQ message successfully reaches $D$, and the updated Q table of each vehicle is shown in Fig. 2, in which the light blue cells indicate the previous-step Q table status, and the dark green and yellow cells indicate the newly marked route flags and Q values learned through the LPREQ message. It is worth noting that, for the update of $Q_D(S,V_3)$, it is obvious from Fig. 2 that

$$\max_{y\in N(V_3)} Q_{V_3}(S,y) = Q_{V_3}(S,V_1)$$

In addition, $Q_{V_3}(S,V_1)$ is contained in the LPREQ message broadcast by $V_3$. Thereafter, upon receiving the LPREQ message sent by $V_3$, $D$ extracts $Q_{V_3}(S,V_1)$ from the message and updates $Q_D(S,V_3)$ as follows:

$$Q_D(S,V_3) = (1-0.9)\times 0 + 0.9\times\{0 + 1\times 81\} = 72.9$$

Upon receiving the LPREQ message, $D$ responds with an LPREP packet back to $S$. The LPREP message is routed backward through the intermediate vehicles according to the learned knowledge. Upon receiving the LPREP message, the intermediate vehicles update the corresponding Q values according to (10). A path between $S$ and $D$ is thereby discovered through the LPREQ and LPREP messages. The updated Q table values of each vehicle are shown in Fig. 3. According to the Q table, the newly discovered path from $S$ to $D$ is $S\to V_1\to V_3\to D$, as indicated in Fig. 3.

4.4 Format Design and Handling of HELLO Messages for Route Loop Reduction

The structure of HELLO packets is depicted in Fig. 4. The information contained in a HELLO message includes: ID, position, velocity, creation time and an array of QMax items. Each QMax item consists of three fields: destination vehicle, Q value and next hop. The NextHop field is the key to avoiding single-hop route loops. Algorithm 2 presents the pseudocode for HELLO message processing.
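Because Algorithm 2 is likewise not reproduced in the transcript, the following sketch mirrors the HELLO layout of Fig. 4 and the loop-avoidance role of the NextHop field; the types, field names and helper methods are illustrative assumptions.

```python
# Sketch of the HELLO message layout (Fig. 4) and its processing.
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class QMaxEntry:
    dest: int        # destination the advertised Q value refers to
    q_value: float   # sender's best Q value toward dest
    next_hop: int    # sender's best next hop toward dest

@dataclass
class HelloMessage:
    sender_id: int
    position: Tuple[float, float]   # (x, y) from GPS
    velocity: Tuple[float, float]   # (vx, vy)
    create_time_ms: int             # broadcast timestamp
    qmax: List[QMaxEntry] = field(default_factory=list)

def process_hello(agent, hello):
    """Update the receiver's neighbor table and Q table from a HELLO (cf. Eq. (11))."""
    agent.refresh_neighbor(hello.sender_id, hello.position,
                           hello.velocity, hello.create_time_ms)
    for entry in hello.qmax:
        # Ignore entries whose advertised next hop is the receiver itself: forwarding
        # through that neighbor would form a single-hop routing loop.
        if entry.next_hop == agent.id or entry.dest == agent.id:
            continue
        agent.update(entry.dest, hello.sender_id,
                     agent.reward(hello.sender_id),  # R_{c,n} from Eq. (12)
                     entry.q_value)                  # advertised max_y Q_n(dest, y)
```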

Fig. 2 Q Table after Broadcast of LPREQ from S


Fig. 3 Q Table after Unicast of LPREP from D to S

Fig. 4 HELLO Construction according to Q table


Each vehicle maintains a HELLO timer which fires once every HELLO_INTERVAL; when the timer expires, the vehicle constructs a new HELLO message according to its Q table and broadcasts it. Taking $V_3$ as an example, the contents of its Q table and the corresponding HELLO message are shown in Fig. 4. The timestamp field has the value 39,854 ms, which is the broadcast time of the HELLO message. When receiving a HELLO message, each vehicle updates its Q table according to the information contained in the message. The Q values for a specific neighbor are reset to 0 if no HELLO message has been received from that neighbor for a certain time. For example, the Q values learned via $V_1$, such as $Q_{V_3}(S,V_1)$, are set to 0 when $V_1$ goes out of the range of $V_3$. Furthermore, the routing flag $Q_{V_3}(V_3,V_1)$ is also set to 0 accordingly.
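A minimal sketch of this neighbor expiry step is given below; the timer values and attribute names are assumptions chosen only for illustration.

```python
# Sketch of HELLO-timeout handling: Q values learned via a vanished neighbor are reset.
HELLO_INTERVAL_MS = 1000   # assumed HELLO period
ALLOWED_LOSS = 3           # assumed number of missed HELLOs before expiry

def expire_neighbors(agent, now_ms):
    for n, last_seen in list(agent.last_hello.items()):
        if now_ms - last_seen > ALLOWED_LOSS * HELLO_INTERVAL_MS:
            agent.neighbors.discard(n)
            for (dest, nxt) in agent.q_table:
                if nxt == n:                     # every entry routed via n becomes invalid
                    agent.q_table[(dest, nxt)] = 0.0
            del agent.last_hello[n]
```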


4.5 Q-Table Maintenance Considering the Characteristics of VANETs

The Q table is updated mainly through the periodic exchange of HELLO packets between neighboring vehicle nodes, as discussed in Sect. 4.4. In addition, the Q table is also updated on the reception of LPREQ and LPREP messages, which boosts the convergence of the Q-learning algorithm during the route probe process, as discussed in Sect. 4.3. Furthermore, the feedback or acknowledgment mechanism provided by the MAC layer is used to compensate for the Q table update lag caused by the fast mobility of vehicle nodes.

Considering the specific characteristics of VANETs, we designed a dynamic Q table update strategy based on (10). Upon receiving a HELLO message from a neighbor vehicle $n$, the current vehicle $c$ updates its Q table as follows:

$$Q_c(d,n) = (1-\alpha_{c,n})\,Q_c(d,n) + \alpha_{c,n}\left[R_{c,n} + \gamma_{c,n}\max_{y\in Nei(n)} Q_n(d,y)\right] \qquad (11)$$

in which $R_{c,n}$ is defined as:

$$R_{c,n} = C + HMRR_{c,n} + LET_{c,n} \qquad (12)$$

where $C$ is a constant with a value of 100. $\alpha_{c,n}$ and $\gamma_{c,n}$ are defined respectively as:

$$\alpha_{c,n} = \max\left(0.2,\ \frac{\big|\,|v_c| - |v_n|\,\big|}{v_{max} - v_{min}}\right) \qquad (13)$$

$$\gamma_{c,n} = \begin{cases}\dfrac{\sum_{n=1}^{N} R_{c,n}}{N}, & N\ne 0\\[4pt] 0, & N = 0\end{cases} \qquad (14)$$

$HMRR_{c,n}$ (HELLO Message Reception Ratio) is defined as:

$$HMRR(c,n) = \begin{cases}100\cdot\dfrac{CNT_r(c,n)}{CNT_s(n)}, & CNT_s(n)\ge 15\\[6pt] 100\cdot\dfrac{CNT_r(c,n)}{CNT_s(n)}\cdot\left(1-\left(\dfrac{1}{2}\right)^{CNT_s(n)}\right), & \text{otherwise}\end{cases}$$

where $CNT_r(c,n)$ and $CNT_s(n)$ denote the number of HELLO messages received at $c$ from, and sent by, one-hop neighbor $n$, respectively. Here, neighbors are distinguished according to whether the neighbor duration time is less than 15 s (i.e. $CNT_s(n) < 15$).
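The HMRR definition above translates directly into code; the sketch below is a plain transcription with a division-by-zero guard added as an extra assumption.

```python
# HMRR(c, n) as defined above: cnt_r = HELLOs received from n, cnt_s = HELLOs sent by n.
def hmrr(cnt_r, cnt_s):
    if cnt_s == 0:
        return 0.0                      # guard added here; not part of the definition
    ratio = 100.0 * cnt_r / cnt_s
    if cnt_s >= 15:
        return ratio
    # Young neighbors (fewer than 15 HELLOs) are discounted by 1 - (1/2)^CNT_s.
    return ratio * (1.0 - 0.5 ** cnt_s)
```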

$LET_{c,n}$ is defined as (15):

$$LET_{c,n} = \begin{cases}100, & A = 0 \text{ and } B = 0\\[6pt] \min\left(100,\ \dfrac{-(AB+CD)+\sqrt{(A^2+C^2)R^2-(AD-BC)^2}}{A^2+B^2}\right), & \text{otherwise}\end{cases} \qquad (15)$$

where


$$\begin{aligned} A &= v_c\cos(\theta_{v_c}) - v_n\cos(\theta_{v_n}), \qquad & B &= x_c - x_n\\ C &= v_c\sin(\theta_{v_c}) - v_n\sin(\theta_{v_n}), \qquad & D &= y_c - y_n\end{aligned} \qquad (16)$$

Here $(x_c,y_c)$ and $(x_n,y_n)$ are the positions, $v_c$ and $v_n$ the speeds, and $\theta_{v_c}$ and $\theta_{v_n}$ the movement directions of vehicles $c$ and $n$, and $R$ in (15) is the transmission range.
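Equations (15) and (16) can be transcribed as follows; velocities are given here as (vx, vy) vectors, which is equivalent to the speed/heading form of (16), R defaults to the 250 m transmission range used in the simulations, and the negative-discriminant guard is an added assumption.

```python
# Link expiration time LET_{c,n} per Eqs. (15)-(16), capped at 100.
import math

def let(pos_c, vel_c, pos_n, vel_n, R=250.0):
    A = vel_c[0] - vel_n[0]   # relative velocity, x component
    B = pos_c[0] - pos_n[0]   # relative position, x component
    C = vel_c[1] - vel_n[1]   # relative velocity, y component
    D = pos_c[1] - pos_n[1]   # relative position, y component
    if A == 0 and B == 0:
        return 100.0
    disc = (A * A + C * C) * R * R - (A * D - B * C) ** 2
    if disc < 0:
        return 0.0            # assumption: treat an already-broken link as expired
    return min(100.0, (-(A * B + C * D) + math.sqrt(disc)) / (A * A + B * B))
```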

Upon forwarding a DATA packet originated from source $s$ and received from neighbor $n$, $c$ updates the corresponding Q table value as follows:

$$Q_c(s,n) = (1-\alpha_{c,n})\,Q_c(s,n) + \alpha_{c,n}\left[R_{c,n} + \gamma_{c,n}\max_{y\in Nei(n)} Q_n(s,y)\right] \qquad (17)$$

Upon receiving a MAC layer notification of packet loss for neighbor $n$, $c$ updates the corresponding Q table values for each destination $d_i$ as follows:

$$Q_c(d_i,n) = 0.5\cdot Q_c(d_i,n) \qquad (18)$$
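Taken together, the three update triggers of this subsection can be sketched as follows; $\alpha_{c,n}$, $\gamma_{c,n}$ and $R_{c,n}$ come from (12)-(16), and the helper names are illustrative assumptions.

```python
# Sketch of the three Q-table update triggers of Sect. 4.5.
def on_hello(agent, n, dest, advertised_best_q):
    a, g, r = agent.alpha(n), agent.gamma(n), agent.reward(n)   # Eqs. (13), (14), (12)
    old = agent.q_table.get((dest, n), 0.0)
    agent.q_table[(dest, n)] = (1 - a) * old + a * (r + g * advertised_best_q)  # Eq. (11)

def on_data_forward(agent, n, src, advertised_best_q):
    on_hello(agent, n, src, advertised_best_q)   # Eq. (17): same form with dest = packet source

def on_mac_loss(agent, n):
    for (dest, nxt) in agent.q_table:
        if nxt == n:
            agent.q_table[(dest, nxt)] *= 0.5    # Eq. (18): halve every value routed via n
```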

From (11), it should be noted that the link with the maximum LET is regarded as the most stable. To demonstrate the advantage of LET, Fig. 5 shows an intersection scenario in VANETs. As shown in Fig. 5, communication between vehicles $S$ and $D$ is possible through two optional routes: one via $A$ (Route 1: $S\to A\to B\to D$) and the other via $C$ (Route 2: $S\to C\to E\to D$). Since vehicle $A$ is moving farther and farther away from $S$, while vehicle $C$ continues straight together with $S$, Route 1 is likely to be disconnected after a certain time due to the neighbor link break ($S'\to A'$). Consequently, the neighbor $C$ of $S$ is more suitable to be selected as the next hop on the path between vehicles $S$ and $D$.

5 Experiment Results

To conduct the performance evaluation of our proposed protocol ARPRL, we implement it in the network simulator QualNet 7.1 [36]. To compare the performance of ARPRL with that of AODV [11], QLAODV [30], QROUTING [28] and GPSR [18], we also implement the latter three routing protocols (QLAODV, QROUTING, GPSR). In the following sections, we give the performance metrics used to evaluate the routing protocols, the simulation parameters and the analysis of the corresponding results.

5.1 Metrics

We assess the protocols' performance by varying the number of vehicles, the maximum velocity and the CBR data generation interval in a predefined fixed Manhattan scenario area. The performance metrics are the following:

Average Packet Delivery Ratio (APDR) This metric is defined as the ratio of the average number of packets successfully received by the destination vehicles to the average number of packets sent out by the source vehicles. It shows the ability to transfer application traffic data between source and destination.

Average End-to-End Delay (AEED) This metric is defined as the average time taken for packets to be successfully transmitted from their source to their destination. It indicates the timeliness with which the routing protocol transmits packets from source to destination.


Average Hops Count (AHC) This metric is defined as the average number of intermediate nodes through which the successfully delivered packets have passed between the source and the destination. It indicates the severity of traffic duplication.
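For reference, the three metrics can be computed from per-packet simulation logs as sketched below; the log format (send and receive records) is an assumption made only for illustration.

```python
# Sketch of APDR / AEED / AHC computation from per-packet records.
def compute_metrics(sent, received):
    """sent: list of (pkt_id, t_send); received: list of (pkt_id, t_recv, hop_count)."""
    apdr = 100.0 * len(received) / len(sent) if sent else 0.0
    t_send = dict(sent)
    delays = [t_recv - t_send[pid] for pid, t_recv, _ in received if pid in t_send]
    aeed = sum(delays) / len(delays) if delays else 0.0
    ahc = sum(h for _, _, h in received) / len(received) if received else 0.0
    return apdr, aeed, ahc
```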

5.2 Simulation Setup

In the Manhattan simulation scenario, we use 20 horizontal and 20 vertical streets in a 2000 m × 2000 m field, which forms a layout of multiple 500 m × 500 m grids with 25 intersections. Each street has 2 lanes in both directions. The vehicles move according to the traffic lights deployed at the intersections, with a 30 s yellow signal interval. We use VanetMobiSim [37], a well-known and validated framework for vehicle mobility modeling at both the macroscopic and microscopic levels, to generate the movement of vehicles. The first 1000 s of the VanetMobiSim output were discarded so that the remaining trace reflects realistic vehicle movements. Since we focus on routing protocol performance and the IEEE 802.11p PHY/MAC modules standardized by IEEE specifically for vehicular communication are not available in QualNet 7.1, we adopt IEEE 802.11a as the lower-layer protocol. The other parameters are the default settings of QualNet 7.1, except for those shown in Table 1.

Fig. 5 An intersection scenario considering LET in VANET


5.3 Simulation Results

5.3.1 Performance for Varying Number of Vehicles

The density of the vehicle nodes in VANETs has a significant effect on protocol performance. In this part, we first fix the maximum velocity of the vehicle nodes at 15 m/s and the CBR packet interval at 1 s. The number of vehicles varies from 50 to 350 to represent different vehicle densities. The statistical results are described below.

Figure 6 shows the average packet delivery ratio (APDR) of each protocol when varying the number of vehicles. From Fig. 6, we can see that the APDR of all five protocols

Table 1 Simulation settings

Parameter                  | Value
---------------------------|--------------------------------
Simulator                  | QualNet (v7.1)
Simulation time            | 900 s
Simulation area            | 2000 m × 2000 m
Number of vehicles         | 50, 100, 150, 200, 250, 300, 350
Minimal vehicle velocity   | 0 m/s
Maximal vehicle velocity   | 1, 5, 10, 15, 20, 25, 30 m/s
Transmission range         | 250 m
Number of CBR flows        | 20 (randomly selected)
CBR packet interval        | 0.1, 0.2, 0.6, 1, 2, 4, 6 s
CBR packet size            | 512 bytes
MAC protocol               | IEEE 802.11a
Channel frequency          | 5.885 GHz
Channel data rate          | 6 Mbps
Propagation model          | Two Ray

Fig. 6 Average packet delivery ratio versus number of vehicles


increases as the vehicle density increases when the number of nodes is less than 350 (for QROUTING, 300). This is because the connectivity of the network increases as the node density increases. However, the APDR decreases slightly as the vehicle density increases when the number of vehicles is higher than 300 (for QROUTING, 250). The reason is that the higher the node density, the greater the possibility of channel collisions. In general, ARPRL outperforms all of the other protocols because it considers link reliability and vehicle mobility in the dynamic Q-learning process. GPSR relies only on the location of the neighbor vehicles to select the next hop, which easily falls into local optima; therefore GPSR has the lowest APDR when the number of vehicles is less than 300. QROUTING has the lowest APDR at high vehicle densities because of routing loops and hence excessive collisions. QLAODV performs better than AODV at low and medium vehicle densities by continuously learning the network status through the broadcast of HELLO packets; at high vehicle density the result is reversed because of QLAODV's high overhead. ARPRL performs better than QLAODV in all cases, because ARPRL improves on QLAODV through periodic learning, the on-demand route probe and the MAC layer feedback, and hence shows an advantage over QLAODV. On average, ARPRL improves the APDR by 23.4% and 22.6% compared with QLAODV and AODV, respectively.

Figure 7 shows the average end-to-end delay (AEED) of each protocol for the successfully delivered CBR packets when varying the number of vehicles. For AODV and QLAODV, the AEED decreases as the number of vehicles increases from 50 to 350. This is because the lower the vehicle density, the higher the probability of network partition, in which case packets need to be stored for later forwarding and the AEED thus increases. AODV shows the highest AEED because of the excessive route discoveries incurred by the fast movement of vehicles. For QLAODV, the slow convergence and route loops introduced in the learning process increase the AEED. ARPRL and QROUTING show AEED similar to GPSR, which has the lowest AEED. This is because the proactive Q table maintenance in ARPRL and QROUTING can switch to a sub-optimal route, while AODV and QLAODV will not change to better routes until the current active one breaks. GPSR and QROUTING perform better than ARPRL, with AEED lower by 6.9 and 2.6 ms, respectively, as a result of the route probe mechanism of ARPRL, which introduces slight additional delay. However, compared with QLAODV and AODV, ARPRL reduces the AEED by 162.2 and 384.8 ms on average, respectively.

Figure 8 shows the average hops count (AHC) of each protocol for the successfully delivered CBR packets when varying the number of vehicles. In most cases, the AHC decreases with increasing vehicle density for all five protocols when the number of vehicles is more than 50. This is because frequent network partitions result in increased route breaks and loops; frequent network topology changes also contribute to this behavior. For QROUTING and ARPRL, the average hop count increases as the number of vehicles varies from 50 to 150, due to more and more vehicles participating in the forwarding of packets. When the number of vehicles varies from 150 to 350, the average hop count decreases because more and more vehicles gather at the intersections, which helps to find shorter routes. In addition, AODV, QLAODV and ARPRL adopt a route discovery strategy and accordingly have a lower AHC than QROUTING and GPSR in most cases. More importantly, ARPRL shows significantly fewer hops than AODV and QLAODV at high vehicle density due to the route probe and the MAC layer packet loss notification. Compared with QLAODV and AODV, ARPRL reduces the AHC by 3.58 and 4.44 hops on average, respectively.


5.3.2 Performance for Varying Maximum Velocity

In this part, we evaluate the performance of each protocol by varying the maximum vehicle velocity from 1 to 30 m/s, while the number of vehicles and the CBR packet interval are fixed at 200 and 1 s, respectively. The statistical results are described below.

Fig. 7 Average end-to-end delay versus number of vehicles

Fig. 8 Average hops count versus number of vehicles


Figure 9 shows the average packet delivery ratio (APDR) of each protocol when varying the maximum allowable velocity. From Fig. 9, it can be concluded that the APDR of all five protocols decreases as the maximum vehicle velocity varies from 1 to 30 m/s. This is because the increase in vehicle velocity causes more network topology changes and network partitions, in which the number of dropped packets increases. As the velocity varies from 25 to 30 m/s, the packet delivery ratio of the five protocols tends to increase. The reason is that the packet carry time decreases when the velocity varies from 25 to 30 m/s, so fewer packets are dropped due to packet timeout. ARPRL not only considers the number of hops, as AODV does, but also overcomes the slow convergence and routing loops of QROUTING. In addition, LET is considered in the learning process, which further enhances route reliability. Thus ARPRL performs better than the other four protocols. On average, ARPRL increases the APDR by 20.3% and 24.8% compared with QLAODV and AODV, respectively.

Figure 10 shows the average end-to-end delay (AEED) of each protocol for the successfully delivered CBR packets when varying the maximum allowable velocity. Figure 10 indicates that the AEED of all five protocols increases as the maximum vehicle velocity varies from 5 to 30 m/s. This is because high mobility leads to rapid changes in network topology, which increases the possibility of selecting a sub-optimal routing path and hence increases the delay. In addition, high mobility also aggravates network partitions, which incurs packet carrying and increases delay. When the maximum vehicle velocity varies from 25 to 30 m/s, the duration of network partitions becomes shorter; thus the packet carry time introduced by network partition is reduced and the AEED also tends to decrease for all five protocols. GPSR and QROUTING perform better than ARPRL, with AEED lower by 9.4 and 1.8 ms, respectively, due to the route probe mechanism of ARPRL. However, on average, ARPRL reduces the AEED by 112.3 and 284.6 ms compared with QLAODV and AODV, respectively.

Fig. 9 Average packet delivery ratio versus maximum allowable velocity

Figure 11 shows the average hops count (AHC) of each protocol for the successfully delivered CBR packets when varying the maximum allowable velocity. The result shows that the AHC increases as the maximum vehicle velocity varies from 1 to 10 m/s, because the frequency of route breaks increases with velocity. Since ARPRL considers the number of hops and the link expiration time, it performs better than QLAODV and AODV. As the maximum vehicle velocity varies from 15 to 30 m/s, the AHC of the five protocols changes only slightly, since the higher velocity improves network connectivity and reduces the probability of network partition, which tends to yield shorter route paths. On average, ARPRL reduces the AHC by 3.3 and 3.9 hops compared with QLAODV and AODV, respectively.

Fig. 10 Average end-to-end delay versus maximum allowable velocity

Fig. 11 Average hops count versus maximum allowable velocity


5.3.3 Performance for Varying Data Generation Interval

After analyzing the effect of vehicle velocity on protocol performance, in this part we evaluate each protocol by varying the data generation interval from 0.1 to 6 s, while the maximum allowable velocity and the number of vehicles are fixed at 15 m/s and 200, respectively. The statistical results are described below.

In Fig. 12, we evaluate the average packet delivery ratio (APDR) of each protocol when varying the data generation interval. As shown in Fig. 12, the APDR of ARPRL, QLAODV and AODV decreases as the packet interval (PI) varies from 0.1 to 6 s. This is because a larger PI leads to less frequent route discovery, so more packets are dropped over stale routes. For QROUTING and GPSR, the APDR remains approximately constant in all configurations, mainly because their routing state is maintained only through periodic HELLO packets. Figure 12 also shows that ARPRL achieves the highest APDR for all values of PI. This can be explained by the fact that ARPRL combines the advantages of proactive route learning through the distributed Q-Learning algorithm and the reactive route probe mechanism. On average, ARPRL delivers 19.0% and 24.0% more packets than QLAODV and AODV, respectively.

Figure 13 shows the average end-to-end delay (AEED) of each protocol for the successfully delivered CBR packets when varying the data generation interval. As shown in Fig. 13, ARPRL achieves a much lower AEED than QLAODV and AODV in all configurations of the packet interval (PI). This is because QLAODV and AODV rely on a route discovery mechanism which introduces a longer AEED, whereas in ARPRL the number of route discovery triggers is much smaller thanks to the periodic route learning, so the AEED is further reduced. On average, GPSR and QROUTING perform better than ARPRL, with AEED lower by 1.5 and 7.8 ms, respectively, due to the route probe mechanism of ARPRL. However, ARPRL reduces the AEED by 216.2 and 503.4 ms on average compared with QLAODV and AODV, respectively.

Fig. 12 Average packet delivery ratio versus data generation interval


Fig. 13 Average end-to-end delay versus data generation interval

Fig. 14 Average hops count versus data generation interval

Figure 14 shows the average hops count (AHC) of each protocol for the successfully delivered CBR packets when varying the data generation interval. The AHC of AODV and QLAODV is much higher than that of the other three protocols as the packet interval (PI) varies from 0.1 to 6 s, and the higher the CBR data rate, the larger the difference in AHC. This is expected since, in both AODV and QLAODV, a new and better route will not be discovered until the current active one breaks. For ARPRL, QROUTING and GPSR, the AHC stays approximately constant in all cases, with GPSR having the lowest AHC among them. This is because GPSR adopts a greedy forwarding strategy which always forwards packets progressively toward the destination. Although GPSR achieves the minimal AHC, it comes at the high cost of packet loss due to the local optima caused by greedy forwarding, as shown in Fig. 12. In addition, ARPRL has a lower AHC than QROUTING because of its route probe mechanism. On average, ARPRL reduces the AHC by 0.11, 3.4 and 4.3 hops compared with QROUTING, QLAODV and AODV, respectively.

6 Analysis of ARPRL

In this section, the average routing overhead (ARO) is first evaluated and compared with some related existing protocols. The ARO is defined as the ratio of the average number of bytes of non-data packets broadcast by vehicles for routing maintenance to the average number of bytes of data packets received by the destinations. This metric reflects the extra communication overhead introduced by the routing protocols. In addition, the complexity of ARPRL is also analyzed.

6.1 Routing Overhead Analysis

In ARPRL, the non-data packets consist of two categories: (1) periodic proactive HELLO packets; (2) on-demand reactive learning probe request/reply (LPREQ and LPREP) packets. Periodic HELLO packets are the main source of routing overhead (RO) introduced by ARPRL; however, they are imperative for real-time sensing of network changes. The significant difference in RO between ARPRL and the other protocols under consideration (except QROUTING) is the dynamic variable-length part of ARPRL's HELLO packet, which is used to exchange Q table information between neighbors.

Figure 15a presents the ARO of each protocol with varying the number of vehicles. As

shown in Fig. 15a, AODV, QLAODV and GPSR remain almost similar constant for all

configurations of number of vehicles. This is because the ARO mainly depends on the

average bytes of broadcast of control packets. For AODV and QLAODV, it is determined

by the number of RREQ packets broadcast in the routing discovery process. This is why

the ARO of AODV and QLAODV increase slightly with the increase of number of

vehicles as the CBR data rare is fixed at one packet per second. For GPSR, fixed length of

periodic HELLO packets are the mainly source of RO for neighbors position maintenance.

Therefore, it’s ARO stays constant with the increase of number of vehicles. For

QROUTING, the ARO is determined by the variable length of periodic broadcast of

HELLO packets for routing learning, which increases linearly with the increase of number

of vehicles. For ARPRL, the LPREQ packets also contribute to part of ARO besides the

HELLO packets which are the same as that of QROUTING. This is why the ARO of

ARPRL and QOURING increases linearly with the increase of number of vehicles

meanwhile ARPRL has slightly more ARO than QROUTING at high density of vehicles.

In Fig. 15b, it can be observed that the ARO of AODV and QLAODV increases with

the increase of maximum allowable vehicle velocity. This is expected since the increase in

vehicle velocity causes more changing network topology and hence the number of triggers

of route discovery increases. In contrast, GPSR remain constant of ARO in all cases and

has lowest ARO in all five protocols. This is due to the length and number of periodic

HELLO packets in GPSR are independent of changing network topology. In general,

2166 J. Wu et al.

123

Page 25: Reinforcement Learning Based Mobility Adaptive Routing for ...static.tongtianta.site/paper_pdf/7636a2a0-a86d-11e9-afe7-00163e08… · A novel mobility adaptive routing protocol suitable

ARPRL and QROUTING have approximately the same ARO at low and medium velocity.

At high velocity, ARPRL has slightly more ARO than QROUTING due to the increase of

the number of broadcast of LPREQ packets.

Figure 15c shows that the ARO of AODV and QLAODV decreases as the packet interval increases, since the number of route discoveries is proportional to the number of packets to be transmitted. Meanwhile, ARPRL, QROUTING and GPSR have an almost constant ARO in all cases, which can be explained by the fact that the periodic HELLO packets are independent of the data generation rate. At high data generation rates, ARPRL shows slightly more ARO than QROUTING due to the learning probe mechanism adopted in ARPRL.

In summary, ARPRL shows higher routing overhead than the other four protocols, as shown in Fig. 15. This is expected, since combining a proactive route-learning algorithm with a reactive route-probe strategy inevitably incurs additional overhead while improving overall performance. How to further reduce the ARO of ARPRL efficiently will be addressed in our future work.

6.2 Complexity Analysis

From the above experimental results, we can conclude that ARPRL is better suited to the VANET environment, achieving a higher data delivery success rate, lower delay and fewer routing hops, because it adopts a variety of optimization strategies on top of AODV and QLAODV.

Fig. 15 Average routing overhead versus: (a) number of vehicles; (b) maximum velocity; (c) packet interval


However, it is also necessary to analyze the time and space complexity of ARPRL when it is applied to a VANET. The time complexity of ARPRL depends mainly on the maintenance of the Q table, which consists of three parts, each requiring O(1) time. For a network with N vehicle nodes, the time complexity of ARPRL is O(N). The space complexity of ARPRL depends mainly on the memory required to store the Q table. In a network with N vehicle nodes, the space complexity of ARPRL is O(N³) in the worst case, which is higher than that of the other four protocols. Fortunately, this is acceptable for VANETs, because each vehicle can be equipped with a computing device of sufficiently high processing capability.
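As a minimal sketch of where these bounds come from (assuming, for illustration only, that each vehicle keeps one Q value per (destination, next-hop neighbor) pair; the update rule shown is the generic Q-learning form, not ARPRL's mobility-aware update function):

from collections import defaultdict

class QTable:
    """Per-vehicle Q table indexed by (destination, next_hop_neighbor).

    With up to N possible destinations and up to N neighbors, one node stores
    O(N^2) entries; across N nodes the network-wide worst case is O(N^3).
    """
    def __init__(self, alpha: float = 0.5, gamma: float = 0.9):
        self.alpha = alpha                      # learning rate
        self.gamma = gamma                      # discount factor
        self.q = defaultdict(float)             # (dest, neighbor) -> Q value

    def best_next_hop(self, dest: int, neighbors):
        # Scan of the current neighbor set, bounded by N in the worst case.
        return max(neighbors, key=lambda n: self.q[(dest, n)], default=None)

    def update(self, dest: int, neighbor: int, reward: float, neighbor_best_q: float):
        # Generic Q-learning update (illustrative; ARPRL's actual update also
        # folds in hop count, vehicle mobility and link expiration time).
        old = self.q[(dest, neighbor)]
        self.q[(dest, neighbor)] = old + self.alpha * (
            reward + self.gamma * neighbor_best_q - old
        )

Under this assumed layout, each node stores O(N²) entries and the N nodes together account for the O(N³) worst case, while each maintenance or forwarding step only touches a bounded number of entries per neighbor or destination.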

7 Conclusion

In this paper, we proposed ARPRL, a reinforcement learning based heuristic routing protocol for VANETs. ARPRL employs Q-Learning to dynamically learn stable and reliable routes through a variety of strategies for updating the Q table maintained by each vehicle node. The periodic exchange of HELLO messages between neighbouring vehicles, the forwarding of DATA packets and a MAC layer feedback mechanism are all used to assist in updating the Q table. To accelerate the convergence of the learning process, LPREQ and LPREP messages are used at the beginning of learning. We also designed the HELLO message structure so that the optimal part of the Q table contents is exchanged and the occurrence of routing loops is avoided to some extent. More importantly, we proposed a novel Q value update function that takes the distinctive features of VANETs into consideration. ARPRL forwards data packets according to the Q table, which is updated by the Q-learning algorithm and takes the number of hops, vehicle mobility and link expiration time into account; thus ARPRL performs better and is more suitable for packet-loss and delay-sensitive applications.
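For reference, the classical Q-learning update [35] that ARPRL's update functions specialize has the form below; the mobility-aware reward, discounting and link-lifetime terms actually used by ARPRL are defined earlier in the paper and are not restated here:

\[ Q(s,a) \leftarrow Q(s,a) + \alpha\,\big(r + \gamma \max_{a'} Q(s',a') - Q(s,a)\big) \]

In a routing setting such as this one, the state typically corresponds to the node currently holding the packet, an action to the choice of next-hop neighbor, and the reward to feedback about that one-hop forwarding decision.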

Acknowledgements We would like to thank the editors and the anonymous reviewers for their helpful comments and suggestions. This work is supported by the National Natural Science Foundation of China (Grant No. 61472305), the Aeronautical Science Foundation of China (Grant No. 20151981009) and the Science Research Program, Xi'an, China (Grant No. 2017073CG/RC036(XDKD003)).

References

1. Campolo, C., Molinaro, A., & Scopigno, R. (2015). Vehicular ad hoc networks: Standards, solutions, and research. Berlin: Springer.
2. Li, F., & Wang, Y. (2007). Routing in vehicular ad hoc networks: A survey. IEEE Vehicular Technology Magazine, 2(2), 12–22.
3. Lin, Y.-W., Chen, Y.-S., & Lee, S.-L. (2010). Routing protocols in vehicular ad hoc networks: A survey and future perspectives. Journal of Information Science and Engineering, 26, 913–932.
4. Chen, W., Guha, R. K., Kwon, T. J., Lee, J., & Hsu, Y.-Y. (2011). A survey and challenges in routing and data dissemination in vehicular ad hoc networks. Wireless Communications and Mobile Computing, 11(7), 787–795.
5. Zeadally, S., Hunt, R., Chen, Y.-S., Irwin, A., & Hassan, A. (2012). Vehicular ad hoc networks (VANETs): Status, results and challenges. Telecommunication Systems, 50(4), 217–241.
6. Sharef, B. T., Alsaqour, R. A., & Ismail, M. (2014). Vehicular communication ad hoc routing protocols: A survey. Journal of Network and Computer Applications, 40, 363–396.
7. Sutton, R. S., & Barto, A. G. (2011). Reinforcement learning: An introduction (Vol. 1). Cambridge: Cambridge University Press.
8. Kiumarsi, B., Lewis, F. L., Modares, H., Karimpour, A., & Naghibi-Sistani, M.-B. (2014). Reinforcement Q-learning for optimal tracking control of linear discrete-time systems with unknown dynamics. Automatica, 50(4), 1167–1175.
9. Clausen, T., & Jacquet, P. (2003). Optimized link state routing (OLSR). IETF Networking Group, RFC 3626, 1–75.
10. Alslaim, M. N., Alaqel, H. A., & Zaghloul, S. S. (2014). A comparative study of MANET routing protocols. In 2014 third international conference on technologies and networks for development (ICeND) (pp. 178–182). IEEE.
11. Perkins, C., Belding-Royer, E., Das, S., et al. (2003). Ad-hoc on-demand distance vector (AODV) routing. IETF Networking Group, RFC 3561, 1–38.
12. Beijar, N. (2002). Zone routing protocol (ZRP). Networking Laboratory, Helsinki University of Technology, Finland, 9, 1–12.
13. Fonseca, A., & Vazao, T. (2013). Applicability of position-based routing for VANET in highways and urban environment. Journal of Network and Computer Applications, 36(3), 961–973.
14. Kumar, S., & Verma, A. K. (2015). Position based routing protocols in VANET: A survey. Wireless Personal Communications, 83(4), 2747–2772.
15. Liu, J., Wan, J., Wang, Q., Deng, P., Zhou, K., & Qiao, Y. (2016). A survey on position-based routing for vehicular ad hoc networks. Telecommunication Systems, 62(1), 15–30.
16. Goel, N., Sharma, G., & Dhyani, I. (2016). A study of position based VANET routing protocols. In 2016 international conference on computing, communication and automation (ICCCA) (pp. 655–660). IEEE.
17. Mase, K. (2016). A survey of geographic routing protocols for vehicular ad hoc networks as a sensing platform. IEICE Transactions on Communications, 99(9), 1938–1948.
18. Karp, B., & Kung, H.-T. (2000). GPSR: Greedy perimeter stateless routing for wireless networks. In Proceedings of the 6th annual international conference on mobile computing and networking (pp. 243–254). ACM.
19. Sood, M., & Kanwar, S. (2014). Clustering in MANET and VANET: A survey. In 2014 international conference on circuits, systems, communication and information technology applications (CSCITA) (pp. 375–380). IEEE.
20. Yang, P., Wang, J., Zhang, Y., Tang, Z., & Song, S. (2015). Clustering algorithm in VANETs: A survey. In 2015 IEEE 9th international conference on anti-counterfeiting, security, and identification (ASID) (pp. 166–170). IEEE.
21. Cooper, C., Franklin, D., Ros, M., Safaei, F., & Abolhasan, M. (2016). A comparative survey of VANET clustering techniques. IEEE Communications Surveys & Tutorials, 19(1), 657–681.
22. Sucasas, V., Radwan, A., Marques, H., Rodriguez, J., Vahid, S., & Tafazolli, R. (2016). A survey on clustering techniques for cooperative wireless networks. Ad Hoc Networks, 47, 53–81.
23. Anupama, M., & Sathyanarayana, B. (2011). Survey of cluster based routing protocols in mobile ad-hoc networks. International Journal of Computer Theory and Engineering, 3(6), 806.
24. Lin, C. R., & Gerla, M. (1997). Adaptive clustering for mobile wireless networks. IEEE Journal on Selected Areas in Communications, 15(7), 1265–1275.
25. Chatterjee, M., Das, S. K., & Turgut, D. (2002). WCA: A weighted clustering algorithm for mobile ad hoc networks. Cluster Computing, 5(2), 193–204.
26. Jaiswal, S., & Adane, D. D. S. (2013). Hybrid approach for routing in vehicular ad-hoc network (VANET) using clustering approach. International Journal of Innovative Research in Computer and Communication Engineering, 1(5), 1211–1219.
27. Kakkasageri, M. S., & Manvi, S. S. (2014). Connectivity and mobility aware dynamic clustering in VANETs. International Journal of Future Computer and Communication, 3(1), 5.
28. Boyan, J. A., & Littman, M. L. (1994). Packet routing in dynamically changing networks: A reinforcement learning approach. In Advances in neural information processing systems (pp. 671–678).
29. Dowling, J., Curran, E., Cunningham, R., & Cahill, V. (2005). Using feedback in collaborative reinforcement learning to adaptively optimize MANET routing. IEEE Transactions on Systems, Man, and Cybernetics-Part A: Systems and Humans, 35(3), 360–372.
30. Wu, C., Kumekawa, K., & Kato, T. (2010). Distributed reinforcement learning approach for vehicular ad hoc networks. IEICE Transactions on Communications, 93(6), 1431–1442.
31. Plate, R., & Wakayama, C. (2015). Utilizing kinematics and selective sweeping in reinforcement learning-based routing algorithms for underwater networks. Ad Hoc Networks, 34, 105–120.
32. Santhi, G., Nachiappan, A., Ibrahime, M. Z., Raghunadhane, R., & Favas, M. K. (2011). Q-learning based adaptive QoS routing protocol for MANETs. In 2011 international conference on recent trends in information technology (ICRTIT) (pp. 1233–1238). IEEE.
33. Royer, E. M., & Perkins, C. E. (2000). Multicast ad hoc on-demand distance vector (MAODV) routing. IETF Draft, 1, 10–25.
34. Puterman, M. L. (2014). Markov decision processes: Discrete stochastic dynamic programming. Hoboken: Wiley.
35. Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning: An introduction (Vol. 1). Cambridge: MIT Press.
36. SNT. (2014). QualNet 7.1. http://web.scalable-networks.com.
37. Harri, J., Fiore, M., Filali, F., & Bonnet, C. (2011). Vehicular mobility simulation with VanetMobiSim. Simulation, 87(4), 275–300.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Jinqiao Wu received the M.S. degree in 2014 from the Xi'an University of Posts & Telecommunications, Xi'an, China. He is currently a Ph.D. candidate in computer science at Xidian University, Xi'an, China. His research interests include machine learning, networking architectures, and routing protocols.

Min Fang received her B.S. degree in computer control, M.S. degree in computer software engineering and Ph.D. degree in computer application from Xidian University, Xi'an, China, in 1986, 1991 and 2004, respectively, where she is currently a professor. Her research interests include intelligent information processing, multi-agent systems and network technology.


Xiao Li received the B.S. degree from Xi'an University of Finance and Economics, Xi'an, China, in 2012. She is currently a Ph.D. candidate in computer science at Xidian University, Xi'an, China. Her research interests include pattern recognition, machine learning and computer vision.
