Distributed Reinforcement Learning for Age of Information Minimization in Real-Time IoT Systems

Sihua Wang, Student Member, IEEE, Mingzhe Chen, Member, IEEE, Zhaohui Yang, Member, IEEE, Changchuan Yin, Senior Member, IEEE, Walid Saad, Fellow, IEEE, Shuguang Cui, Fellow, IEEE, and H. Vincent Poor, Fellow, IEEE

Abstract—In this paper, the problem of minimizing the weighted sum of age of information (AoI) and total energy consumption of Internet of Things (IoT) devices is studied. In the considered model, each IoT device monitors a physical process that follows nonlinear dynamics. As the dynamics of the physical process vary over time, each device must find an optimal sampling frequency to sample the real-time dynamics of the physical system and send the sampled information to a base station (BS). Due to limited wireless resources, the BS can only select a subset of devices to transmit their sampled information. Thus, edge devices must cooperatively sample their monitored dynamics based on local observations, and the BS must collect the sampled information from the devices immediately, hence avoiding the additional time and energy used for sampling and information transmission. To this end, it is necessary to jointly optimize the sampling policy of each device and the device selection scheme of the BS so as to accurately monitor the dynamics of the physical process using minimum energy. This problem is formulated as an optimization problem whose goal is to minimize the weighted sum of AoI cost and energy consumption. To solve this problem, we propose a novel distributed reinforcement learning (RL) approach for the sampling policy optimization. The proposed algorithm enables edge devices to cooperatively find the global optimal sampling policy using their own local observations. Given the sampling policy, the device selection scheme can be optimized to minimize the weighted sum of AoI and energy consumption of all devices. Simulations with real data of PM 2.5 pollution show that the proposed algorithm can reduce the sum of AoI by up to 17.8% and 33.9% and the total energy consumption by up to 13.2% and 35.1%, compared to a conventional deep Q-network method and a uniform sampling policy.

Index Terms—Physical process, sampling frequency, age of information, distributed reinforcement learning.

I. INTRODUCTION

For Internet of Things (IoT) applications such as environmental monitoring and vehicle tracking, the freshness of the status information of the physical process at the devices is of fundamental importance for accurate monitoring and

S. Wang and C. Yin are with the Beijing Laboratory of Advanced Information Network, and the Beijing Key Laboratory of Network System Architecture and Convergence, Beijing University of Posts and Telecommunications, Beijing 100876, China. Emails: [email protected]; [email protected].

M. Chen is with the Department of Electrical and Computer Engineering, Princeton University, Princeton, NJ, 08544, USA, Email: [email protected].

Zhaohui Yang is with the Department of Engineering, King’s College London, WC2R 2LS, UK, Email: [email protected].

W. Saad is with the Wireless@VT, Bradley Department of Electrical and Computer Engineering, Virginia Tech, Blacksburg, VA, 24060, USA, Email: [email protected].

S. Cui is with the Shenzhen Research Institute of Big Data (SRIBD) and the Future Network of Intelligence Institute (FNii), Chinese University of Hong Kong, Shenzhen, 518172, China, Email: [email protected].

H. V. Poor is with the Department of Electrical and Computer Engineering, Princeton University, Princeton, NJ, 08544, USA, Email: [email protected].

control. To quantify the freshness of the status information of sensor data, age of information (AoI) has been proposed as a performance metric [1]. AoI is defined as the duration between the current time and the generation time of the most recently received status update. Compared to conventional delay metrics that measure queuing or transmission latency, AoI considers the generation time of each measurement, thus characterizing the freshness of the status information from the perspective of the destination. Therefore, optimizing AoI in IoT leads to distinctively different system designs from those used for conventional delay optimization.

A. Related Works

The existing literature, such as the works in [2]–[7], has focused on many key AoI problems in IoT settings. In particular, in [2], the authors optimized wireless resource allocation to minimize the average instantaneous AoI. The authors in [3] derived the analytical expression of the average AoI for different sensors in a computing-enabled IoT system. In [4], an uplink grant-free massive access protocol was introduced for an IoT network with multiple channels to minimize the sum AoI of the IoT devices. The authors in [5] optimized the AoI of each user under a sampling cost constraint. The AoI of each energy harvesting transmitter is minimized for both first-come first-serve and last-come first-serve systems in [6]. The authors in [7] designed an age-oriented relaying protocol to minimize the average AoI of IoT devices. However, the existing works in [2]–[7] only investigated the optimization of the sampling policy without considering the dynamics of the physical process. In fact, the dynamics of a realistic physical process in a cyber-physical system such as the IoT will strongly influence the optimization of the sampling policy of each device and the device selection scheme of the base station (BS). For example, as the physical process varies rapidly, an IoT device must increase its sampling frequency to capture these physical dynamics. Meanwhile, the BS must immediately allocate the limited wireless resources to those devices that require a higher frequency for physical dynamics transmission. In contrast, as the physical process varies slowly, an IoT device can save energy by reducing its sampling frequency. Therefore, the analysis of the real-world dynamics of each physical process will strongly affect the optimization of the sampling and transmission schemes. However, since the dynamics of the monitored physical process are not available to the BS until the devices sample the physical process and upload their sampled information to the BS successfully, the BS may



not be able to find the optimal sampling and transmission schemes using the traditional optimization methods in [2]–[7]. To address this challenge, one promising solution is to use reinforcement learning (RL) to allow the BS to estimate the dynamics of the monitored physical process and find the optimal sampling and transmission schemes.

Recently, a number of existing works such as [8]–[12] used RL algorithms to solve problems involving AoI as a performance measure. In [8], the authors developed an RL algorithm for optimizing resource allocation so as to minimize the sum of AoI of all source nodes. The authors in [9] used RL methods to make scheduling decisions that are resilient to network conditions and packet arrival processes. In [10], the authors optimized the caching content update scheme to minimize the long-term average AoI of users in a heterogeneous network. The authors in [11] proposed a low complexity RL algorithm to minimize the sum of the expected AoI of all sensors. In [12], the authors studied the use of a new RL framework to optimize the AoI in a drone-assisted wireless network. However, most of these works [8]–[12] used centralized RL algorithms to determine the sampling and transmission decisions of all devices. In such centralized scenarios, each edge device can only sample the physical process after receiving the sampling command from the BS, which incurs an additional delay for environment monitoring and control. Moreover, using centralized RL algorithms, the BS must update the sampling and transmission schemes based on the entire set of the devices' local observations and actions, whose dimension increases exponentially with the number of devices. To address these challenges, one can use distributed RL solutions that allow each device to train its own machine learning model so as to determine the sampling action immediately. The authors in [13] proposed the use of distributed RL to minimize the energy used to transmit the sampled information under an AoI constraint. In [14], the authors developed a distributed deep RL algorithm to optimize device-to-device packet delivery over limited spectrum resources. The authors in [15] used a distributed sense-and-send protocol to minimize the average AoI. However, using the distributed algorithms in [13]–[15], each device can only train its local learning model with the local observation of the monitored physical process. Therefore, IoT devices may not be able to find optimal sampling and transmission schemes. In consequence, it is necessary to develop a novel distributed RL algorithm that allows IoT devices to cooperatively update the RL parameters based on the individual observation of each device, thus finding the optimal sampling and transmission schemes.

B. Contributions

The main contribution of this paper is a novel framework that enables a BS and devices in an IoT system to cooperatively monitor realistic physical processes simultaneously with a minimum AoI cost and energy consumption. Our key contributions include:

• We consider a real-time IoT system in which cellular-connected wireless IoT devices transmit their sampled information of numerous monitored physical processes to a BS that captures the dynamics of each physical process. For the considered model, the impact of the actual dynamics of each physical process on the sampling frequency of each device is explicitly considered. In addition, the wireless resources used for dynamic process transmission are limited and, hence, the BS needs to select an appropriate subset of devices to upload their status packets so as to reconstruct the monitored physical processes accurately.

• For this purpose, we first derive a closed-form expression for the relationship between the dynamics of the physical process and the sampling frequency of each device. Based on this relationship, the BS and IoT devices can cooperatively adjust the dynamic process sampling and uploading scheme so as to enable the BS to accurately reconstruct the monitored physical process. This joint sampling and device selection problem is formulated as an optimization problem whose goal is to minimize the weighted sum of AoI and energy consumption of all devices.

• To solve this optimization problem, a distributed QMIX algorithm is proposed to find the global optimal sampling policy for the devices. Compared to traditional RL algorithms, the proposed method enables each device to use its local observation to estimate the Q-value under global observation. Thus, with the proposed distributed QMIX algorithm, devices can find the optimal sampling policy using their local observations and yield a better performance compared to the one achieved in [13]–[15]. Given the sampling policies of all devices, the BS can directly optimize the device selection scheme using dynamic programming.

Simulations with real data of PM 2.5 pollution show that, compared to the conventional deep Q-network (DQN) method and the uniform sampling policy, the proposed algorithm can reduce the sum of AoI by up to 17.8% and 33.9% and the total energy consumption by up to 13.2% and 35.1%, respectively. To the best of our knowledge, this is the first work that considers the optimization of the sampling policy and device selection scheme for a real-time IoT system that consists of numerous realistic physical processes.

The rest of this paper is organized as follows. The system model and the problem formulation are described in Section II. Section III discusses the proposed learning framework for the optimization of the sampling policy and device selection scheme. In Section IV, numerical results are presented and discussed. Finally, conclusions are drawn in Section V.

II. SYSTEM MODEL AND PROBLEM FORMULATION

Consider a real-time IoT system that consists of a BS and a set M of M distributed IoT devices. In the considered model, each IoT device is equipped with a sensor and a transmitter. In particular, the sensor is used to monitor the real-time status of a physical system (e.g., an atmospheric sampler that monitors the variation of the atmospheric environment) and the transmitter is used to send the monitored information to the BS through a wireless channel, as illustrated in Fig.


Fig. 1. An illustration of the considered IoT network.

TABLE I
NOTATION

Notation | Description
M        | Number of devices
x_{m,t}  | Dynamics of the physical process
x̂_{m,t}  | Estimation of the physical process
A_m      | Linear coefficient matrix
ε_{m,t}  | Random process
y_{m,t}  | Estimation error
Ω_{m,t}  | Maximum variation frequency of the dynamics
∆_{m,t}  | Maximum sampling interval
ξ_m      | Minimum sample frequency
s_t      | Sampling action vector
u_t      | Resource allocation vector
τ        | Duration of each time slot
P_T      | Transmission power of each device
Z_m      | Data size of each sampled packet
I        | Number of resource blocks
l_{m,t}  | Uplink transmission delay
φ_{m,t}  | AoI at device m
Φ_{m,t}  | AoI at the BS
C_S      | Sampling cost for each packet
e_m      | Energy consumption
γ_E      | Weighting parameter of energy
γ_A      | Weighting parameter of AoI

1. Next, we first introduce the model of the physical process. Then, we introduce the AoI model used to measure the freshness of the monitored information of the physical process at the IoT devices and at the BS, respectively. Table I provides a summary of the notations used throughout this paper.

A. Model of Physical Process

We consider heterogeneous nonlinear time-varying dynamics to describe the variation of the physical process monitored by the IoT devices. These dynamics of the physical process over discrete time t can be given by [16]

x_{m,t+1} = A_m x_{m,t} + f_m(x_{m,t}) + ε_{m,t},   (1)

where x_{m,t} ∈ R^{Z_m} is the system state vector sampled by device m at time slot t, with Z_m representing the data size of the status information of device m, and ε_{m,t} is a bounded disturbance independent of the system state. f_m(·) : R^{Z_m} → R^{Z_m} is a nonlinear function satisfying f_m(0) = 0. A_m is a constant matrix related to the linear dynamics of the system. Note that (1) has been widely used to model the physical process of nonlinear dynamic systems such as wide-area irrigation systems, electric power grids, automated highway systems, and environmental detection systems. For example, the dynamics of the atmospheric environment quality can be captured by (1), with x_{m,t} being the current air pollution index and x_{m,t+1} being the dynamics of the air pollution index, while A_m x_{m,t} and f_m(x_{m,t}) represent the linear and nonlinear functions that capture the effects of wind and precipitation. Using (1), the current system state can be estimated based on the latest observed state, which is given by [17]

x̂_{m,t} = A_m^{δ(t)} x_{m,t−δ(t)} + Σ_{q=1}^{δ(t)} A_m^{q−1} f_m(x_{m,t−q}),   (2)

where x_{m,t−δ(t)} is the latest status information generated at time slot t−δ(t), with δ(t) being the duration of the generation time between x_{m,t} and x_{m,t−δ(t)}. Given the estimation of the system state vector at time slot t, the state estimation error can be expressed as

y_{m,t} = x_{m,t} − x̂_{m,t}.   (3)

In fact, y_{m,t} measures the estimation error of the current dynamics using the generated physical process model. Using y_{m,t}, each device can determine the sampling frequency at each time slot. To this end, we first need to calculate the maximum variation frequency of the physical process by analyzing the nonlinear dynamics of the physical system. For this purpose, (3) can be linearly approximated by [18]

dy_{m,t}/dt = (A_{m,t} + J_{f_m}(x_{m,t})) · y_{m,t} + o(‖y_{m,t}‖),   (4)

where (A_{m,t} + J_{f_m}(x_{m,t})) · y_{m,t} is a first-order approximation, with J_{f_m}(x_{m,t}) being the Jacobian matrix of function f_m, and o(‖y_{m,t}‖) is a high-order term that can be neglected compared to (A_{m,t} + J_{f_m}(x_{m,t})) · y_{m,t}. Then, we diagonalize A_{m,t} + J_{f_m}(x_{m,t}) to obtain the maximum variation frequency of the physical process at time slot t, which is given by

A_{m,t} + J_{f_m}(x_{m,t}) = U · diag(µ_{1,t}, · · · , µ_{Z_m,t}) · U^{−1},   (5)

where diag(µ_{1,t}, · · · , µ_{Z_m,t}) is a diagonal matrix with (µ_{1,t}, · · · , µ_{Z_m,t}) being the eigenvalues, and U = [u_1, · · · , u_{Z_m}] is a non-singular matrix with u_{z_m} ∈ R^{Z_m} being the corresponding eigenvectors of A_{m,t} + J_{f_m}(x_{m,t}). Based on (5), the time-domain maximum variation frequency of the physical process can be computed as [18]

Ω_{m,t} = max_{z_{m,t}∈Z_{m,t}} |Im[µ_{z_{m,t}}]| + √((‖y_{m,t}‖₂² + ‖ε_{m,t}‖₂²) / ξ_m²) − min_{z_{m,t}∈Z_{m,t}} Re[µ_{z_{m,t}}]²,   (6)

where Z_{m,t} is the set of indices z_{m,t}, ξ_m is the minimum frequency that device m can distinguish, and Im[µ_{z_{m,t}}] and Re[µ_{z_{m,t}}] are the imaginary part and the real part of µ_{z_{m,t}}, respectively. By assigning the sampling frequency F_{m,t} = Ω_{m,t}/π based on the Nyquist theorem, the maximum sampling interval of the dynamic physical process ∆_{m,t} can be given by

∆_{m,t} = 1/F_{m,t} = π/Ω_{m,t}.   (7)

From (6) and (7), we can see that the maximum sampling interval ∆_{m,t} is related to y_{m,t}. As y_{m,t} increases, the maximum variation frequency Ω_{m,t} increases and, hence, ∆_{m,t} decreases. This is due to the fact that, as the state estimation error y_{m,t} increases, (1) cannot describe the physical process accurately, which implies that device m must increase its sampling frequency so as to collect more status information to capture the variation in the physical process and correct (1).
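To make the sampling-interval computation in (4)-(7) concrete, the following Python sketch computes Ω_{m,t} and ∆_{m,t} from the linearized error dynamics. It is a minimal illustration that assumes the reading of (6) given above; jac_f is a hypothetical user-supplied helper returning the Jacobian of f_m, and the eigenvalues are obtained numerically rather than through an explicit diagonalization (5).

```python
import numpy as np

def max_sampling_interval(A_m, jac_f, x_mt, y_mt, eps_mt, xi_m):
    """Sketch of (4)-(7): maximum variation frequency and sampling interval.

    A_m    : (Z_m, Z_m) linear coefficient matrix of the dynamics in (1)
    jac_f  : callable returning the Jacobian of f_m at x_mt (assumed helper)
    y_mt   : state estimation error in (3), eps_mt : disturbance sample
    xi_m   : minimum frequency device m can distinguish
    """
    # Linearized error dynamics matrix in (4)-(5)
    H = A_m + jac_f(x_mt)
    mu = np.linalg.eigvals(H)          # eigenvalues mu_{1,t}, ..., mu_{Z_m,t}

    # Maximum variation frequency, following the reconstruction of (6) above
    omega = (np.max(np.abs(mu.imag))
             + np.sqrt((np.linalg.norm(y_mt) ** 2
                        + np.linalg.norm(eps_mt) ** 2) / xi_m ** 2)
             - np.min(mu.real ** 2))

    # Nyquist-based sampling frequency and maximum sampling interval (7)
    F_mt = omega / np.pi
    delta_mt = np.pi / omega
    return omega, F_mt, delta_mt
```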


B. AoI Models for IoT Devices

Different from the existing studies [2]–[7], in which the AoI at an edge device only depends on the time interval δ_m(t) between consecutive sampling instants, in this paper we consider the dynamic sampling frequency of a real-time physical system. Thus, the AoI at each device m is affected by both the maximum sampling interval ∆_{m,t} and δ_m(t), and is given by

φ_{m,t}(s_{m,t}) = max{0, δ_m(t) − ∆_{m,t}}, if s_{m,t} = 1; min{φ_{m,t−1} + τ, φ^{max}}, otherwise,   (8)

where max{0, δ_m(t) − ∆_{m,t}} represents the AoI for device m after sampling the physical process (i.e., s_{m,t} = 1), with δ_m(t) − ∆_{m,t} being the time interval between the current sampling action and the maximum sampling interval of the physical process. Meanwhile, min{φ_{m,t−1} + τ, φ^{max}} represents the AoI for device m that does not sample the physical process (i.e., s_{m,t} = 0), with τ being the duration of each time slot and φ^{max} being the maximum sampling interval. From (8), we can see that, if device m samples the physical process at time t and the time interval δ_m(t) is smaller than the maximum sampling interval ∆_{m,t} (i.e., δ_m(t) − ∆_{m,t} < 0), the AoI at device m decreases to zero. This is because, when δ_m(t) is smaller than ∆_{m,t}, the sampling frequency of device m satisfies the constraint of the Nyquist-Shannon sampling theorem and, thus, the sampled information can accurately represent the variation of the dynamic physical process. In contrast, when δ_m(t) − ∆_{m,t} > 0, the AoI at device m becomes δ_m(t) − ∆_{m,t}. This is due to the fact that the sampling frequency of device m cannot satisfy the constraint of the Nyquist-Shannon sampling theorem.
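The AoI update in (8) can be written as a short Python function. This is a minimal sketch with illustrative variable names, not code from the paper.

```python
def device_aoi_update(phi_prev, delta_m, Delta_mt, sampled, tau, phi_max):
    """AoI update at device m per (8).

    phi_prev : AoI at the previous slot, phi_{m,t-1}
    delta_m  : time since the last generated sample, delta_m(t)
    Delta_mt : maximum sampling interval from (7)
    sampled  : True if s_{m,t} = 1 (the device samples in this slot)
    """
    if sampled:
        # Sampling resets the AoI, unless the sample arrives later than the
        # Nyquist-based deadline Delta_mt, in which case the excess remains.
        return max(0.0, delta_m - Delta_mt)
    # No new sample: the AoI grows by one slot, capped at phi_max.
    return min(phi_prev + tau, phi_max)
```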

C. AoI Model for the BS

After generating the sampled information at time slot t, device m sends a request to the BS for sampled information transmission. In the considered system, each device can only sample the monitored physical process again after sending its current sampled information. An orthogonal frequency division multiple access (OFDMA) transmission scheme is used for sampled information transmission. We assume that the BS can allocate a set I of I uplink orthogonal resource blocks (RBs) to the devices and that each RB can be allocated to at most one device. The data rate of device m transmitting sampled information to the BS over an RB is

r_{m,t}(u_{m,t}) = u_{m,t} W log₂(1 + P_T h_{m,t} / σ_N²),   (9)

where W is the bandwidth of each RB, P_T is the transmit power of each device m, and u_{m,t} ∈ {0, 1} is a device selection index at time t, with u_{m,t} = 1 implying that device m is selected by the BS to upload its sampled information at time slot t, and u_{m,t} = 0 otherwise. h_{m,t} is the channel gain between device m and the BS, and σ_N² is the variance of the additive white Gaussian noise. Based on (9), the uplink transmission delay between device m and the BS is given by

l_{m,t}(u_{m,t}) = Z_m / r_{m,t}(u_{m,t}).   (10)

Given the uplink transmission delay, the AoI at the BS for device m can be expressed as

Φ_{m,t}(s_{m,t}, u_{m,t}) = φ_{m,t}(s_{m,t}) + l_{m,t}(u_{m,t}), if u_{m,t} = 1; min{Φ_{m,t−1} + τ, Φ^{max}}, otherwise,   (11)

where Φ^{max} is the maximum sampling interval. From (11), we can see that, if device m sends its sampled information to the BS at time slot t, then the AoI at the BS is updated to φ_{m,t} + l_{m,t}; otherwise, the AoI increases by τ.
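As an illustration of (9)-(11), the following Python sketch computes the uplink rate, the transmission delay, and the resulting AoI at the BS for one device in one slot. It is a minimal sketch under the paper's assumptions; all names are illustrative.

```python
import math

def bs_aoi_update(phi_mt, Phi_prev, selected, Z_m, W, P_T, h_mt, sigma2_N,
                  tau, Phi_max):
    """AoI at the BS per (9)-(11) for device m in one time slot.

    phi_mt   : AoI at the device from (8)
    Phi_prev : AoI at the BS in the previous slot, Phi_{m,t-1}
    selected : True if u_{m,t} = 1 (an RB is allocated to device m)
    """
    if selected:
        # Uplink rate over the allocated RB, (9), and transmission delay, (10)
        r_mt = W * math.log2(1.0 + P_T * h_mt / sigma2_N)
        l_mt = Z_m / r_mt
        # Freshly received update: device AoI plus transmission delay, (11)
        return phi_mt + l_mt
    # Not selected: the AoI at the BS grows by one slot, capped at Phi_max.
    return min(Phi_prev + tau, Phi_max)
```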

Since the BS monitors multiple physical processes, we adopt the sum AoI at the BS as a scalar quantity to measure the information freshness. The randomness of each physical process affects the AoI value and, hence, the average AoI is considered. We define the average sum AoI as

Φ_t(s_t, u_t) = (1/t) Σ_{i=1}^{t} E[ Σ_{m=1}^{M} Φ_{m,i}(s_{m,i}, u_{m,i}) ],   (12)

where E[·] is the expectation taken over the Rayleigh fading channel gain h_{m,t}, and s_t = [s_{1,t}, . . . , s_{M,t}] and u_t = [u_{1,t}, . . . , u_{M,t}] are the sampling and device selection vectors, respectively.

D. Energy Consumption Model

In our model, the energy used by each IoT device to sample and transmit the sampled information is

e_{m,t}(s_{m,t}, u_{m,t}) = s_{m,t} C_S + P_T l_{m,t}(u_{m,t}),   (13)

where s_{m,t} C_S is the energy consumption for sampling, with C_S being the cost of sampling the physical process, and P_T l_{m,t}(u_{m,t}) is the energy consumption for transmitting the sampled information. Moreover, since the BS is supplied by a continuous power source, we do not consider the energy consumption of the BS. The average sum energy consumption is given by

e_t(s_t, u_t) = (1/t) Σ_{i=1}^{t} E[ Σ_{m=1}^{M} e_{m,i}(s_{m,i}, u_{m,i}) ].   (14)
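For completeness, here is a minimal Python sketch of the per-slot energy model in (13). The default values follow the simulation parameters in Table II (C_S = 0.5 mJ, P_T = 0.5 W); the function name is illustrative only.

```python
def device_energy(sampled, l_mt, C_S=0.5e-3, P_T=0.5):
    """Per-slot energy of device m per (13): sampling cost plus transmit energy.

    sampled : True if s_{m,t} = 1
    l_mt    : uplink delay from (10); pass 0 when the device is not selected,
              since no transmit energy is spent without an allocated RB.
    """
    return (C_S if sampled else 0.0) + P_T * l_mt
```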

E. Problem Formulation

Next, we introduce our optimization problem. Our goal is to minimize the weighted sum of the AoI and energy consumption of all devices by optimizing the sampling vector s_t and the device selection vector u_t. The optimization problem is given by

min_{s_t, u_t}  γ_A Φ_t(s_t, u_t) + γ_E e_t(s_t, u_t)   (15)

s.t.  s_{m,t}, u_{m,t} ∈ {0, 1},  ∀m ∈ M, ∀t ∈ T,   (15a)

      Σ_{m∈M} u_{m,t} ≤ I,  ∀t ∈ T,   (15b)

where γ_A and γ_E are the scaling parameters. Constraint (15a) guarantees that each device can sample the physical process once and can only occupy at most one RB for sampled information transmission at each time slot. Constraint (15b) ensures that each uplink RB can be allocated to at most one device.
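The per-slot contribution to the objective in (15) is the weighted sum of the AoI at the BS, (11), and the energy, (13), over all devices. A minimal Python sketch (illustrative names, default weights from Table II):

```python
def per_slot_cost(Phi_list, energy_list, gamma_A=0.5, gamma_E=0.5):
    """Per-slot weighted objective in (15): gamma_A * AoI + gamma_E * energy,
    summed over all devices m = 1, ..., M."""
    return sum(gamma_A * Phi + gamma_E * e
               for Phi, e in zip(Phi_list, energy_list))
```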


The problem in (15) is challenging to solve using conventional optimization algorithms for the following reasons. First, to find the optimal sampling and transmission schemes using traditional optimization algorithms, the BS must collect the information related to the dynamics of each monitored physical process. However, the dynamics of the physical process monitored by each device are not available to the BS until the devices sample the physical process and successfully upload the sampled information to the BS. Thus, each device must determine its sampling and transmission schemes based on the current dynamics of the monitored physical process. However, each device can only observe and analyze its local environment to determine its own policy. With partial observation, each device cannot find the optimal sampling and transmission schemes using traditional optimization methods, which require the dynamics of all monitored physical processes. In consequence, we propose a distributed RL algorithm that enables each device to use its local observation to estimate the Q-value under global observation and, thus, cooperatively optimize the sampling and transmission schemes to minimize the weighted sum of the AoI and energy consumption.

III. QMIX METHOD FOR OPTIMIZATION OF SAMPLING POLICY

In this section, a novel distributed RL approach for optimizing the sampling policy s_t in (15) is proposed. In particular, the components of the proposed RL method are first introduced. Then, the process of using the proposed RL method to find the global optimal sampling policy for each device is explained. Given the sampling policy of each device, problem (15) is simplified and directly solved by dynamic programming. Finally, we analyze the convergence and complexity of the proposed RL method.

A. Components of Distributed RL Method

The proposed distributed RL method consists of six components: a) agents, b) actions, c) states, d) reward, e) individual value function, and f) global value function, which are specified as follows:

• Agents: The agents that perform the proposed RL algorithm are the distributed IoT devices. In particular, at each slot, each IoT device must decide whether to sample the physical process based on its local observation.

• Actions: An action of each device m is s_{m,t}, which represents the sampling policy of device m at time slot t. Thus, the vector of all devices' actions at time slot t is s_t = [s_{1,t}, . . . , s_{M,t}].

• States: An environment state is defined as o_t = [o_{1,t}, . . . , o_{M,t}], where o_{m,t} = [φ_{m,t}, F_{m,t}, s_{m,t−1}, u_{m,t−1}] represents the local observation of device m, with φ_{m,t} being the current AoI of device m and F_{m,t} being the current sampling frequency of device m. Here, s_{m,t−1} and u_{m,t−1} are the recorded historical sampling and transmission policies at time slot t−1, respectively. In the considered model, each device m that is used to monitor a physical process can only observe its local state o_{m,t}.

• Reward: The reward of any sampling action at each device captures the weighted sum of the AoI and energy consumption resulting from the generation of the sampled information. Thus, the reward of each device can only be obtained when the sampled information is received by the BS successfully. To this end, the BS must first determine the device selection scheme u_t, which can be found by the following theorem.

Theorem 1. Given the global state o_t and the sampling policy s_t, the optimal device selection scheme u_t is given by

u*_{m,t} = 1, if m ∈ M_1; 0, otherwise,   (16)

where M_1 = {m ∈ M | C_{m,t,1} < 0} with C_{m,t,1} = γ_A(E[l_{m,t}(1)] + φ_{m,t}(s_{m,t}) − Φ_{m,t−1} − τ) + γ_E P_T E[l_{m,t}(1)].

Proof: See Appendix A.

From Theorem 1, we see that the BS must first collect the global environment state o_t and the sampling actions of all devices s_t to find u_t. Naturally, in the considered model, o_t and s_t can be obtained by the BS. This is because, when device m samples the monitored physical process, the BS receives a transmission request from device m and allocates an RB to device m. Otherwise, the BS does not receive a request message from device m. Based on the information of the received requests, the BS is aware of the sampling action of each device and can thus obtain s_t. Moreover, using the recorded information, the historical sampling action s_{m,t−1} and device selection action u_{m,t−1} are both available at the BS. In addition, after receiving the sampled information, the AoI of each device φ_{m,t} and the sampling frequency F_{m,t} can be calculated using s_{m,t}, s_{m,t−1}, and u_{m,t−1}. Thus, the BS can obtain the global environment state o_t. Given the global state o_t and the sampling policy s_t, the optimal device selection scheme u_t can be determined. Based on Theorem 1, the reward function of each device is given by

R_{m,t}(o_{m,t}, s_{m,t}) = −(γ_A Φ_{m,t}(s_{m,t}, u_{m,t}) + γ_E e_{m,t}(s_{m,t}, u_{m,t})),   (17)

where γ_A Φ_{m,t}(s_{m,t}, u_{m,t}) + γ_E e_{m,t}(s_{m,t}, u_{m,t}) is the per-slot objective of each device in (15). Note that R_{m,t}(o_{m,t}, s_{m,t}) increases as the weighted sum of the AoI and energy consumption of device m decreases, which implies that maximizing the reward of each device minimizes the weighted sum of the AoI and energy consumption.

• Individual value function: The individual value function of each device m is defined as Q_m(o_{m,t}, s_{m,t}), which records the current local state and action and is transmitted to the BS for the estimation of the global value function. Based on the recorded historical information, each device can select the optimal sampling action s_{m,t} using its local observation o_{m,t}. However, due to the extremely high dimension of the state space with the continuous variable F_{m,t}, it is computationally infeasible to obtain the optimal actions using the standard finite-state Q-learning algorithm [19]. Hence, we adopt a DQN approach to approximate the action-value function using a deep neural network Q_m(o_{m,t}, s_{m,t}|θ_m), where θ_m is used to map the input local observation o_{m,t} to the output action s_{m,t}.

• Global value function: We define a global value function Q_tot(o_t, a_t) that is generated by a mixing network f(·) at the BS to estimate all distributed devices' achievable future rewards at every global environment state o_t. Different from value decomposition networks (VDN) [20], in which the global value function is defined as Σ_{m=1}^{M} Q_m(o_{m,t}, s_{m,t}|θ_m), we use a mixing network to estimate the value of Q_tot(o_t, a_t) from the individual value functions collected from the distributed devices. The relationship between the global value function Q_tot(o_t, a_t) generated by the BS and Q_m(o_{m,t}, s_{m,t}|θ_m) generated by each device is given by

Q_tot(o_t, a_t) = f(u_{1,t} Q_1(o_{1,t}, s_{1,t}|θ_1), . . . , u_{M,t} Q_M(o_{M,t}, s_{M,t}|θ_M))
               = u_t w_t [Q_1(o_{1,t}, s_{1,t}|θ_1), . . . , Q_M(o_{M,t}, s_{M,t}|θ_M)] + b_t
               = Σ_{m=1}^{M} u_{m,t} (w_{m,t} Q_m(o_{m,t}, s_{m,t}|θ_m) + b_{m,t}),   (18)

where f(·) is the mixing network that is used to combine Q_m(o_{m,t}, s_{m,t}|θ_m) from each device m monotonically, with w_t = [w_{1,t}, . . . , w_{M,t}] and b_t = [b_{1,t}, . . . , b_{M,t}] being the weights and the biases of the mixing network, respectively (a code sketch of this mixing step is given after this list). Here, we note that the value of Q_m(o_{m,t}, s_{m,t}|θ_m) can only be obtained by the BS if device m is selected. Otherwise, without an allocated RB, device m cannot communicate with the BS and, hence, the value of Q_m(o_{m,t}, s_{m,t}|θ_m) at device m cannot be used to generate Q_tot(o_t, a_t) at the BS.

B. QMIX for Optimization of the Sampling Policy

Given the components of the proposed QMIX algorithm, we now introduce the entire procedure of training the proposed distributed QMIX algorithm to find the global optimal sampling policy and device selection scheme. The aim of the training process is to minimize the temporal difference (TD) error metric, defined as

L(θ_1, . . . , θ_M) = E[ (Q_tot(o_t, a_t) − R_t(o_t, a_t) − γ max_{a'_t} Q_tot(o_{t+1}, a'_t))² ],   (19)

where R_t(o_t, a_t) = Σ_{m=1}^{M} u_{m,t} R_{m,t}(o_{m,t}, s_{m,t}) and γ is the discount factor. To minimize the TD error defined in (19) at the distributed devices, we can observe the following: a) given Q_m(o_{m,t}, s_{m,t}|θ_m) and u_{m,t}, calculating Q_tot(o_t, a_t) depends only on the mixing network f(·) at the BS, and b) given Q_tot(o_t, a_t) and u_t, updating Q_m(o_{m,t}, s_{m,t}|θ_m) depends only on the DQN parameters θ_m at each device m.

Fig. 2. The structure of the QMIX network.

According to these observations, we can separate the training process of the proposed RL method into two stages: 1) a BS training stage and 2) an IoT device training stage, which are given as follows:

• BS training stage: In this stage, the BS selects a subset of devices to transmit their sampled information and generates the global value function Q_tot(o_t, a_t). In particular, after executing the selected action s_{m,t}, device m sends a request to the BS for sampled information transmission. Using the current and historical request information from all devices, the BS can obtain the global state o_t and determine the optimal device selection u_t based on Theorem 1. Given u_t, the sampled information of the monitored physical processes and the values of Q_m(o_{m,t}, s_{m,t}|θ_m) are collected by the BS. To estimate all devices' achievable future rewards, the BS generates the global value function Q_tot(o_t, a_t) using the mixing network, which takes Q_m(o_{m,t}, s_{m,t}|θ_m) and o_t as input. Given the weights and the biases of each linear layer of the mixing network, the BS can compute Q_tot(o_t, a_t) using (18).

• IoT device training stage: In this stage, each IoT device must decide whether to sample the physical process. In particular, at each time slot, each device m observes the local state o_{m,t} and chooses an action s_{m,t}. The action of device m is selected via the ε-greedy method [21] as follows:

s_{m,t} = argmax_{s_{m,t}∈S} Q_m(o_{m,t}, s_{m,t}|θ_m), with probability ε; randint(1, |S|), with probability 1−ε,   (20)

where ε is the probability of exploitation, |S| is the number of available actions for each device, and randint(1, |S|) is a random integer function that uniformly generates an integer ranging from 1 to |S|. As the monitored physical process is sampled by a device, a transmission request is sent and, thus, the BS can allocate the limited RBs for sampled information transmission and collect the local observations to generate the global value function Q_tot(o_t, a_t). As Q_tot(o_t, a_t) is received as feedback from the BS, an experience defined in the set G_m = {(o_{m,1}, a_{m,1}, R(o_{m,1}, a_{m,1})), . . . , (o_{m,G}, a_{m,G}, R(o_{m,G}, a_{m,G}))}


will be recorded by each device m ∈ M. Then, each device selects a random batch g_m from G_m to update its value function so as to accurately estimate future rewards. The update rule of the individual value function at each device is given by

Δθ_m = θ_m^{i+1} − θ_m^{i} = α_m ∇_{θ_m} L(θ_1, . . . , θ_M)
     = α_m ∇_{θ_m} [ (Q_tot(o_t, a_t) − R(o_t, a_t) − γ max_{a'_t} Q_tot(o_{t+1}, a'_t))² ]
     = 2 κ_{m,t} ΔQ_tot ∇_{θ_m} Q_m(o_{m,t}, s_{m,t}|θ_m),   (21)

where ΔQ_tot = Q_tot(o_t, a_t) − R(o_t, a_t) − γ max_{a'_t} Q_tot(o_{t+1}, a'_t) and κ_{m,t} = α_m u_{m,t} w_{m,t}, with α_m being the update step size of Q_m(o_{m,t}, s_{m,t}|θ_m) at each distributed device m. From (21), we can see that, when device m is allocated an RB, the BS can transmit the values of Q_tot(o_t, a_t) and w_{m,t}, which are generated by the mixing network using o_t and Q_m(o_m, s_m|θ_m), to the selected devices; thus, device m can update its own sampling policy by choosing greedy actions with respect to Q_m(o_m, s_m|θ_m). Otherwise, the device cannot communicate with the BS to obtain Q_tot(o_t, a_t) and cannot participate in the update with global state information.
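To make (20) and (21) concrete, the following Python sketch shows the ε-greedy action choice and a single parameter update for one device, assuming a generic differentiable Q_m whose gradient with respect to θ_m is supplied by the caller. It follows the ε convention used in (20), where ε is the exploitation probability, and applies the update as a gradient-descent step on the TD loss; the function and variable names are illustrative, not from the paper.

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng):
    """Action choice per (20): greedy with probability epsilon, random otherwise.
    Returns a 0-based action index."""
    if rng.random() < epsilon:
        return int(np.argmax(q_values))
    return int(rng.integers(len(q_values)))

def qmix_device_update(theta_m, grad_qm, q_tot, reward_t, q_tot_next_max,
                       alpha_m, u_mt, w_mt, gamma=0.95):
    """One update of device m's parameters following (21).

    grad_qm        : gradient of Q_m(o_mt, s_mt | theta_m) w.r.t. theta_m
    q_tot          : Q_tot(o_t, a_t) fed back by the BS
    q_tot_next_max : max_{a'} Q_tot(o_{t+1}, a') (target value)
    u_mt, w_mt     : selection indicator and mixing weight for device m
    """
    delta_q_tot = q_tot - reward_t - gamma * q_tot_next_max   # TD error in (19)
    kappa = alpha_m * u_mt * w_mt                              # kappa_{m,t} in (21)
    # Gradient-descent step on the TD loss; unselected devices (u_mt = 0) do not move.
    return theta_m - 2.0 * kappa * delta_q_tot * grad_qm

rng = np.random.default_rng(0)
action = epsilon_greedy(np.array([0.2, -0.1]), epsilon=0.9, rng=rng)
```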

The entire process of training the proposed QMIX algorithm is shown in Algorithm 1. At the beginning of the algorithm, each device selects its sampling action using the initial individual value function Q_m(o_{m,t}, s_{m,t}|θ_m). After that, the set of devices that sample the monitored physical processes request RB allocation from the BS. Using the transmission requests, the BS can obtain the local observation o_{m,t} from each device m to determine the optimal u_t. Then, the BS collects the sampled information and the value of Q_m(o_{m,t}, s_{m,t}|θ_m) from the devices selected in u_t. The collected Q_m(o_{m,t}, s_{m,t}|θ_m) and o_{m,t} are used as inputs to the mixing network at the BS, which calculates the global value function Q_tot(o_t, a_t). Given Q_tot(o_t, a_t) from the BS, the devices update their value parameters θ_m based on (21) and determine the sampling action with the updated value functions so as to minimize the weighted sum of AoI and energy consumption.

C. Convergence and Complexity of the Proposed Algorithm

1) Convergence of the Proposed Algorithm: In this section, we first prove that the proposed RL method converges. However, we cannot find the exact value that the proposed RL method reaches; our goal is to show that the proposed algorithm will not diverge. For this purpose, we first introduce the following definition of the gap between the optimal global Q-function Q*(o_t, a_t) and the optimal QMIX value Q*_tot(o_t, a_t).

Definition 1. The gap between the optimal global Q-value and the optimal QMIX value is defined as

ε(o_t, a_t) = Q*(o_t, a_t) − Q*_tot(o_t, a_t),   (22)

where Q*(o_t, a_t) = Σ_{o'_t} P_{a_t}(o_t, o'_t) [ R(o_t, a_t) + γ max_{a'_t∈A} Q*(o'_t, a'_t) ], with P_{a_t}(o_t, o'_t) being the transition probability matrix, and Q*_tot(o_t, a_t) = f(u_{1,t} Q*_1(o_{1,t}, s_{1,t}), . . . , u_{M,t} Q*_M(o_{M,t}, s_{M,t})), with Q*_m(o_{m,t}, s_{m,t}) being the optimal Q_m(o_{m,t}, s_{m,t}).

Algorithm 1 QMIX method
Input: The environment state O, the sampling action space S.
Output: The sampling and device selection policy.
1: Initialize θ_m of the DQN in each IoT device m and the weights of the mixing network at the BS.
2: for iteration i = 1 : H do
3:   for each device m do
4:     Observe the current local observation o_{m,t}.
5:     Choose an action s_{m,t} according to Q_m(o_{m,t}, s_{m,t}|θ_m).
6:   end for
7:   The BS observes the local state o_{m,t} from each device m to obtain the global state o_t = [o_{1,t}, . . . , o_{M,t}].
8:   Given o_t, the BS optimizes the device selection scheme u_t using Theorem 1 and collects the sampled information from each selected device m.
9:   The BS generates the global value function Q_tot(o_t, a_t) and transmits it to the selected devices.
10:  for each selected device m do
11:    Receive Q_tot(o_t, a_t) and record the experience (o_{m,t}, a_{m,t}, R(o_{m,t}, a_{m,t})) in G_m.
12:    Choose a random batch g_m from G_m.
13:    Update θ_m using (21).
14:  end for
15: end for

From Definition 1, we can see that the relationship between the optimal value of the Q-function and the QMIX approach can be captured by a constant ε(o_t, a_t) at each (o_t, a_t). Moreover, as verified in [22], the Bellman operator in a traditional Q-network is a contraction operator with respect to the sup-norm over the globally observable information O × A. Next, we prove that the Bellman operator in the QMIX method is also a contraction operator with respect to the sup-norm over the partially observable information {O_1, · · · , O_M} × {A_1, · · · , A_M}, which implies that the gap ε(o_t, a_t) does not affect the contraction of the Bellman operator in the QMIX method.

Lemma 1. When I ≥ M, the optimal QMIX function is a fixed point of a contraction operator H Q_tot in the sup-norm with modulus γ, i.e., ‖H Q¹_tot − H Q²_tot‖_∞ ≤ γ ‖Q¹_tot − Q²_tot‖_∞.

Proof: See Appendix A.

From Lemma 1, we observe that the Bellman operator in the QMIX method is γ-contractive in the sup-norm. Such a contraction property constructs a sequence of action-value functions {Q^i_tot}_{i≥0}, where the initialization function Q⁰_tot is arbitrary. Using Lemma 1, we next prove that the sequence {Q^i_tot}_{i≥0} converges as the QMIX algorithm empirically learns from a batch of data iteratively.

Theorem 2. The QMIX method converges to Q*_tot(o_t, a_t).

Proof: See Appendix B.

Theorem 2 shows that the QMIX learning method always converges to Q*_tot(o_t, a_t). In addition, we can also see that the gap ε(o_t, a_t), which depends on the initialized weights of the neural network, affects the performance of the sampling policy; thus, the sampling policy s_t obtained by the QMIX algorithm may not be the optimal solution of (15).

2) Complexity of the Proposed Algorithm: Next, we present the complexity of the optimization of the device selection scheme and of training the QMIX algorithm, which consists of distributed DQNs at each IoT device and a mixing network at the BS, as detailed next.

a) First, we explain the complexity of optimizing the device selection, which lies in obtaining the optimal u*_t. According to Theorem 1, the complexity of optimizing the device selection is O(MT).

b) In terms of the complexity of training the distributed DQNs, each device needs to update its own RL parameters θ_m, which are used to determine the sampling action s_{m,t} based on the local observation o_m. Hence, the time complexity of training a DQN, which is a fully connected network, depends on the dimensions of the input o_{m,t} and the output s_{m,t}, as well as the number of neurons in each hidden layer [23]. Let l_i denote the number of neurons in hidden layer i and L_D denote the number of hidden layers in the DQN of device m. The time complexity of the DQN is O(Σ_{i=1}^{L_D} l_i l_{i+1} + |o_m| l_1 + |s_m| l_{L_D}), where |o_m| and |s_m| represent the dimensions of o_{m,t} and s_{m,t}, respectively.

c) In terms of the complexity of training the mixing network, the BS first needs to generate the weights and the biases of the mixing network, which is a feed-forward neural network. Hence, the time complexity of the mixing network lies in the generation of the weights and the biases of each layer. In particular, the weights of each layer in the mixing network are generated by separate hypernetworks [24] that consist of two fully-connected layers with a ReLU nonlinearity, followed by an absolute activation function to ensure the non-negativity of the mixing network weights. Moreover, the biases are produced in the same manner but are not restricted to being non-negative. Hence, the time complexity of generating the parameters of the mixing network is O(|o_t| L_W |w_t| + L_B |b_t|) [25], where L_W and L_B denote the numbers of neurons used to produce the weights and the biases, respectively, and |w_t| and |b_t| represent the dimensions of w_t and b_t, respectively.

IV. SIMULATION RESULTS AND ANALYSIS

In our simulations, we consider a circular network area with a radius r = 100 m. One BS is located at the center of the network area and M = 20 IoT devices are uniformly distributed in it. The data used to model the real-time dynamics is obtained from the Center for Statistical Science at Peking University [26]. Table II defines the values of the other parameters. A uniform sampling policy and the traditional fully distributed DQN method are considered for comparison.

TABLE II
SIMULATION PARAMETERS [27]

Parameters | Values  | Parameters | Values
M          | 20      | τ          | 1 s
I          | 10      | σ_N²       | −95 dBm
W          | 180 kHz | ξ_m        | 10 Hz
P_T        | 0.5 W   | γ_A        | 0.5
C_S        | 0.5 mJ  | γ_E        | 0.5
φ^max      | 5       | Φ^max      | 5
Z_m        | 10 bit  | r          | 100 m

Fig. 3. Value of the reward function as the total number of iterations varies.

Fig. 3 shows how the value of the reward function changes as the total number of iterations varies. In Fig. 3, the line and the shadow are the mean and the standard deviation computed for 20 users with 10 RBs. Due to the limited RBs, the BS cannot collect Q_m(o_m, s_m|θ_m) from all devices at each time slot. To investigate how the constraint on the number of RBs affects the performance of the QMIX algorithm, we compare the proposed QMIX with partial information of Q_m(o_m, s_m|θ_m) from the selected devices to a QMIX with global information of Q_m(o_m, s_m|θ_m) from all devices in the same system. From Fig. 3, we can see that, compared to the traditional DQN algorithm, the proposed QMIX algorithm achieves better performance at the beginning of the training process. This is because, at the beginning of the training process, the BS can collect the local observation of each device to generate the global value function, which enables each device to quickly adjust its sampling policy based on the local state information and results in a rapid improvement at the beginning of the training process. However, the proposed QMIX approach suffers a 38.1% loss in terms of the number of iterations needed to converge compared to the QMIX algorithm with global information. This is due to the fact that, under a limited number of RBs, the BS can only collect a subset of the Q_m(o_m, s_m|θ_m) values from the selected devices to generate the global value function. With partial information, the proposed algorithm cannot capture the relationship between the sampling policies of all devices during one iteration, thus decreasing the convergence speed. Fig. 3 also shows that the proposed algorithm can achieve up to 24.4% gains in terms of the weighted sum of AoI and energy consumption compared with the DQN algorithm. This implies that the proposed algorithm enables the devices to cooperatively train their learning models based on the estimation of the strategy outcomes, thus improving the performance of the sampling policies of the distributed devices.

Fig. 4. The sum of AoI and the total energy consumption as the number of IoT devices varies. (a) The sum of AoI vs. the number of IoT devices. (b) The total energy consumption vs. the number of IoT devices.

In Fig. 4, we show how the sum of the AoI and the total energy consumption of all devices change as the number of edge devices varies. From Fig. 4(a), we can see that the sum AoI increases as the number of devices increases. This is due to the fact that the number of RBs is limited in the considered system and, hence, as the number of devices increases, some devices may not be able to sample and transmit their monitored information to the BS immediately, thus resulting in an increase of the sum of AoI. Moreover, the sum AoI increases rapidly as the number of devices continues to increase. This is because, as the number of RBs becomes much smaller than the number of devices, most of the devices must wait until they are allocated an RB to update their sampled information, which results in a large growth in AoI. In Fig. 4(a), we can also see that the proposed algorithm can reduce the sum of the AoI by up to 17.8% and 33.9% compared to the sampling policy based on DQN and the uniform sampling policy, respectively, for the case with 10 RBs and 40 devices. This gain stems from the fact that the proposed algorithm enables the BS to observe the global state information so as to generate the global value function, which enables each distributed device to achieve a better sampling policy. From Fig. 4(b), we can see that, as the number of devices increases, the total energy consumption increases.

This is due to the fact that, as the number of devices increases, the number of devices that must sample the physical process and transmit the sampled information to the BS increases and, hence, the total energy consumption for status sampling and uploading increases. Fig. 4(b) also shows that, when the number of devices is larger than 20, the total energy consumption of all algorithms remains nearly constant because of the limited number of available RBs. From Fig. 4(b), we can see that the proposed algorithm can reduce the total energy consumption by up to 13.2% and 35.1% compared to the sampling policy based on DQN and the uniform sampling policy for the case with 10 RBs and 10 devices. This gain stems from the fact that the proposed algorithm enables the BS to collect the information of the AoI and the physical dynamics from the distributed devices, thus being able to select fewer devices so as to reduce the sum of AoI and capture the variation of the physical process with less energy.

Fig. 5. The average sample interval and the average queue delay as the number of IoT devices varies. (a) The average sample interval vs. the number of IoT devices. (b) The average queue delay vs. the number of IoT devices.

Fig. 5 shows how the average sample interval and the average queue delay change as the number of IoT devices varies. Clearly, from Fig. 5(a), we can see that, as the number of devices increases, the average sample interval increases. This is because the number of RBs is limited and, hence, as the number of devices increases, the probability that each device can be allocated a transmission opportunity to upload its sampled status information decreases. In consequence, each device must increase its sample interval so as to decrease the number of sampled packets for energy saving. Fig. 5(a) also shows that the average sample intervals of all algorithms are essentially identical. This is because the devices are used to monitor different physical processes, which change as time elapses. For this purpose, all algorithms must fully utilize the limited energy and RBs for status sampling and uploading and, hence, the average sample intervals of all algorithms are basically the same. From Fig. 5(b), we can see that, as the number of users increases, the average queue delay of each sampled packet increases. This stems from the fact that, as the number of users increases, the BS may not be able to allocate the limited RBs to all devices immediately, thus resulting in an additional queue delay for each sampled packet. Fig. 5(b) also shows that the proposed algorithm achieves up to 29.3% and 15.9% gains in terms of the average queue delay compared to the uniform sampling policy and the sampling policy based on DQN, respectively. This is because the proposed algorithm enables the devices to learn the sampling policies from each other by using the global value function generated by the BS and, hence, to cooperatively sample the physical process and reduce the queue delay.

Fig. 6 shows an example of the estimation of the physical process for different numbers of users. In this figure, we can see that, as the physical process varies rapidly, the estimation error of PM 2.5 increases. The reason is that, as the index of the PM 2.5 changes rapidly, each device must increase its sampling frequency so as to collect more status information to capture the variation of the physical process. However, with limited RBs, each device may not be able to transmit these sampled packets to the BS immediately and, hence, the estimation error of the proposed approach increases. Fig. 6 also shows that, as the index of the PM 2.5 changes slowly, the estimation error is not reduced to zero, especially for the case with 40 devices. This stems from the fact that, as the number of devices is much larger than the number of RBs, the devices must cooperatively sample the different monitored physical processes. Thus, as the index of the PM 2.5 changes slowly, each device will not sample this physical process so as to save energy and the limited RBs, which can be allocated to other devices with fast-changing physical processes. From Fig. 6, we can also see that the proposed QMIX approach achieves better estimation accuracy compared to the sampling policy based on DQN and the uniform sampling policy. This implies that the proposed QMIX approach enables the devices to cooperatively train the sampling policy so as to monitor the physical process accurately.

Fig. 7 shows how the estimation error changes as the number of users varies. Clearly, as the number of users increases, the estimation error increases. This is because the number of RBs is limited in the considered system. Thus, as the number of users increases, the probability that each device can be allocated an RB for status uploading decreases. In consequence, each device must increase its sample interval to ensure that the sampled packets can be transmitted in a timely manner, which results in an increase of the estimation error.

Fig. 6. The estimation of the dynamics of the physical process based on different algorithms (index of PM 2.5 (µg/m³) versus time slot; curves: dynamic physical process, sampling policy based on QMIX, sampling policy based on DQN, uniform sampling policy): (a) 10 users with 10 RBs; (b) 20 users with 10 RBs; (c) 30 users with 10 RBs; (d) 40 users with 10 RBs.


Fig. 7. The estimation error of the dynamic physical process (estimation error of PM 2.5, µg/m³) as the number of users varies, for the sampling policy based on QMIX, the sampling policy based on DQN, and the uniform sampling policy.

Fig. 8. The minimum of the energy consumption and the AoI of all devices change as the scaling parameters vary (sum of the AoI (s) versus total energy consumption (mJ), for (γE, γA) = (0.9, 0.1), (0.8, 0.2), (0.7, 0.3), and (0.5, 0.5); curves: sampling policy based on QMIX, sampling policy based on DQN).

Fig. 7 also shows that, as the number of devices continues to increase, the estimation error increases slowly. This is due to the fact that, as the number of devices becomes much larger than the number of RBs, the BS cannot collect the sampled packets from each device immediately and, as a result, the estimation error increases slowly since the outdated sampled packets are ineffective for monitoring the physical processes. From Fig. 7, we can also see that the proposed algorithm reduces the estimation error by up to 19.4% and 33.8% compared to the sampling policy based on DQN and the uniform sampling policy. This gain stems from the fact that the proposed algorithm enables the devices to cooperatively adjust their sampling policies based on the global state information collected by the BS, thus achieving a better monitoring performance.

Fig. 8 shows how the minimum of the energy consumption and the AoI of all devices change as the scaling parameters vary. In this figure, each point represents the minimum energy consumption and AoI achieved by the considered algorithms under the given scaling parameters γA and γE. Here, as γE decreases while γA increases, the minimum AoI of all devices decreases. This is because, as γE decreases and γA increases, the considered algorithms focus more on the minimization of the AoI of all devices. Thus, the IoT devices must increase their sampling frequency to capture the variation of the physical process so as to minimize their AoI.

In Fig. 8, we also see that the proposed QMIX algorithm achieves a better performance. This stems from the fact that the proposed QMIX algorithm enables the distributed devices to adjust their sampling policies by using the global state information collected by the BS. Fig. 8 also shows that, as γE decreases while γA increases, the gap between the minimum energy consumption and AoI achieved by DQN and by the QMIX approach decreases, since the number of RBs that the considered algorithms can optimize is limited.

V. CONCLUSION

In this paper, we have considered a real-time IoT system used to capture the variation of a physical process. We have formulated an optimization problem that seeks to adjust the sampling policy of each distributed device and the device selection scheme of the BS so as to minimize the weighted sum of the AoI and the total energy consumption of all devices. To solve this problem, we have developed a distributed QMIX algorithm that enables edge devices to cooperatively update the RL parameters using the global value function generated by the BS based on the observed global state information, thus improving the performance of the sampling policy. Given the sampling policy, we have optimized the device selection scheme to minimize the weighted sum of AoI and energy consumption of all devices. Simulation results have shown that the proposed approach yields significant gains compared to conventional approaches.

APPENDIX

A. Proof of Theorem 1

To determine the optimal device selection scheme u_t, we first need to build the relationship between u_t and the sampling policy s_t. For this purpose, the AoI of the devices in (12) can be rewritten as

\begin{align}
&\gamma_A\Phi_t(s_t,u_t) + \gamma_E e_t(s_t,u_t) \nonumber\\
&=\frac{1}{t}\sum_{i=1}^{t}\mathbb{E}\Bigg[\sum_{m=1}^{M}\gamma_A\Phi_{m,i}(s_{m,i},u_{m,i})+\gamma_E e_{m,i}(s_{m,i},u_{m,i})\Bigg] \nonumber\\
&=\frac{1}{t}\sum_{i=1}^{t}\sum_{m=1}^{M}\mathbb{E}\Big[\gamma_A\big[(1-u_{m,i})(\Phi_{m,i-1}+\tau)+u_{m,i}\big(l_{m,i}(u_{m,i})+\phi_{m,i}(s_{m,i})\big)\big] \nonumber\\
&\qquad\qquad\qquad+\gamma_E\big[s_{m,i}C_S+P_T\,l_{m,i}(u_{m,i})\big]\Big] \nonumber\\
&=\frac{1}{t}\sum_{i=1}^{t}\sum_{m=1}^{M}\Big[\gamma_A\big[(1-u_{m,i})(\Phi_{m,i-1}+\tau)+u_{m,i}\big(D_{m,i}+\phi_{m,i}(s_{m,i})\big)\big] \nonumber\\
&\qquad\qquad\qquad+\gamma_E\big[s_{m,i}C_S+P_T D_{m,i}\big]\Big], \tag{23}
\end{align}

where D_{m,i}(u_{m,i}) = E[l_{m,i}(u_{m,i})] and Φ_{m,i−1} is the simplified notation for Φ_{m,i−1}(s_{m,i−1}, u_{m,i−1}). According to (23), the weighted sum of AoI and energy consumption of device m at time slot t can be written as

\begin{align}
&\gamma_A\Phi_{m,t}(s_{m,t},u_{m,t}) + \gamma_E e_{m,t}(s_{m,t},u_{m,t}) \nonumber\\
&=\gamma_A\big[(1-u_{m,t})(\Phi_{m,t-1}+\tau)+u_{m,t}(D_{m,t}+\phi_{m,t})\big]+\gamma_E\big(s_{m,t}C_S+P_T D_{m,t}\big) \nonumber\\
&=u_{m,t}\big[\gamma_A(D_{m,t}+\phi_{m,t}-\Phi_{m,t-1}-\tau)+\gamma_E P_T D_{m,t}\big]+\gamma_E s_{m,t}C_S+\gamma_A(\Phi_{m,t-1}+\tau) \nonumber\\
&=C_{m,t,1}u_{m,t}+C_{m,t,2}, \tag{24}
\end{align}

where D_{m,t} and φ_{m,t} are short for D_{m,t}(u_{m,t}) and φ_{m,t}(s_{m,t}), respectively, C_{m,t,1} = γ_A(D_{m,t}(1) + φ_{m,t}(s_{m,t}) − Φ_{m,t−1} − τ) + γ_E P_T D_{m,t}(1), and C_{m,t,2} = γ_E s_{m,t}C_S + γ_A(Φ_{m,t−1} + τ), with φ_{m,t}(s_{m,t}) = s_{m,t} max{0, δ(t) − Δ_{m,t}} + (1 − s_{m,t}) min{φ_{m,t−1} + τ, φ_max}.

Given the sampling policy s_t, the device selection problem at time t can be written as
\begin{align}
\min_{u_{m,t}}\quad & \sum_{m=1}^{M}\big(C_{m,t,1}u_{m,t}+C_{m,t,2}\big) \tag{25}\\
\text{s.t.}\quad & \sum_{m\in\mathcal{M}} u_{m,t} \leq I, \tag{25a}\\
& u_{m,t}\in\{0,1\},\ \forall m\in\mathcal{M}. \tag{25b}
\end{align}

Denote the set M_1 = {m ∈ M | C_{m,t,1} < 0}. If |M_1| ≤ I, we have
\begin{align}
u^{*}_{m,t}=\begin{cases}1, & \text{if } m\in\mathcal{M}_1,\\ 0, & \text{otherwise}.\end{cases} \tag{26}
\end{align}
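For concreteness, the selection rule implied by (24)–(26) can be implemented as a simple per-slot computation at the BS. The following Python sketch is illustrative only: variable names such as Phi_prev, phi, and D are our own shorthand for Φ_{m,t−1}, φ_{m,t}(s_{m,t}), and D_{m,t}(1), and the greedy handling of the case |M_1| > I (not covered by (26)) is an assumption rather than part of the proof.

```python
import numpy as np

def select_devices(Phi_prev, phi, D, gamma_A, gamma_E, P_T, tau, I):
    """Per-slot device selection following (24)-(26).

    Phi_prev : Phi_{m,t-1}, AoI of each device's status at the BS (array, shape (M,))
    phi      : phi_{m,t}(s_{m,t}), AoI of the sampled status at each device
    D        : D_{m,t}(1), expected transmission delay if device m is selected
    I        : number of available RBs
    """
    # Coefficient of u_{m,t} in (24); C_{m,t,2} does not depend on u_{m,t}
    # and therefore does not affect the selection.
    C1 = gamma_A * (D + phi - Phi_prev - tau) + gamma_E * P_T * D

    u = np.zeros(len(C1), dtype=int)
    M1 = np.flatnonzero(C1 < 0)              # the set M_1 in the proof
    if len(M1) <= I:
        u[M1] = 1                            # rule (26): serve every device in M_1
    else:
        u[M1[np.argsort(C1[M1])[:I]]] = 1    # assumption: keep the I most negative C1
    return u
```

Because the objective in (25) is linear in u_{m,t}, selecting exactly the devices with negative coefficients (subject to the RB budget) minimizes the per-slot cost.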

B. Proof of Lemma 1

In order to prove Lemma 1, we first need to build the relationship between the Bellman operator HQ of a traditional Q network over globally observable information and the Bellman operator HQ_tot of QMIX over partially observable information. As shown in [22], the Bellman operator of a traditional Q network over globally observable information is defined as
\[
(HQ)(o_t,a_t)=\sum_{o'_t}P_{a_t}(o_t,o'_t)\Big[R(o_t,a_t)+\gamma\max_{a'_t\in\mathcal{A}}Q^{*}(o'_t,a'_t)\Big].
\]
Using Definition 1, the Bellman operator of the QMIX method can be expressed as

\begin{align}
Q^{*}_{\mathrm{tot}}(o_t,a_t)&=Q^{*}(o_t,a_t)-\varepsilon(o_t,a_t)\nonumber\\
&=\sum_{o'_t}P_{a_t}(o_t,o'_t)\Big[R(o_t,a_t)+\gamma\max_{a'_t\in\mathcal{A}}Q^{*}(o'_t,a'_t)\Big]-\varepsilon(o_t,a_t)\nonumber\\
&=\sum_{o'_t}P_{a_t}(o_t,o'_t)\Big[R(o_t,a_t)+\gamma\Big(f\big(\max_{a'_{1,t}\in\mathcal{A}}Q^{*}_{1}(o'_{1,t},a'_{1,t}),\ldots,\max_{a'_{M,t}\in\mathcal{A}}Q^{*}_{M}(o'_{M,t},a'_{M,t})\big)+\varepsilon(o'_t,a'_t)\Big)-\varepsilon(o_t,a_t)\Big], \tag{27}
\end{align}
where the last equality follows from the non-negativity of the mixing network weights and the availability of Q_m(o_{m,t}, s_{m,t}) of each device m at each time slot (i.e., u_{m,t} = 1). Thus, the Bellman operator of QMIX is defined for a generic function Q_tot : O_1 × ⋯ × O_M × A_1 × ⋯ × A_M → R as

\begin{align}
(HQ_{\mathrm{tot}})(o_t,a_t)=\sum_{o'_t}P_{a_t}(o_t,o'_t)\Big[&R(o_t,a_t)-\varepsilon(o_t,a_t)\nonumber\\
&+\gamma\Big(f\big(\max_{a'_{1,t}\in\mathcal{A}}Q^{*}_{1}(o'_{1,t},a'_{1,t}),\ldots,\max_{a'_{M,t}\in\mathcal{A}}Q^{*}_{M}(o'_{M,t},a'_{M,t})\big)+\varepsilon(o'_t,a'_t)\Big)\Big]. \tag{28}
\end{align}
Next, we prove that HQ_tot is a contraction in the sup-norm. Based on (28), we have

\begin{align}
\|HQ^{1}_{\mathrm{tot}}-HQ^{2}_{\mathrm{tot}}\|_{\infty}
&=\max_{o_t,a_t}\Big|\sum_{o'_t}P_{a_t}(o_t,o'_t)\,\gamma\Big(\max_{a'_t\in\mathcal{A}}Q^{1}_{\mathrm{tot}}(o'_t,a'_t)-\max_{a'_t\in\mathcal{A}}Q^{2}_{\mathrm{tot}}(o'_t,a'_t)\Big)\Big|\nonumber\\
&\leq\max_{o_t,a_t}\gamma\sum_{o'_t}P_{a_t}(o_t,o'_t)\max_{o'_t,a'_t}\big|Q^{1}_{\mathrm{tot}}(o'_t,a'_t)-Q^{2}_{\mathrm{tot}}(o'_t,a'_t)\big|\nonumber\\
&=\max_{o_t,a_t}\gamma\sum_{o'_t}P_{a_t}(o_t,o'_t)\,\|Q^{1}_{\mathrm{tot}}-Q^{2}_{\mathrm{tot}}\|_{\infty}\nonumber\\
&=\gamma\|Q^{1}_{\mathrm{tot}}-Q^{2}_{\mathrm{tot}}\|_{\infty}. \tag{29}
\end{align}
This completes the proof.
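The step from the joint maximization in (27) to the per-device maxima relies only on the monotonicity of the mixing function f, which QMIX enforces by generating non-negative mixing weights with a hypernetwork [24]. A minimal sketch of one such monotonic mixing step is given below; it assumes a single linear mixing layer with illustrative names (W_hyper, b_hyper) and is not the exact architecture used in the simulations.

```python
import numpy as np

def mix(q_values, global_state, W_hyper, b_hyper):
    """Monotonic mixing: Q_tot = f(Q_1, ..., Q_M), conditioned on the global state.

    q_values     : shape (M,), per-device values Q_m(o_{m,t}, a_{m,t})
    global_state : feature vector observed at the BS
    W_hyper, b_hyper : parameters of a (hypothetical) linear hypernetwork that
                       maps the global state to the mixing weights
    """
    # The absolute value keeps every mixing weight non-negative, so
    # dQ_tot / dQ_m >= 0 for all m -- the property invoked in Lemma 1.
    w = np.abs(W_hyper @ global_state + b_hyper)    # shape (M,)
    return float(w @ q_values)
```

With non-negative weights, maximizing each Q_m over its own action also maximizes Q_tot, which is why the joint maximum in (27) can be replaced by the per-device maxima.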

C. Proof of Theorem 2

Using the definition of HQ_tot in Lemma 1, the update rule of the QMIX method that introduces ε(o_t,a_t) can be written as
\begin{align}
Q^{i+1}_{\mathrm{tot}}(o_t,a_t)\leftarrow(1-\alpha)Q^{i}_{\mathrm{tot}}(o_t,a_t)+\alpha\Big(R(o_t,a_t)+\gamma\max_{a'_t\in\mathcal{A}}\big(Q^{i}_{\mathrm{tot}}(o'_t,a'_t)+\varepsilon(o'_t,a'_t)\big)-\varepsilon(o_t,a_t)\Big). \tag{30}
\end{align}
Subtracting Q^*_tot(o_t,a_t) from both sides of (30) and letting Λ^i = Q^*_tot(o_t,a_t) − Q^i_tot(o_t,a_t), we have

\begin{align}
\Lambda^{i+1}\leftarrow(1-\alpha)\Lambda^{i}+\alpha F^{i}(o_t,a_t), \tag{31}
\end{align}
where F^i(o_t,a_t) = R(o_t,a_t) + γ max_{a'_t∈A}(Q^i_tot(o'_t,a'_t) + ε(o'_t,a'_t)) − ε(o_t,a_t) − Q^*_tot(o_t,a_t). Based on [28], the random process Λ^i converges to 0 under the following conditions:

a) E[F^i(o_t,a_t) | F^i] ≤ γ‖Λ^i‖_∞;

b) var[F^i(o_t,a_t) | F^i] ≤ C(1 + ‖Λ^i‖²_∞).

Next, we prove that the random process Λ^i in the QMIX method satisfies a) and b), respectively.

For a), we have

\begin{align}
\mathbb{E}\big[F^{i}(o_t,a_t)\,\big|\,\mathcal{F}^{i}\big]
&=\sum_{o'_t}P_{a_t}(o_t,o'_t)\Big[R(o_t,a_t)+\gamma\max_{a'_t\in\mathcal{A}}\big(Q^{i}_{\mathrm{tot}}(o'_t,a'_t)+\varepsilon(o'_t,a'_t)\big)-\varepsilon(o_t,a_t)-Q^{*}_{\mathrm{tot}}(o_t,a_t)\Big]\nonumber\\
&=(HQ^{i}_{\mathrm{tot}})(o_t,a_t)-(HQ^{*}_{\mathrm{tot}})(o_t,a_t)\nonumber\\
&\leq\gamma\|Q^{i}_{\mathrm{tot}}-Q^{*}_{\mathrm{tot}}\|_{\infty}=\gamma\|\Lambda^{i}\|_{\infty}, \tag{32}
\end{align}
where the second equality stems from the fact that Q^*_tot = HQ^*_tot and the inequality follows from Lemma 1.

For b), we have
\begin{align}
\mathrm{var}\big[F^{i}(o_t,a_t)\,\big|\,\mathcal{F}^{i}\big]
&=\mathbb{E}\Big[\Big(R(o_t,a_t)+\gamma\max_{a'_t\in\mathcal{A}}\big(Q^{i}_{\mathrm{tot}}(o'_t,a'_t)+\varepsilon(o'_t,a'_t)\big)-\varepsilon(o_t,a_t)-Q^{*}_{\mathrm{tot}}(o_t,a_t)\nonumber\\
&\qquad\quad-\big((HQ^{i}_{\mathrm{tot}})(o_t,a_t)-Q^{*}_{\mathrm{tot}}(o_t,a_t)\big)\Big)^{2}\Big]\nonumber\\
&=\mathbb{E}\Big[\Big(R(o_t,a_t)+\gamma\max_{a'_t\in\mathcal{A}}\big(Q^{i}_{\mathrm{tot}}(o'_t,a'_t)+\varepsilon(o'_t,a'_t)\big)-\varepsilon(o_t,a_t)-(HQ^{i}_{\mathrm{tot}})(o_t,a_t)\Big)^{2}\Big]\nonumber\\
&=\mathrm{var}\Big[R(o_t,a_t)+\gamma\max_{a'_t\in\mathcal{A}}\big(Q^{i}_{\mathrm{tot}}(o'_t,a'_t)+\varepsilon(o'_t,a'_t)\big)-\varepsilon(o_t,a_t)\,\Big|\,\mathcal{F}^{i}\Big]\nonumber\\
&\leq C\big(1+\|\Lambda^{i}\|^{2}_{\infty}\big), \tag{33}
\end{align}
where C is a constant and the inequality stems from the fact that R(o_t,a_t), ε(o_t,a_t), and ε(o'_t,a'_t) are bounded. From (32) and (33), we can see that the proposed QMIX algorithm satisfies conditions a) and b); hence, Q_tot(o_t,a_t) converges to Q^*_tot(o_t,a_t) as Λ^i converges to 0. This completes the proof.
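As an informal sanity check on the argument above, the update (31) can be simulated for a scalar error term: any F^i whose conditional mean is bounded by γ|Λ^i| and whose noise is bounded drives Λ^i toward 0. The toy Python snippet below is purely illustrative (the constants and the noise model are assumptions, not part of the system model), and a decaying step size would be needed for exact convergence.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, alpha = 0.9, 0.1
Lam = 10.0                                  # Lambda^0 = Q*_tot - Q^0_tot (scalar toy)
for i in range(5000):
    # F^i satisfies condition a): |E[F^i | F^i]| = gamma * |Lam|; the bounded
    # uniform noise keeps the conditional variance finite, as in condition b).
    F = gamma * Lam + rng.uniform(-0.1, 0.1)
    Lam = (1 - alpha) * Lam + alpha * F     # update rule (31)
print(abs(Lam))                             # small: Q_tot has approached Q*_tot
```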

REFERENCES

[1] B. Zhou and W. Saad, “Joint status sampling and updating for minimizing age of information in the Internet of Things,” IEEE Transactions on Communications, vol. 67, no. 11, pp. 7468–7482, Mar. 2019.

[2] T. Park, W. Saad, and B. Zhou, “Centralized and distributed age of information minimization with nonlinear aging functions in the Internet of Things,” IEEE Internet of Things Journal, vol. 8, no. 10, pp. 8437–8455, May 2021.

[3] C. Xu, H. H. Yang, X. Wang, and T. Q. S. Quek, “Optimizing information freshness in computing-enabled IoT networks,” IEEE Internet of Things Journal, vol. 7, no. 2, pp. 971–985, Feb. 2020.

[4] H. Zhang, Y. Kang, L. Song, Z. Han, and H. Vincent Poor, “Age of information minimization for grant-free non-orthogonal massive access using mean-field games,” IEEE Transactions on Communications, to appear, 2021.

[5] S. Hao and L. Duan, “Regulating competition in age of information under network externalities,” IEEE Journal on Selected Areas in Communications, vol. 38, no. 4, pp. 697–710, Apr. 2020.

[6] X. Zheng, S. Zhou, Z. Jiang, and Z. Niu, “Closed-form analysis of non-linear age of information in status updates with an energy harvesting transmitter,” IEEE Transactions on Wireless Communications, vol. 18, no. 8, pp. 4129–4142, Aug. 2019.

[7] B. Li, Q. Wang, H. Chen, Y. Zhou, and Y. Li, “Optimizing information freshness for cooperative IoT systems with stochastic arrivals,” IEEE Internet of Things Journal, to appear, 2021.

[8] M. A. Abd-Elmagid, H. S. Dhillon, and N. Pappas, “A reinforcement learning framework for optimizing age of information in RF-powered communication systems,” IEEE Transactions on Communications, to appear, 2020.

[9] H. B. Beytur and E. Uysal, “Age minimization of multiple flows using reinforcement learning,” in Proc. International Conference on Computing, Networking and Communications, HI, USA, Feb. 2019.

[10] M. Ma and V. W. S. Wong, “Age of information driven cache content update scheduling for dynamic contents in heterogeneous networks,” IEEE Transactions on Wireless Communications, to appear, 2020.

[11] A. Elgabli, H. Khan, M. Krouka, and M. Bennis, “Reinforcement learning based scheduling algorithm for optimizing age of information in ultra reliable low latency networks,” in Proc. IEEE Symposium on Computers and Communications, Barcelona, Spain, Jun. 2019.

[12] A. Ferdowsi, M. A. Abd-Elmagid, W. Saad, and H. S. Dhillon, “Neural combinatorial deep reinforcement learning for age-optimal joint trajectory and scheduling design in UAV-assisted networks,” IEEE Journal on Selected Areas in Communications, to appear, 2021.

[13] M. Li, C. Chen, H. Wu, X. Guan, and S. Shen, “Age-of-information aware scheduling for edge-assisted industrial wireless networks,” IEEE Transactions on Industrial Informatics, to appear, 2020.

[14] J. Hu, H. Zhang, L. Song, R. Schober, and H. V. Poor, “Cooperative Internet of UAVs: Distributed trajectory design by multi-agent deep reinforcement learning,” IEEE Transactions on Communications, vol. 68, no. 11, pp. 6807–6821, Aug. 2020.

[15] J. Hu, H. Zhang, K. Bian, L. Song, and Z. Han, “Distributed trajectory design for cooperative Internet of UAVs using deep reinforcement learning,” in Proc. IEEE Global Communications Conference, HI, USA, Dec. 2019.

[16] P. Zhang, Y. Yuan, H. Yang, and H. Liu, “Near-Nash equilibrium control strategy for discrete-time nonlinear systems with round-robin protocol,” IEEE Transactions on Neural Networks and Learning Systems, vol. 30, no. 8, pp. 2478–2492, Aug. 2019.

[17] M. Xiao, “A direct method for the construction of nonlinear discrete-time observer with linearizable error dynamics,” IEEE Transactions on Automatic Control, vol. 51, no. 1, pp. 128–135, 2006.

[18] Z. Wei, B. Li, and W. Guo, “Optimal sampling for dynamic complex networks with graph-bandlimited initialization,” IEEE Access, vol. 7, pp. 150294–150305, Oct. 2019.

[19] S. Wang, M. Chen, X. Liu, C. Yin, S. Cui, and H. V. Poor, “A machine learning approach for task and resource allocation in mobile edge computing based networks,” IEEE Internet of Things Journal, vol. 8, no. 3, pp. 1358–1372, Feb. 2021.

[20] Y. Hu, M. Chen, W. Saad, H. V. Poor, and S. Cui, “Distributed multi-agent meta learning for trajectory design in wireless drone networks,” Available Online: https://arxiv.org/abs/2012.03158, Dec. 2020.

[21] Y. Zhou, F. Zhou, Y. Wu, R. Q. Hu, and Y. Wang, “Subchannel assignment based on Q-learning in wideband cognitive radio networks,” IEEE Transactions on Vehicular Technology, vol. 69, no. 1, pp. 1168–1172, Jan. 2020.

[22] J. Fan, Z. Wang, Y. Xie, and Z. Wang, “A theoretical analysis of deep Q-learning,” Available Online: https://arxiv.org/abs/1901.00137v3, Feb. 2020.

[23] Y. Wang, M. Chen, Z. Yang, T. Luo, and W. Saad, “Deep learning for optimal deployment of UAVs with visible light communications,” IEEE Transactions on Wireless Communications, vol. 19, no. 11, pp. 7049–7063, Nov. 2020.

[24] D. Ha, A. Dai, and Q. V. Le, “Hypernetworks,” in Proc. International Conference on Learning Representations, Toulon, France, Apr. 2017.

[25] P. Yang, Y. Xiao, M. Xiao, Y. Guan, S. Li, and W. Xiang, “Adaptive spatial modulation MIMO based on machine learning,” IEEE Journal on Selected Areas in Communications, vol. 37, no. 9, pp. 2117–2131, Jul. 2019.

[26] B. Guo, B. Li, S. Zhang, and H. Huang, “Assessing Beijing's PM 2.5 pollution: Severity, weather impact, APEC and winter heating,” Available Online: https://archive.ics.uci.edu/ml/datasets/Beijing+PM2.5+Data/.

[27] M. Chen, Z. Yang, W. Saad, C. Yin, H. V. Poor, and S. Cui, “A joint learning and communications framework for federated learning over wireless networks,” IEEE Transactions on Wireless Communications, vol. 20, no. 1, pp. 269–283, Jan. 2021.

[28] T. Jaakkola, M. I. Jordan, and S. P. Singh, “On the convergence of stochastic iterative dynamic programming algorithms,” Neural Computation, vol. 6, no. 6, pp. 1185–1201, Nov. 1994.