Towards Self-Driving Processes: A Deep Reinforcement Learning Approach to Control · 2021. 1....

Towards Self-Driving Processes: A Deep Reinforcement Learning Approach to Control

Steven Spielberg^a, Aditya Tulsyan^a, Nathan P. Lawrence^b, Philip D. Loewen^b, R. Bhushan Gopaluni^a,∗

^a Department of Chemical and Biological Engineering, University of British Columbia, Vancouver, BC V6T 1Z3, Canada.
^b Department of Mathematics, University of British Columbia, Vancouver, BC V6T 1Z2, Canada.

Abstract

Advanced model-based controllers are well established in process industries. However, such controllers require

regular maintenance to maintain acceptable performance. It is a common practice to monitor controller per-

formance continuously and to initiate a remedial model re-identification procedure in the event of performance

degradation. Such procedures are typically complicated and resource-intensive, and they often cause costly inter-

ruptions to normal operations. In this paper, we exploit recent developments in reinforcement learning and deep

learning to develop a novel adaptive, model-free controller for general discrete-time processes. The DRL controller

we propose is a data-based controller that learns the control policy in real time by merely interacting with the

process. The effectiveness and benefits of the DRL controller are demonstrated through many simulations.

Keywords: process control; model-free learning; reinforcement learning; deep learning; actor-critic networks

Introduction

Industrial process control is a large and diverse field; its broad range of applications calls for a correspondingly

wide range of controllers—including single and multi-loop PID controllers, model predictive controllers (MPCs),

and a variety of nonlinear controllers. Many deployed controllers achieve robustness at the expense of performance.

The overall performance of a controlled process depends on the characteristics of the process itself, the controller’s

overall architecture, and the tuning parameters that are employed. Even if a controller is well-tuned at the time

of installation, drift in process characteristics or deliberate set-point changes can cause performance to deteriorate

over time1,2,3. Maintaining system performance over the long-term is essential. Unfortunately, it is typically also

both complicated and expensive.

Most modern industrial controllers are model-based, so good performance calls for a high-quality process

model. It is standard practice to continuously monitor system performance and initiate a remedial model re-

identification exercise in the event of performance degradation. Model re-identification can require two weeks or

more4, and typically involves the injection of external excitations5, which introduce an expensive interruption

∗Corresponding author. Tel: +1 604 827 5668. Email address: [email protected] (R. Bhushan Gopaluni)

Preprint submitted to AIChE Journal June 1, 2019

to the normal operation of the process. Re-identification is particularly complicated for multi-variable processes,

which require a model for every input-output combination.

Most classical controllers in industry are linear and non-adaptive. While extensive work has been done in

nonlinear adaptive control6,7, it has not yet established a significant position in the process industry, beyond

several niche applications8,9. The difficulty of learning (or estimating) reliable multi-variable models in an online

fashion is partly responsible for this situation. Other contributing factors include the presence of hidden states,

process dimensionality, and in some cases computational complexity.

Given the limitations of existing industrial controllers, we seek a new design that can learn the control policy

for discrete-time nonlinear stochastic processes in real time, in a model-free and adaptive environment. This paper

continues our recent investigations10 on the same topic. The idea of RL has been around for several decades;

however, its application to process control has been somewhat recent. Next, we provide a short introduction to

RL and its methods and also highlight existing RL-based approaches for control problems.

Reinforcement Learning and Process Control

Reinforcement Learning (RL) is an active area of research in artificial intelligence. It originated in computer sci-

ence and operations research to solve complex sequential decision-making problems11,12,13,14. The RL framework

comprises an agent (e.g., a controller) interacting with a stochastic environment (e.g., a plant) modelled as a

Markov decision process (MDP). The goal in an RL problem is to find a policy (or feedback controller) that is

optimal in a certain sense15.

Over the last three decades, several methods, including dynamic programming (DP), Monte Carlo (MC) and

temporal-difference learning (TD) have been proposed to solve the RL problem11. Most of these compute the

optimal policy using policy iteration. This is an iterative approach in which every step involves both policy

estimation and policy improvement. The policy estimation step aims at making the value function consistent

with the current policy; the policy improvement step makes the policy greedy with respect to the current estimate

of the value function. Alternating between the two steps produces a sequence of value function estimates and sub-

optimal policies which, in the limit, converge to the optimal value function and the optimal policy, respectively11.
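The alternation between policy estimation and policy improvement described above can be sketched for a small finite MDP. This is a minimal illustration, assuming a known transition array `P[a, s, s2]` and reward table `R[s, a]` (that is, the perfect-model setting of DP); the function name `policy_iteration` and the array layout are hypothetical choices, not notation from the paper.

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9):
    """Policy iteration for a finite MDP with a known model.

    P[a, s, s2] is the probability of moving from s to s2 under action a;
    R[s, a] is the expected immediate reward.
    """
    n_actions, n_states, _ = P.shape
    states = np.arange(n_states)
    policy = np.zeros(n_states, dtype=int)
    while True:
        # Policy estimation: solve (I - gamma * P_pi) V = R_pi exactly.
        P_pi = P[policy, states]              # shape (S, S)
        R_pi = R[states, policy]              # shape (S,)
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
        # Policy improvement: act greedily with respect to the current V.
        Q = R + gamma * np.einsum("ast,t->sa", P, V)
        new_policy = Q.argmax(axis=1)
        if np.array_equal(new_policy, policy):
            return policy, V
        policy = new_policy
```

For a two-state example in which action 0 stays put, action 1 switches state, and only "stay in state 1" is rewarded, the sketch returns the policy "switch in state 0, stay in state 1".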

The choice of a solution strategy for an RL problem is primarily driven by the assumptions on the environ-

ment and the agent. For example, under the perfect model assumption for the environment, classical dynamic

programming methods (for Markov Decision Problems with finite state and action spaces) provide a closed-form

solution to the optimal value function, and are known to converge to the optimal policy in polynomial time13,16.

Despite their strong convergence properties, however, classical DP methods have limited practical applications be-

cause of their stringent requirement for a perfect model of the environment, which is seldom available in practical

problems.

Monte Carlo algorithms belong to a class of approximate RL methods that can be used to estimate the value

function using experiences (i.e., sample sequences of states, actions, and rewards accumulated through the agent’s

interaction with an environment). MC algorithms offer several advantages over DP. First, MC methods allow an

agent to learn the optimal behaviour directly by interacting with the environment, thereby eliminating the need

for an exact model. Second, MC methods can be focused, to estimate value functions for a small subset of the

states of interest rather than evaluating them for the entire state space, as with DP methods. This significantly

reduces the computational burden, since value function estimates for less relevant states need not be updated.

Third, MC methods may be less sensitive to the violations of the Markov property of the system. Despite the

advantages mentioned above, MC methods are difficult to implement in real time, as the value function estimates

can only be updated at the end of an experiment. Further, MC methods are also known to exhibit slow

convergence as they do not bootstrap, i.e., they do not update their value function from other value estimates11.

Finally, the efficacy of MC methods in RL remains unsettled and is a subject of ongoing research11.

TD learning is another class of approximate methods for solving RL problems that combine ideas from DP

and MC. Like MC algorithms, TD methods can learn directly from raw experiences without requiring a model;

and like DP, TD methods update the value function in real time without having to wait until the end of the

experiment. For a detailed exposition on RL solutions, the reader is referred to11 and the references therein.

Reinforcement Learning has achieved remarkable success in robotics17,18, computer games19,20, online adver-

tising21, and board games22,23; however, its adaptation to process control has been limited (see Badgwell et al. 24

for a recent survey of RL methods in process control), even though many optimal scheduling and control problems

can be formulated as MDPs25. This is primarily due to the lack of efficient RL algorithms to deal with infinite MDPs

(i.e., MDPs with continuous state and action spaces) that define most modern control systems. While existing

RL methods apply to infinite MDPs, exact solutions are possible only in special cases, such as the linear quadratic

(Gaussian) control problem, where DP provides a closed-form solution26,27. It is possible to discretize infinite

MDPs and use DP to estimate the value function; however, this leads to exponential growth in the computational

complexity with respect to the dimensions of the state and action spaces, which is referred to as the curse of dimensionality13. The

computational and storage requirements for discretization methods applied to most problems of practical interest

in process control remain unwieldy even with today's computing hardware28.

The first successful implementation of RL in process control appeared in the series of papers published in the

early 2000s25,28,29,30,31,32, where the authors proposed approximate dynamic programming (ADP) for optimal

control of discrete-time nonlinear systems. The idea of ADP is rooted in the formalism of DP but uses simulations

and function approximators (FAs) to alleviate the curse of dimensionality. Other RL methods based on heuristic

dynamic programming (HDP)33, direct HDP34, dual heuristic programming35, and globalized DHP36 have also

been proposed for optimal control of discrete-time nonlinear systems. RL methods have also been proposed for

optimal control of continuous-time nonlinear systems37,38,39. However, unlike discrete-time systems, controlling

continuous-time systems with RL has proven to be considerably more difficult and fewer results are available26.

While the contributions mentioned above establish the feasibility and adaptability of RL in controlling discrete-

time and continuous-time nonlinear processes, most of these methods assume complete or partial access to process

models25,28,29,30,31,38,39. This limits existing RL methods to processes for which high-accuracy models are either

available or can be derived through system identification.

Recently, several data-based approaches have been proposed to address the limitations of model-based RL in

control. In Lee and Lee 32 , Mu et al. 40 , a data-based learning algorithm was proposed to derive an improved

control policy for discrete-time nonlinear systems using ADP with an identified process model, as opposed to an

exact model. Similarly, Lee and Lee 32 proposed a Q-learning algorithm to learn an improved control policy in

a model-free manner using only input-output data. While these methods remove the requirement for having an

exact model (as in RL), they still present several issues. For example, the learning method proposed in Lee and

Lee 32 , Mu et al. 40 is still based on ADP, so its performance relies on the accuracy of the identified model. For

complex, nonlinear, stochastic systems, identifying a reliable process model may be nontrivial as it often requires

running multiple carefully designed experiments. Similarly, the policy derived from Q-learning in Lee and Lee 32

may converge to a sub-optimal policy as it avoids adequate exploration of the state and action spaces. Further,

calculating the policy with Q-learning for infinite MDPs requires solving a non-convex optimization problem over

the continuous action space at each sampling time, which may render it unsuitable for online deployment for

processes with small time constants or modest computational resources. Note that data-based RL methods have

also been proposed for continuous-time nonlinear systems41,42. Most of these data-based methods approximate

the solution to the Hamilton-Jacobi-Bellman (HJB) equation derived for a class of continuous-time affine nonlinear

systems using policy iteration. For more information on RL-based optimal control of continuous-time nonlinear

systems, the reader is referred to Luo et al. 41 , Wang et al. 42 , Tang and Daoutidis 43 and the references therein.

The promise of RL to deliver a real-time, self-learning controller in a model-free and adaptive environment

has long motivated the process systems community to explore novel approaches to apply RL in process control

applications. While the many studies from the last two decades have provided significant insights to help

connect the RL paradigm with process control, they also highlight the nontrivial nature of this connection and

the limitations of existing RL methods in control applications. Motivated by recent developments in the area of

deep reinforcement learning (DRL), we revisit the problem of RL-based process control and explore the feasibility

of using DRL to bring these goals one step closer.

Deep Reinforcement Learning (DRL)

The recent resurgence of interest in RL results from the successful combination of RL with deep learning that

allows for effective generalization of RL to MDPs with continuous state spaces. The Deep-Q-Network (DQN)

proposed recently in Mnih et al. 19 combines deep learning for sensory processing44 with RL to achieve human-

level performance on many Atari video games. Using unprocessed pixels as inputs, simply through interactions,

the DQN can learn in real time the optimal strategy to play Atari. This was made possible using a deep neural

network function approximator (FA) to estimate the action-value function over the continuous state space, which

was then maximized over the action space to find the optimal policy. Before DQN, learning the action-value

function using FAs was widely considered difficult and unstable. Two innovations account for this latest gain

in stability and robustness: (a) the network is trained off-policy with samples from a replay buffer to minimize

correlations between samples, and (b) the network is trained with a target Q-network to give consistent targets

during temporal difference backups.
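The two stabilizing ideas above can be illustrated with a short sketch: a replay buffer that is sampled uniformly, and TD targets computed from a frozen copy of the Q-function. This is an assumption-laden illustration rather than the DQN implementation: `ReplayBuffer`, `td_targets`, and the callable `target_q` (assumed to map a batch of states to an array of shape `(batch, n_actions)`) are hypothetical names.

```python
import random
from collections import deque

import numpy as np

class ReplayBuffer:
    """Fixed-size store of (s, a, r, s_next) tuples, sampled uniformly."""

    def __init__(self, capacity, seed=0):
        self.buffer = deque(maxlen=capacity)   # oldest tuples are dropped first
        self.rng = random.Random(seed)

    def add(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size):
        # Uniform sampling reduces the correlation between consecutive steps.
        batch = self.rng.sample(list(self.buffer), batch_size)
        s, a, r, s_next = map(np.array, zip(*batch))
        return s, a, r, s_next

def td_targets(r, s_next, target_q, gamma=0.99):
    """TD targets from a frozen (target) Q-function; target_q is hypothetical."""
    return r + gamma * target_q(s_next).max(axis=1)
```

In such a scheme the target function would be refreshed from the learned Q-network only occasionally, so that the targets stay fixed between refreshes.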

In this paper, we propose an off-policy actor-critic algorithm, referred to as a DRL controller, for control of

discrete-time nonlinear processes. The proposed DRL controller is a model-free controller designed based on TD

learning. As a data-based controller, the DRL controller uses two independent deep neural networks to generalize

the actor and critic to continuous state and action spaces. The DRL controller is based on the deterministic policy

gradient (DPG) algorithm proposed by Silver et al. 45 that combines an actor-critic method with insights from

DQN. The DRL controller uses ideas similar to those in Lillicrap et al. 18 , modified to make learning suitable for

process control applications. Several simulation examples of different complexities are presented to demonstrate

the efficacy of the DRL controller in set-point tracking problems.

The rest of the paper is organized as follows: in Section b, we introduce the basics of MDP and derive a

control policy based on value functions. Motivated by the intractability of optimal policy calculations for infinite

MDPs, we introduce Q-learning in Section b for solving the RL problem over continuous state spaces. A policy

gradient algorithm is discussed in Section b to extend the RL solution to continuous action space. Combining

the developments in Sections b and b, a novel actor-critic algorithm is discussed in Section b. In Section b, a

DRL framework is proposed for data-based control of discrete-time nonlinear processes. The efficacy of a DRL

controller is demonstrated on several examples in Section b. Finally, Section b compares a DRL controller to an

MPC.

This paper follows a tutorial-style presentation to assist readers unfamiliar with the theory of RL. The material

is systematically introduced to highlight the challenges with existing RL solutions in process control applications

and to motivate the development of the proposed DRL controller. The background material presented here is

only introductory, and readers are encouraged to refer to the cited references for a detailed exposition on these

topics.

The Reinforcement Learning (RL) Problem

The RL framework consists of a learning agent (e.g., a controller) interacting with a stochastic environment (e.g.,

a plant or process), denoted by E , in discrete time steps. The objective in an RL problem is to identify a policy

(or control actions) to maximize the expected cumulative reward the agent receives in the long run11,15. The

RL problem is a sequential decision-making problem, in which the agent incrementally learns how to optimally

interact with the environment by maximizing the expected reward it receives. Intuitively, the agent-environment

interaction is to be understood as follows. Given a state space S and an action space A, the agent at time

step t ∈ N observes some representation of the environment’s state st ∈ S and on that basis selects an action

at ∈ A. One time step later, in part as a consequence of its action, the agent finds itself in a new state st+1 ∈ S

and receives a scalar reward rt ∈ R from the environment indicating how well the agent performed at t ∈ N.

This procedure repeats for all t ∈ N, as in the case of a continuing task or until the end of an episode. The

agent-environment interactions are illustrated in Figure 1.

[Figure 1 diagram. Blocks: Agent, Environment. Signals: action at, reward rt, state st, st+1, rt+1.]

Figure 1: A schematic of the agent-environment interactions in a standard RL problem.
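The interaction loop of Figure 1 can be sketched in a few lines of Python. This is only an illustration: the `FirstOrderPlant` class, its dynamics, the quadratic set-point reward, and the `run_episode` helper are hypothetical and are not taken from the paper.

```python
import random

class FirstOrderPlant:
    """Hypothetical scalar plant: s_next = 0.9*s + 0.1*a + noise."""

    def __init__(self, setpoint=1.0, noise_std=0.01, seed=0):
        self.setpoint = setpoint
        self.noise_std = noise_std
        self.rng = random.Random(seed)
        self.state = 0.0

    def reset(self):
        self.state = 0.0
        return self.state

    def step(self, action):
        noise = self.rng.gauss(0.0, self.noise_std)
        self.state = 0.9 * self.state + 0.1 * action + noise
        # Reward penalizes the squared set-point tracking error.
        reward = -(self.state - self.setpoint) ** 2
        return self.state, reward

def run_episode(env, policy, n_steps):
    """Roll out the agent-environment loop; return a list of (s_t, a_t, r_t)."""
    history = []
    s = env.reset()
    for _ in range(n_steps):
        a = policy(s)              # agent selects a_t based on s_t
        s_next, r = env.step(a)    # environment returns s_{t+1} and r_t
        history.append((s, a, r))
        s = s_next
    return history
```

For example, `run_episode(FirstOrderPlant(), lambda s: 1.0 + 5.0 * (1.0 - s), 50)` rolls out a simple proportional policy; an RL agent would replace that fixed rule with a learned one.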

Markov Decision Process (MDP)

A Markov Decision Process (MDP) consists of the following: a state space S; an action space A; an initial state

distribution p(s1); and a transition distribution p(st+1|st, at)¹ satisfying the following Markov property

$$p(s_{t+1} \mid s_t, a_t, \ldots, s_1, a_1) = p(s_{t+1} \mid s_t, a_t), \tag{1}$$

¹ To simplify notation, we drop the random variable in the conditional density and write p(st+1|st, at) = p(st+1|St = st, At = at).

for any trajectory s1, a1, . . . , sT, aT generated in the state-action space S × A (for continuing tasks, T → ∞, while

for episodic tasks, T ∈ N is the terminal time); and a reward function rt : S × A → R. In an MDP, the transition

function (1) is the likelihood of observing a state st+1 ∈ S after the agent takes an action at ∈ A in state st ∈ S.

Thus ∫S p(s|st, at) ds = 1 for all (st, at) ∈ S × A.

A policy is used to describe the actions of an agent in the MDP. A stochastic policy is denoted by π : S → P(A),

where P(A) is the set of probability measures on A. Thus π(at|st) is the probability of taking action at when

in state st; we have ∫A π(a|st) da = 1 for each st ∈ S. This general formulation allows any deterministic policy

µ : S → A, through the definition

$$\pi(a \mid s_t) = \begin{cases} 1 & \text{if } a = \mu(s_t), \\ 0 & \text{otherwise.} \end{cases} \tag{2}$$

The agent uses π to sequentially interact with the environment to generate a sequence of states, actions and

rewards in S × A × R, denoted generically as h = (s1, a1, r1, . . . , sT , aT , rT ), where (H = h) ∼ pπ(h) is an

arbitrary history generated under policy π and distributed according to the probability density function (PDF)

pπ(·). Note that under the Markov assumption of the MDP, the PDF for the history can be decomposed as

follows:

$$p^{\pi}(h) = p(s_1) \prod_{t=1}^{T} p(s_{t+1} \mid s_t, a_t)\, \pi(a_t \mid s_t).$$

The total reward accumulated by the agent in a task from time t ∈ N onward is

$$R_t(h) = \sum_{k=t}^{\infty} \gamma^{k-t} r_t(s_k, a_k). \tag{3}$$

Here γ ∈ [0, 1] is a (user-specified) discount factor: a unit reward has the value 1 right now, γ after one time step,

and γ^τ after τ sampling intervals. If we specify γ = 0, only the immediate reward has any influence; if we let

γ → 1, future rewards are considered more strongly—informally, the controller becomes ‘farsighted.’ While (3)

is defined for continuing tasks, i.e., T →∞, the notation can be applied also for episodic tasks by introducing a

special absorbing terminal state that transitions only to itself and generates rewards of 0. These conventions are

used to simplify the notation and to express close parallels between episodic and continuing tasks. See Sutton

and Barto 11 .
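As a small worked illustration of the discounted return in (3), the sketch below evaluates it for a finite reward sequence, which corresponds to the episodic case with zero reward after termination. The function name `discounted_returns` is a hypothetical choice; the backward recursion R_t = r_t + γ R_{t+1} follows directly from the definition.

```python
def discounted_returns(rewards, gamma):
    """Return [R_1, ..., R_T] with R_t = sum_{k>=t} gamma**(k-t) * r_k."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Work backwards so each return reuses the one that follows it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns
```

With rewards `[1, 1, 1]` and γ = 0.5 this gives `[1.75, 1.5, 1.0]`; with γ = 0 each return equals the immediate reward, matching the "myopic" limit described above.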

Value Function and Optimal Policy

The state-value function V^π : S → R assigns a value to each state, according to the given policy π. In state st,

V^π(st) is the value of the future reward an agent is expected to receive by starting at st ∈ S and following policy

π thereafter. In detail,

$$V^{\pi}(s_t) = \mathbb{E}_{h \sim p^{\pi}(\cdot)}\left[R_t(h) \mid s_t\right]. \tag{4}$$

The closely related action-value function Q^π : S × A → R decouples the immediate action from the policy π,

assuming only that π is used for all subsequent steps:

$$Q^{\pi}(s_t, a_t) = \mathbb{E}_{h \sim p^{\pi}(\cdot)}\left[R_t(h) \mid s_t, a_t\right]. \tag{5}$$

The Markov property of the underlying dynamic process gives these two value functions a recursive structure,

illustrated below:

$$Q^{\pi}(s_t, a_t) = \mathbb{E}_{s_{t+1} \sim p(\cdot \mid s_t, a_t)}\left[r(s_t, a_t) + \gamma\, \pi(a_{t+1} \mid s_{t+1})\, Q^{\pi}(s_{t+1}, a_{t+1})\right]. \tag{6}$$

Solving an RL problem amounts to finding a policy π∗ that outperforms all other policies across all possible

scenarios. Identifying π∗ will yield an optimal Q-function, Q∗, such that

$$Q^{\ast}(s_t, a_t) = \max_{\pi} Q^{\pi}(s_t, a_t), \qquad (s_t, a_t) \in \mathcal{S} \times \mathcal{A}. \tag{7}$$

Conversely, knowing the function Q∗ is enough to recover an optimal policy by making a “greedy” choice of action:

$$\pi^{\ast}(a \mid s_t) = \begin{cases} 1, & \text{if } a = \arg\max_{a \in \mathcal{A}} Q^{\ast}(s_t, a), \\ 0, & \text{otherwise.} \end{cases} \tag{8}$$

Note that this policy is actually deterministic. While solving the Bellman equation in (6) for the Q-function

provides an approach to finding an optimal policy in (8), and thus solving the RL problem, this solution is

rarely useful in practice. This is because the solution relies on two key assumptions: (a) the dynamics of the

environment are accurately known, i.e., p(st+1|st, at) is exactly known for all (st, at) ∈ S × A; and (b) sufficient

computational resources are available to calculate Q^π(st, at) for all (st, at) ∈ S × A. These assumptions are

a major impediment to solving process control problems, in which complex process behaviour might not be

accurately known, or might change over time, and the state and action spaces may be continuous. For an RL

solution to be practical, one typically needs to settle for approximate solutions. In the next section, we introduce

Q-learning, which approximates the optimal Q-function (and thus the optimal policy) using the agent’s experiences

(or samples) as opposed to process knowledge. This class of approximate solutions to the RL problem is called

model-free RL.

Q-learning

Q-learning is one of the most important breakthroughs in RL46,47. The idea is to learn Q∗ directly, instead of

first learning Q^π and then computing Q∗ in (7). Q-learning constructs Q∗ through successive approximations.

Similar to (6), using the Bellman equation, Q∗ satisfies the identity

$$Q^{\ast}(s_t, a_t) = \mathbb{E}_{s_{t+1} \sim p(\cdot \mid s_t, a_t)}\left[r(s_t, a_t) + \gamma \max_{a' \in \mathcal{A}} Q^{\ast}(s_{t+1}, a')\right]. \tag{9}$$

There are several ways to use (9) to compute Q∗. The standard Q-iteration (QI) is a model-based method

that requires complete knowledge of the states, the transition function, and the reward function to evaluate

the expectation in (9). Alternatively, temporal difference (TD) learning is a model-free method that uses sampled

experiences, (st, at, rt, st+1), to approximate Q∗ 11. This is done as follows. First, the agent explores the

environment by following some stochastic behaviour policy, β : S → P(A), and receiving an experience tuple

(st, at, rt, st+1) at each time step. The generated tuple is then used to improve the current approximation of Q∗,

denoted Qi, as follows:

$$Q_{i+1}(s_t, a_t) \leftarrow Q_i(s_t, a_t) + \alpha \delta, \tag{10}$$

where α ∈ (0, 1] is the learning rate, and δ is the TD error, defined by

$$\delta = r(s_t, a_t) + \gamma \max_{a' \in \mathcal{A}} Q_i(s_{t+1}, a') - Q_i(s_t, a_t). \tag{11}$$

The conditional expectation in (9) falls away in TD learning in (10), since st+1 in (st, at, rt, st+1) is distributed

according to the target density, p(·|st, at). Making the greedy policy choice using Qi+1 instead of Q∗ (recall (8))

produces

$$\pi_{i+1}(s_t) = \arg\max_{a \in \mathcal{A}} Q_{i+1}(s_t, a), \tag{12}$$

where πi+1 is a greedy policy based on Qi+1. In the RL literature, (10) is referred to as policy evaluation and (12)

as policy improvement. Together, steps (10) and (12) are called Q-learning47. Pseudo-code for implementing this

approach is given in Algorithm 1. Algorithm 1 is a model-free, on-line and off-policy algorithm. It is model-free as

it does not require an explicit model of the environment, and on-line because it only utilizes the latest experience

tuple to implement the policy evaluation and improvement steps. Further, Algorithm 1 is off-policy because the

agent acts in the environment according to its behaviour policy β, but still learns its own policy, π. Observe that

the behaviour policy in Algorithm 1 is ε-greedy, in that it generates greedy actions for the most part but has a

non-zero probability, ε, of generating a random action. Note that off-policy learning is a critical component in RL as it

ensures a combination of exploration and exploitation. Finally, for Algorithm 1, it can be shown that Qi → Q∗

with probability 1 as i → ∞47.
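The updates (10) to (12) with an ε-greedy behaviour policy can be sketched for a tabular problem as follows. This is a minimal illustration under stated assumptions: the five-state chain environment `chain_step`, the hyperparameter values, and the random tie-breaking among equal Q-values are hypothetical choices made for the example and are not part of Algorithm 1.

```python
import numpy as np

def chain_step(s, a, n_states=5):
    """Hypothetical deterministic chain: action 1 moves right, action 0 left.

    Reaching the last state ends the episode with reward 1; all else gives 0.
    """
    s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    done = s_next == n_states - 1
    return s_next, (1.0 if done else 0.0), done

def q_learning(n_states=5, n_actions=2, episodes=500, alpha=0.5,
               gamma=0.9, epsilon=0.2, max_steps=100, seed=0):
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))   # terminal row is never updated
    for _ in range(episodes):
        s = 0
        for _ in range(max_steps):
            # Epsilon-greedy behaviour policy (ties broken at random).
            if rng.random() < epsilon:
                a = int(rng.integers(n_actions))
            else:
                best = np.flatnonzero(Q[s] == Q[s].max())
                a = int(rng.choice(best))
            s_next, r, done = chain_step(s, a, n_states)
            delta = r + gamma * Q[s_next].max() - Q[s, a]   # TD error (11)
            Q[s, a] += alpha * delta                        # update (10)
            s = s_next
            if done:
                break
    return Q
```

The greedy policy (12) is then read off as `Q.argmax(axis=1)`; on this chain the learned values for the "move right" action approach γ raised to the number of remaining steps minus one.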

Q-learning with Function Approximation

While Algorithm 1 enjoys strong theoretical convergence properties, it requires storing the Q-values for all state-

action pairs (s, a) ∈ S × A. For applications like control, where both S and A are infinite sets, some further

Algorithm 1 Q-learning

  Output: Action-value function Q(s, a)
  Initialize: Arbitrarily set Q, e.g., to 0 for all states; set Q for terminal states to 0
  for each episode do
    Initialize state s
    for each step of episode, while state s is not terminal do
      a ← action for s derived from Q, e.g., ε-greedy
      take action a, observe r and s′
      δ ← r + γ max_a′ Q(s′, a′) − Q(s, a)
      Q(s, a) ← Q(s, a) + αδ
      s ← s′
    end for
  end for

simplification is required. Several authors have proposed space discretization methods48. While such discretiza-

tion methods may work for simple problems, in general, they are not efficient in capturing the complex dynamics

of industrial processes.

The problem of generalizing Q-learning from finite to continuous spaces has been studied extensively over

the last two decades. The basic idea in continuous spaces is to use a function approximator, or FA. A function

approximator Q(s, a, w) is a parametric function whose parameters w are chosen to make Q(s, a, w) ≈

Q(s, a) for all (s, a) ∈ S × A. This is achieved by minimizing the following quadratic loss function L:

$$L_t(w) = \mathbb{E}_{s_t \sim \rho^{\beta}(\cdot),\, a_t \sim \beta(\cdot \mid s_t)}\left[\left(y_t - Q(s_t, a_t, w)\right)^2\right], \tag{13}$$

where yt is the target and ρ^β is a discounted state visitation distribution under behaviour policy β. The role of ρ^β

is to weight L based on how frequently a particular state is expected to be visited. Further, as in any supervised

learning problem, the target yt is given by Q∗(st, at); however, since Q∗ is unknown, it can be replaced with its

approximation. A popular choice of an approximate target is a bootstrap target (or a TD target), given as

$$y_t = \mathbb{E}_{s_{t+1} \sim p(\cdot \mid s_t, a_t)}\left[r(s_t, a_t) + \gamma \max_{a' \in \mathcal{A}} Q(s_{t+1}, a', w)\right]. \tag{14}$$

In contrast to supervised learning, where the target is typically independent of model parameters, the target in

(14) depends on the FA parameters. Finally, (13) can be minimized using a stochastic gradient descent (SGD)

algorithm. SGD is an iterative optimization method that adjusts w in the direction that most reduces

L(w). The update step for SGD is given as follows

$$w_{t+1} \leftarrow w_t - \frac{1}{2} \alpha_{c,t} \nabla L_t(w_t), \tag{15}$$

where wt and wt+1 are the old and new parameter values, respectively, and αc,t is a positive step-size parameter.

Given (13), this gradient can be calculated as follows

$$\nabla L_t(w_t) = -2\, \mathbb{E}_{s_t \sim \rho^{\beta}(\cdot),\, a_t \sim \beta(\cdot \mid s_t)}\left[\left(y_t - Q(s_t, a_t, w_t)\right) \nabla_w Q(s_t, a_t, w_t)\right]. \tag{16}$$

To derive (16), it is assumed that yt is independent of w. Note that this is a common assumption in Q-learning

with TD targets11. Finally, after updating wt+1 (and computing Q(st+1, a′, wt+1)), the optimal policy can be

computed as follows

$$\pi(s_{t+1}) = \arg\max_{a' \in \mathcal{A}} Q(s_{t+1}, a', w_{t+1}). \tag{17}$$

The pseudo-code for Q-learning with FA is given in Algorithm 2.

Algorithm 2 Q-learning with FA

  Output: Action-value function Q(s, a, w)
  Initialize: Arbitrarily set action-value function weights w (e.g., w = 0)
  for each episode do
    Initialize state s
    for each step of episode, while state s is not terminal do
      a ← action for s derived from Q, e.g., ε-greedy
      take action a, observe r and s′
      y ← r + γ max_a′ Q(s′, a′, w)
      w ← w + αc (y − Q(s, a, w)) ∇w Q(s, a, w)
      s ← s′
    end for
  end for
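The weight update in Algorithm 2 can be sketched for the simplest case of a linear function approximator, where the gradient of Q with respect to w is just the feature vector. This is a minimal sketch under stated assumptions: the feature map `phi`, the finite candidate action list `actions` (used so that the maximization is a simple enumeration), and the function names are all hypothetical.

```python
import numpy as np

def q_value(w, phi, s, a):
    """Linear FA: Q(s, a, w) = w . phi(s, a)."""
    return float(np.dot(w, phi(s, a)))

def semi_gradient_q_update(w, phi, s, a, r, s_next, actions,
                           alpha, gamma, terminal=False):
    """One step of w <- w + alpha * (y - Q(s, a, w)) * grad_w Q(s, a, w)."""
    # Bootstrap (TD) target; it is treated as a constant when differentiating.
    if terminal:
        y = r
    else:
        y = r + gamma * max(q_value(w, phi, s_next, b) for b in actions)
    td_error = y - q_value(w, phi, s, a)
    # For a linear FA, grad_w Q(s, a, w) equals the feature vector phi(s, a).
    return w + alpha * td_error * phi(s, a)
```

Treating the target as independent of w is what makes this a "semi-gradient" step; with a continuous action space the enumeration over `actions` would have to be replaced by a numerical optimization at every step.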


The effectiveness of Algorithm 2 depends on the choice of FA. Over the past decade, various FAs, both

parametric and non-parametric, have been proposed, including linear basis, Gaussian processes, radial basis, and

Fourier basis. For most of these choices, Q-learning with TD targets may be biased49. Further, unlike Algorithm

1, the asymptotic convergence of Q(s, a, w) → Q(s, a) with Algorithm 2 is not guaranteed. This is primarily

due to Algorithm 2 using an off-policy (i.e., behaviour distribution), bootstrapping (i.e., TD target) and FA

approach—a combination known in the RL community as the ‘deadly triad’11. The possibility of divergence

in the presence of the deadly triad is well known. Several examples have been published: see Tsitsiklis and

Van Roy 50 , Baird 51 , Fairbank and Alonso 52 . The root cause for the instability remains unclear—taken one

by one, the factors listed above are not problematic. There are still many open problems in off-policy learning.

Despite the lack of theoretical convergence guarantees, all three elements of the deadly triad are also necessary for

learning to be effective in practical applications. For example, FA is required for scalability and generalization,

bootstrapping for computational and data efficiency, and off-policy learning for decoupling the behaviour policy

from the target policy. Despite the limitations of Algorithm 2, recently, Mnih et al. 20,19 have successfully adapted

Q-learning with FAs to learn to play Atari games from pixels. For details, see Kober et al. 53 .

Policy Gradient

While Algorithm 2 generalizes Q-learning to continuous spaces, the method lacks convergence guarantees except

with linear FAs, where it has been shown not to diverge. Moreover, target calculations in (14) and greedy

action calculations in (17) require maximization of the Q-function over the action space. Such optimization

steps are computationally impractical for large and unconstrained FAs, and for continuous action spaces. Control

applications typically have both these complicating characteristics, making Algorithm 2 nontrivial to implement.

Algorithm 3 Policy Gradient – REINFORCE

  Output: Optimal policy π(a|s, θ)
  Initialize: Arbitrarily set policy parameters θ
  for true do
    generate an episode s0, a0, r1, . . . , sT−1, aT−1, rT, following π(·|·, θ)
    for each step t of episode 0, 1, . . . , T − 1 do
      Rt ← return from step t
      θ ← θ + αa,t γ^t Rt ∇θ log π(at|st, θ)
    end for
  end for
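A Monte Carlo policy gradient loop in the spirit of Algorithm 3 can be sketched as below. The setting is deliberately simplified and entirely hypothetical: a state-independent softmax policy over a few discrete actions, rewards drawn around fixed means `reward_means`, and a fixed episode length. The update uses the score function ∇θ log π, which for a softmax policy is the one-hot vector of the chosen action minus the action probabilities.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - np.max(x))
    return z / z.sum()

def grad_log_softmax(theta, a):
    """Score function of a state-independent softmax policy: one_hot(a) - pi."""
    g = -softmax(theta)
    g[a] += 1.0
    return g

def reinforce(reward_means, episodes=2000, horizon=5, alpha=0.05,
              gamma=0.9, seed=0):
    rng = np.random.default_rng(seed)
    means = np.asarray(reward_means, dtype=float)
    theta = np.zeros(len(means))
    for _ in range(episodes):
        # Generate one episode by following the current policy.
        actions = rng.choice(len(theta), size=horizon, p=softmax(theta))
        rewards = means[actions] + 0.1 * rng.standard_normal(horizon)
        # Returns R_t, computed backwards in time.
        running, returns = 0.0, np.zeros(horizon)
        for t in reversed(range(horizon)):
            running = rewards[t] + gamma * running
            returns[t] = running
        # Gradient ascent step for every time step of the episode.
        for t in range(horizon):
            score = grad_log_softmax(theta, actions[t])
            theta = theta + alpha * gamma**t * returns[t] * score
    return theta
```

Calling `softmax(reinforce([0.0, 1.0]))` should place most of the probability on the second action. No baseline is subtracted from the returns here, so the gradient estimates are noisy.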

Instead of approximating Q∗ and then computing π (see Algorithms 1 and 2), an alternative approach is to

directly compute the optimal policy, π∗, without consulting the optimal Q-function. The Q-function may still

be used to learn the policy, but is not required for action selection. Policy gradient methods are reinforcement

learning algorithms that work directly in the policy space. Two separate formulations are available: the average

reward formulation and the start state formulation. In the start state formulation, the goal of an agent is to

obtain a policy πθ, parameterized by θ ∈ R^{nθ}, that maximizes the value of starting at state s0 ∈ S and following

policy πθ. For a given policy, πθ, the performance of the agent can be evaluated as follows:

$$J(\pi_\theta) = V^{\pi_\theta}(s_0) = \mathbb{E}_{h \sim p^{\pi}(\cdot)}\left[R_1(h) \mid s_0\right]. \tag{18}$$

Observe that the agent’s performance in (18) is completely described by the policy parameters in θ. A policy

gradient algorithm maximizes (18) by computing an optimal value of θ. A stochastic gradient ascent (SGA)

algorithm incrementally adjusts θ in the direction of ∇θJ(πθ), such that

$$\theta_{t+1} \leftarrow \theta_t + \alpha_{a,t} \nabla_\theta J(\pi_\theta)\big|_{\theta=\theta_t}, \tag{19}$$

where αa,t is the learning rate. Note that calculating ∇θJ(πθ) requires access to the distribution of states under

the current policy, which as noted earlier, is unknown for most processes. The implementation of policy gradient is

made effective by the policy gradient theorem that calculates a closed-form solution for ∇θJ(πθ) without reference

to the state distribution54. The policy gradient theorem establishes that

$$\nabla_\theta J(\pi_\theta) = \mathbb{E}_{s_t \sim \rho^{\pi}_{\gamma}(\cdot),\, a_t \sim \pi_\theta(\cdot \mid s_t)}\left[Q^{\pi}(s_t, a_t)\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right], \tag{20}$$

where $\rho^{\pi}_{\gamma}(s) := \sum_{t=0}^{\infty} \gamma^{t} p(s_t = s \mid s_0, \pi_\theta)$ is the discounted state visitation distribution and p(st|s0, π) is a t-step ahead state transition density from s0 to st. Equation (20) gives a closed-form solution for the gradient in (19) in terms of the Q-function and the gradient of the policy being evaluated. Further, (20) assumes a stochastic policy (observe the expectation over the action space). Now, since the expectation is over the policy being evaluated, i.e., πθ, (20) is an on-policy gradient. Note that an off-policy gradient theorem can also be derived for a class of stochastic policies. See Degris et al. 55 for details.

To implement (19) with the policy gradient theorem in (20), we replace the expectation in (20) with its sample-based estimate, and replace Qπ with the actual returns, Rt. This leads to a policy gradient algorithm called REINFORCE56. Pseudo-code for REINFORCE is given in Algorithm 3. In contrast to Algorithms 1 and 2, Algorithm 3 avoids solving complex optimization problems over continuous action spaces and generalizes effectively in continuous spaces. Further, unlike Q-learning, which always learns a deterministic greedy policy, Algorithm 3 supports both deterministic and stochastic policies. Finally, Algorithm 3 exhibits good convergence properties, with the parameter estimate in (19) guaranteed to converge to a local optimum if the estimate of ∇θJ(πθ) is unbiased (see Sutton et al. 54 for a detailed proof).
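
As an illustration of Algorithm 3, the following is a minimal sketch of REINFORCE on a toy problem that is not taken from the paper: a single-state environment with two actions, where action 1 yields reward 1 and action 0 yields reward 0. The softmax policy, the episode length, the step size and the function names are all assumptions made for this example.

```python
import math
import random


def softmax(prefs):
    """Softmax policy pi(a | theta) over action preferences (hypothetical parameterization)."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce(num_episodes=2000, horizon=5, alpha=0.05, gamma=0.9, seed=0):
    """REINFORCE on a toy single-state problem: action 1 pays 1, action 0 pays 0."""
    rng = random.Random(seed)
    theta = [0.0, 0.0]
    for _ in range(num_episodes):
        # Step 4: generate an episode following pi(.|., theta).
        probs = softmax(theta)
        actions = [0 if rng.random() < probs[0] else 1 for _ in range(horizon)]
        rewards = [1.0 if a == 1 else 0.0 for a in actions]
        # Step 6: discounted return from each step t, computed backwards.
        returns = [0.0] * horizon
        running = 0.0
        for t in reversed(range(horizon)):
            running = rewards[t] + gamma * running
            returns[t] = running
        # Step 7: theta <- theta + alpha * gamma^t * R_t * grad log pi(a_t | theta).
        for t in range(horizon):
            probs = softmax(theta)
            for k in range(len(theta)):
                grad_log = (1.0 if k == actions[t] else 0.0) - probs[k]
                theta[k] += alpha * (gamma ** t) * returns[t] * grad_log
    return softmax(theta)
```

Calling `reinforce()` returns the learned action probabilities; under these assumptions the probability assigned to the rewarding action should approach 1 as the number of episodes grows.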

Despite the advantages of Algorithm 3 over traditional Q-learning, Algorithm 3 is not amenable to online

implementation as it requires access to Rt – the total reward the agent is expected to receive at the end of an

episode. Further, replacing the Q-function by Rt leads to large variance in the estimation of ∇θJ(πθ)54,57, which

in turn leads to slower convergence58. An approach to address the issues mentioned above is to use a low-variance,

bootstrapped estimate of the Q-function, as in the actor-critic architecture.

Figure 2: A schematic of the actor-critic architecture

Actor-Critic Architecture

The actor-critic is a widely used architecture that combines the advantages of policy gradient with Q-learning54,55,59. Like policy gradient, actor-critic methods generalize to continuous spaces, while the issue of large variance is countered by bootstrapping, as in Q-learning with TD updates. A schematic of the actor-critic architecture is shown in Figure 2. The actor-critic architecture consists of two eponymous components: an actor that finds an optimal policy, and a critic that evaluates the current policy prescribed by the actor. The actor implements the policy gradient method by adjusting policy parameters using SGA, as shown in (19). The critic approximates the Q-function in (20) using an FA. With a critic, (20) can be approximately written as follows:

$$\nabla_\theta J(\pi_\theta) \approx \mathbb{E}_{s_t \sim \rho^{\pi}_{\gamma}(\cdot),\, a_t \sim \pi_\theta(\cdot \mid s_t)}\left[Q^{\pi}(s_t, a_t, w)\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right], \tag{21}$$

where w ∈ R^{nw} is recursively estimated by the critic using SGD in (15) and θ ∈ R^{nθ} is recursively estimated by the actor using SGA by substituting (21) into (19). Observe that while the actor-critic method combines the policy gradient with Q-learning, the policy is not directly inferred from Q-learning, as in (17). Instead, the policy is updated in the policy gradient direction in (19). This avoids the costly optimization in Q-learning, and also ensures that changes in the Q-function only result in small changes in the policy, leading to little or no oscillatory behaviour in the policy. Finally, under certain conditions, implementing (19) with (21) guarantees that θ converges to a locally optimal policy54,55.

Deterministic Actor-Critic Method

For a stochastic policy πθ, calculating the gradient in (20) requires integration over the space of states and actions.

As a result, computing the policy gradient for a stochastic policy may require many samples, especially in high-

dimensional action spaces. To allow for efficient calculation of the policy gradient in (20), Silver et al. 45 propose

a deterministic policy gradient (DPG) framework. This assumes that the agent follows a deterministic policy

µθ : S → A. For a deterministic policy, at = µθ(st) with probability 1, where θ ∈ Rnθ is the policy parameter.

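The corresponding performance in (18) can be written as follows

$$J(\mu_\theta) = V^{\mu_\theta}(s_0) = \mathbb{E}_{h \sim p^{\mu}(\cdot)}\left[\sum_{t=1}^{\infty} \gamma^{t-1} r(s_t, \mu_\theta(s_t)) \,\middle|\, s_0\right]. \tag{22}$$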

Similar to the policy gradient theorem in (20), Silver et al. 45 proposed a deterministic gradient theorem to

calculate the gradient of (22) with respect to the policy parameter θ:

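$$\nabla_\theta J(\mu_\theta) = \mathbb{E}_{s_t \sim \rho^{\mu}_{\gamma}(\cdot)}\left[\nabla_a Q^{\mu}(s_t, a)\big|_{a=\mu_\theta(s_t)}\, \nabla_\theta \mu_\theta(s_t)\right], \tag{23}$$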

where ρµγ is the discounted state distribution under policy µθ (similar to ρπγ in (20)). Note that unlike (20),

the gradient in (23) only involves expectation over the states generated according to µθ. This makes the DPG

framework computationally more efficient to implement compared to the stochastic policy gradient.
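
To make the chain rule in (23) concrete, the sketch below estimates the deterministic policy gradient on a toy problem that is assumed purely for illustration: a scalar state, a linear policy µθ(s) = θs, and a known Q-function Q(s, a) = −(a − ks)², so that the best action is a = ks. None of these choices come from the paper. The expectation over states is replaced by a sample average.

```python
import random


def dpg_gradient(theta, states, k=2.0):
    """Sample estimate of (23) for mu_theta(s) = theta * s and Q(s, a) = -(a - k*s)**2."""
    total = 0.0
    for s in states:
        a = theta * s                 # deterministic action a = mu_theta(s)
        dq_da = -2.0 * (a - k * s)    # grad_a Q(s, a) evaluated at a = mu_theta(s)
        dmu_dtheta = s                # grad_theta mu_theta(s)
        total += dq_da * dmu_dtheta
    return total / len(states)


def run_dpg(steps=200, alpha=0.1, batch=32, seed=0):
    """Gradient ascent on theta, as in (19), using the estimate above."""
    rng = random.Random(seed)
    theta = 0.0
    for _ in range(steps):
        states = [rng.uniform(-1.0, 1.0) for _ in range(batch)]
        theta += alpha * dpg_gradient(theta, states)
    return theta
```

Because the expectation is taken over states only, each update needs one action evaluation per sampled state; in this toy setting `run_dpg()` should drive θ towards k.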

To ensure that the DPG continues to explore the state and action spaces satisfactorily, it is possible to

implement DPG off-policy. For a stochastic behaviour policy β and a deterministic policy µθ, Silver et al. 45

showed that a DPG exists and can be analytically calculated as

$$\nabla_\theta J(\mu_\theta) = \mathbb{E}_{s_t \sim \rho^{\beta}_{\gamma}(\cdot)}\left[\nabla_a Q^{\mu}(s_t, a)\big|_{a=\mu_\theta(s_t)}\, \nabla_\theta \mu_\theta(s_t)\right], \tag{24}$$

where $\rho^{\beta}_{\gamma}$ is the discounted state distribution under behaviour policy β. Compared to (23), the off-policy DPG in (24) involves expectation with respect to the states generated by a behaviour policy β. Finally, the off-policy DPG can be implemented using the actor-critic architecture. The policy parameters, θ, can be recursively updated by the actor using (19), where ∇θJ(µθ) is given in (24); and the Q-function, Qµ(st, a), in (24) is replaced with a critic, Qµ(st, a, w), whose parameter vector w is recursively estimated using (15). Pseudo-code for the off-policy deterministic actor-critic is given in Algorithm 4.

Algorithm 4 Deterministic Off-policy Actor-Critic Method

1: Output: Optimal policy µθ(s)
2: Initialize: Arbitrarily set policy parameters θ and Q-function weights w
3: for true do
4:   initialize s, the first state of the episode
5:   for s is not terminal do
6:     a ∼ β(·|s)
7:     take action a, observe s′ and r
8:     y ← r + γQµ(s′, µθ(s′), w)
9:     w ← w + αw(y − Qµ(s, µθ(s), w))∇wQµ(s, a, w)
10:    θ ← θ + αθ∇θµθ(s)∇aQµ(s, a, w)|a=µθ(s)
11:    s ← s′
12:   end for
13: end for

Figure 3: A deep neural network representation of (a) the actor, and (b) the critic. The red circles represent the input and output layers and the black circles represent the hidden layers of the network.
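
A single pass through Steps 8 to 10 of Algorithm 4 can be sketched as follows. The linear policy µθ(s) = θs, the quadratic critic features and the step sizes are hypothetical choices for a scalar state and action. The temporal-difference error is evaluated at the action actually taken, as in Step 18 of Algorithm 5; that choice is an assumption of this sketch.

```python
def features(s, a):
    """Hypothetical quadratic features for a linear-in-weights critic."""
    return [s * s, s * a, a * a]


def q_value(s, a, w):
    """Critic Q(s, a, w) = w . phi(s, a)."""
    return sum(wi * fi for wi, fi in zip(w, features(s, a)))


def actor_critic_step(s, a, r, s_next, theta, w,
                      gamma=0.9, alpha_w=0.1, alpha_theta=0.1):
    """One deterministic off-policy actor-critic update for mu_theta(s) = theta * s."""
    # Step 8: bootstrapped target using the target policy at the next state.
    y = r + gamma * q_value(s_next, theta * s_next, w)
    # Step 9: critic update along grad_w Q(s, a, w) = phi(s, a).
    td_error = y - q_value(s, a, w)
    w = [wi + alpha_w * td_error * fi for wi, fi in zip(w, features(s, a))]
    # Step 10: actor update with grad_theta mu = s and grad_a Q at a = mu_theta(s).
    a_mu = theta * s
    dq_da = w[1] * s + 2.0 * w[2] * a_mu
    theta = theta + alpha_theta * s * dq_da
    return theta, w
```

The behaviour action `a` is supplied by the caller, which is what makes the update off-policy: the actor gradient is evaluated at µθ(s) rather than at the action that was executed.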

Deep Reinforcement Learning (DRL) Controller

In this section, we connect the terminologies and methods for RL discussed in Section b, and propose a new

controller, referred to as a DRL controller. The proposed DRL controller is a model-free controller based on

DPG. The DRL controller is implemented using the actor-critic architecture and uses two independent deep

neural networks to generalize the actor and critic to continuous state and action spaces.

States, Actions and Rewards

We consider discrete dynamical systems with input and output sequences ut and yt, respectively. For simplicity,

we focus on the case where the outputs yt contain full information on the state of the system to be controlled.

Removing this hypothesis is an important practical element of our ongoing research in this area.

The correspondence between the agent’s action in the RL formulation and the plant input from the control

perspective is direct: we identify at = ut. The relationship between the RL state and the state of the plant is

subtler. The RL state, st ∈ S, must capture all the features of the environment on the basis of which the RL

agent acts. To ensure that the agent has access to relevant process information, we define the RL state as a tuple

of the current and past outputs, past actions and current deviation from the set-point ysp, such that

$$s_t := \langle y_t, \ldots, y_{t-d_y},\, a_{t-1}, \ldots, a_{t-d_a},\, (y_t - y_{sp}) \rangle, \tag{25}$$

where dy ∈ N and da ∈ N denote the number of past output and input values, respectively. In this paper, it is

assumed that dy and da are known a priori. Note that for da = 0, dy = 0, the RL state is ‘memoryless’, in that

$$s_t := \langle y_t, (y_t - y_{sp}) \rangle. \tag{26}$$
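
The state definition in (25) and its memoryless special case (26) can be sketched for a SISO process as below. The function name, the argument layout (histories stored oldest first) and the restriction to scalar outputs are assumptions of this example.

```python
def build_state(outputs, actions, y_sp, d_y=0, d_a=0):
    """Assemble the RL state in (25) for a SISO process.

    outputs: output history [..., y_{t-1}, y_t], oldest first
    actions: action history [..., a_{t-2}, a_{t-1}], oldest first
    """
    if len(outputs) < d_y + 1 or len(actions) < d_a:
        raise ValueError("history is too short for the requested d_y and d_a")
    y_hist = [outputs[-1 - k] for k in range(d_y + 1)]   # y_t, ..., y_{t-d_y}
    a_hist = [actions[-1 - k] for k in range(d_a)]       # a_{t-1}, ..., a_{t-d_a}
    return tuple(y_hist + a_hist + [outputs[-1] - y_sp])
```

With `d_y = 0` and `d_a = 0` the function returns the two-element state in (26).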

Further, we explore only deterministic policies expressed using a single-valued function µ : S → A, so that for

each state st ∈ S, the probability measure on A defining the next action puts weight 1 on the single point µ(st).

The goal for the agent in a set-point tracking problem is to find an optimal policy, µ, that reduces the tracking

error. This objective is incorporated in the RL agent by means of a reward function r : S × A × S → R, whose

aggregate value the agent tries to maximize. In contrast with MPC, where the controller minimizes the tracking

error over the space of control actions, here the agent maximizes the reward it receives over the space of policies.

We consider two reward functions.

The first, an ℓ1-reward function, measures the negative ℓ1-norm of the tracking error. Mathematically, for a multi-input and multi-output (MIMO) system with ny outputs, the ℓ1-reward is

$$r(s_t, a_t, s_{t+1}) = -\sum_{i=1}^{n_y} |y_{i,t} - y_{i,sp}|, \tag{27}$$

where yi,t ∈ Y is the i-th output and yi,sp ∈ Y is the set-point for the i-th output. Variants of the ℓ1-reward function are presented and discussed in Section b. The second reward, a polar reward function, assigns a reward of 0 if the tracking error decreases from one sampling time to the next for all ny outputs, and −1 otherwise. Mathematically, the polar reward function is

$$r(s_t, a_t, s_{t+1}) = \begin{cases} 0, & \text{if } |y_{i,t} - y_{i,sp}| > |y_{i,t+1} - y_{i,sp}| \ \ \forall i \in \{1, \ldots, n_y\}, \\ -1, & \text{otherwise.} \end{cases} \tag{28}$$

Observe that the polar reward (28) incentivizes gradual improvements in tracking performance, which leads to a less aggressive control strategy and smoother tracking compared to the ℓ1-reward in (27).
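
The two reward functions in (27) and (28) can be written directly in code. The sketch below assumes that outputs and set-points are passed as equal-length sequences; the function names are illustrative.

```python
def l1_reward(y, y_sp):
    """Negative l1-norm of the tracking error, as in (27)."""
    return -sum(abs(yi - spi) for yi, spi in zip(y, y_sp))


def polar_reward(y, y_next, y_sp):
    """Polar reward in (28): 0 if every output moved closer to its set-point, else -1."""
    improving = all(
        abs(yi - spi) > abs(yni - spi)
        for yi, yni, spi in zip(y, y_next, y_sp)
    )
    return 0.0 if improving else -1.0
```

Note that the polar reward needs both the current and the next output, whereas the ℓ1-reward depends on the current output only.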

Policy and Q-function Approximations

In this section, we discuss the FAs used by the DRL controller to approximate the policy and the Q-function. We

use neural networks to generalize the policy and the Q-function to continuous state and action spaces (see b for

basics of neural networks). The policy, µ, is represented using a deep feed-forward neural network, parameterized

by weights Wa ∈ Rna , such that given st ∈ S and Wa, the policy network produces an output at = µ(st,Wa).

Similarly, the Q-function is also represented using a deep neural network, parameterized by weights Wc ∈ Rnc ,

such that given (st, at) ∈ S ×A and Wc, the Q-network outputs Qµ(st, at,Wc).

Figure 4: A schematic of the proposed DRL controller architecture for set-point tracking problems.

Learning Control Policies

As noted, the DRL controller uses a deterministic off-policy actor-critic architecture to learn the control policy in set-point tracking problems. The proposed method is similar to Algorithm 4 or the one proposed by Lillicrap et al. 18, but modified for set-point tracking problems. In the proposed DRL controller architecture, the actor is represented by µ(st,Wa) and the critic is represented by Q(st, at,Wc). The deep neural network representations of the actor and critic are illustrated in Figure 3. The critic network predicts Q-values for each state-action pair, and the actor network proposes an action for a given state. The goal is then to learn the actor and critic neural network parameters by interacting with the process plant. Once the networks are trained, the actor network is used to compute the optimal action for any given state.

For a given actor-critic architecture, the network parameters can be readily estimated using SGD (or SGA); however, these methods are not effective in applications where the state, action and reward sequences are temporally correlated. This is because SGD provides a 1-sample network update, assuming that the samples are independently and identically distributed (iid). For dynamic systems, a tuple (st, at, rt, st+1) may be temporally correlated with past tuples (e.g., (st−1, at−1, rt−1, st)), up to the Markov order, so network updates that assume iid samples are not effective.

To make learning more effective for dynamic systems, we propose to break the inherent temporal correlations between the tuples by randomizing them. To this effect, we use batch SGD, rather than single-sample SGD, for network training. As in DQN, we use a replay memory (RM) for batch training of the networks. The RM is a finite memory that stores a large number of tuples, denoted as $\{(s^{(i)}, a^{(i)}, r^{(i)}, s'^{(i)})\}_{i=1}^{K}$, where K is the size of the RM. The RM is a queue data structure: the latest tuples are always stored in the RM, and the oldest tuples are discarded to keep the cache size constant. At each time step, the network is updated by uniformly sampling M (M ≤ K) tuples from the RM. The critic update in Algorithm 4 with RM results in a batch stochastic gradient, with the following update step

$$W_c \leftarrow W_c + \frac{\alpha_c}{M} \sum_{i=1}^{M} \left(y^{(i)} - Q^{\mu}(s^{(i)}, \mu(s^{(i)}, W_a), W_c)\right) \nabla_{W_c} Q^{\mu}(s^{(i)}, \mu(s^{(i)}, W_a), W_c), \tag{29}$$

where

$$y^{(i)} \leftarrow r^{(i)} + \gamma Q^{\mu}(s'^{(i)}, \mu(s'^{(i)}, W_a), W_c), \tag{30}$$

for all i = 1, . . . , M. A batch update similar to (29) can also be derived for the actor network using an RM.
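
A replay memory with the queue behaviour described above can be sketched with a bounded deque. The class and method names are illustrative and are not taken from the authors' implementation.

```python
import random
from collections import deque


class ReplayMemory:
    """Fixed-size queue of (s, a, r, s_next) tuples with uniform sampling."""

    def __init__(self, capacity, seed=None):
        # A deque with maxlen discards the oldest tuple once capacity K is reached.
        self.buffer = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def store(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size):
        # Uniform sampling without replacement of M <= len(buffer) tuples.
        return self.rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)
```

Sampling uniformly from the stored tuples, rather than using the most recent ones in order, is what breaks the temporal correlation between consecutive samples in a batch.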

We propose another strategy to further stabilize the actor-critic architecture. First, observe that the parameters of the critic network, Wc, being updated in (29) are also used in calculating the target, y, in (30). Recall that, in supervised learning, the actual target is independent of Wc; however, as discussed in Section b, in the absence of actual targets, we use Wc in (30) to approximate the target values. Now, if the Wc updates in (29) are erratic, then the target estimates in (30) are also erratic, and may cause the network to diverge, as observed in Lillicrap et al. 18. We propose to use a separate network, called a target network, to estimate the target in (30). Our solution is similar to the target network used in Lillicrap et al. 18 and Mnih et al. 19. Using a target network, parameterized by W′c ∈ R^{nc}, (30) can be written as follows

$$y^{(i)} \leftarrow r^{(i)} + \gamma Q^{\mu}(s'^{(i)}, \mu(s'^{(i)}, W'_a), W'_c), \tag{31}$$

where

$$W'_c \leftarrow \tau W_c + (1 - \tau) W'_c, \tag{32}$$

and 0 < τ < 1 is the target network update rate. Observe that the parameters of the target network in (32) are updated by having them slowly track the parameters of the critic network, Wc. In fact, (32) ensures that the target values change slowly, thereby improving the stability of learning. A similar target network, with parameters W′a as in (31), can also be used for stabilizing the actor network.
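
The soft update in (32) is a convex combination of the current and target parameters. A minimal sketch, assuming the weights are stored as flat lists of floats:

```python
def soft_update(target_weights, weights, tau=1e-3):
    """Target-network update (32): W' <- tau * W + (1 - tau) * W'."""
    return [tau * w + (1.0 - tau) * wt for w, wt in zip(weights, target_weights)]
```

With a small τ, such as the nominal value of 10^−3, the target parameters move only a fraction of the way towards the current parameters at each step.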

Note that while we consider the action space to be continuous, in practice, it is also often bounded. This is

because the controller actions typically involve changing bounded physical quantities, such as flow rates, pressure

and pH. To ensure that the actor network produces feasible controller actions, it is important to bound the

network over the feasible action space. Note that if the networks are not constrained then the critic network will

continue providing gradients that encourage the actor network to take actions beyond the feasible space.

In this paper, we assume that the action space, A, is an interval, such that A = [aL, aH], where aL < aH (the inequality holds element-wise for systems with multiple inputs). One approach to enforce the constraints on the network is to bound the output layer of the actor network. This is done by clipping the gradients used by the actor network in the update step (see Step 10 in Algorithm 4). For example, using the clipping gradient method proposed in60, the gradient used by the actor, ∇aQµ(s, a, w), can be clipped as follows

$$\nabla_a Q^{\mu}(s, a, w) \leftarrow \nabla_a Q^{\mu}(s, a, w) \times \begin{cases} (a_H - a)/(a_H - a_L), & \text{if } \nabla_a Q^{\mu}(s, a, w) \text{ increases } a, \\ (a - a_L)/(a_H - a_L), & \text{otherwise.} \end{cases} \tag{33}$$

Note that in (33), the gradients are down-scaled as the controller action approaches the boundaries of its range, and are inverted if the controller action exceeds the range. With (33) in place, even if the critic continually recommends increasing the controller action, it will converge to its upper bound, aH. Similarly, if the critic then decides to decrease the controller action, it will decrease immediately. Using (33) ensures that the controller actions are within the feasible space. Putting together all the improvement strategies discussed in this section, a schematic of the proposed DRL controller architecture for set-point tracking is shown in Figure 4, and the pseudo-code is provided in Algorithm 5.
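
The scaling rule in (33) can be sketched for a scalar action as follows. The function name is illustrative, and a positive gradient is taken to mean that the update would increase the action, which is consistent with the gradient ascent step of the actor.

```python
def clip_action_gradient(grad, a, a_low, a_high):
    """Scale the critic's action gradient according to (33) for a scalar action."""
    width = a_high - a_low
    if grad > 0.0:
        # Gradient would increase a: shrink it as a approaches a_high,
        # and flip its sign once a exceeds a_high.
        return grad * (a_high - a) / width
    # Gradient would decrease a: shrink it as a approaches a_low,
    # and flip its sign once a falls below a_low.
    return grad * (a - a_low) / width
```

For example, with the action space [0, 100] a unit gradient at a = 75 is scaled to 0.25, and at a = 110 it becomes −0.1, pushing the action back into the feasible range.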

As outlined in Algorithm 5, first we randomly initialize the parameters of the actor and critic networks (Step

2). We then create a copy of the network parameters and initialize the parameters of the target networks (Step

3). The final initialization step is to supply the RM with a large enough collection of tuples (s, a, s′, r) on which

to begin training the RL controller (Step 4). Each episode is preceded by two steps. First, we ensure the state

definition st in (25) is defined for the first dy steps by initializing a sequence of actions 〈a−1, . . . , a−n〉 along with

the corresponding sequence of outputs 〈y0, . . . , y−n+1〉 where n ∈ N is sufficiently large. A set-point is then fixed

for the ensuing episode (Steps 6-7).

Next, for a given user-defined set-point, the actor, µ(s,Wa), is queried, and a control action is implemented on the process (Steps 8–10). Implementing µ(s,Wa) on the process generates a new output yt+1. The controller then receives a reward r, which is the last piece of information needed to store the updated tuple (s, a, r, s′) in the RM (Steps 11–13). M uniformly sampled tuples are generated from the RM to update the actor and critic networks using batch gradient methods (Steps 14–23). Finally, for a fixed τ, we update the target actor and target critic networks in Steps 24–25. The above steps are then repeated until the end of the episode.

Algorithm 5 Deep Reinforcement Learning Controller

1: Output: Optimal policy µ(s, Wa)
2: Initialize: Wa, Wc to random initial weights
3: Initialize: W′a ← Wa and W′c ← Wc
4: Initialize: Replay memory with random policies
5: for each episode do
6:   Initialize an output history ⟨y0, . . . , y−n+1⟩ from an action history ⟨a−1, . . . , a−n⟩
7:   Set ysp ← set-point from the user
8:   for each step t of episode 0, 1, . . . , T − 1 do
9:     Set s ← ⟨yt, . . . , yt−dy, at−1, . . . , at−da, (yt − ysp)⟩
10:    Set at ← µ(s, Wa) + N
11:    Take action at, observe yt+1 and r
12:    Set s′ ← ⟨yt+1, . . . , yt+1−dy, at, . . . , at+1−da, (yt+1 − ysp)⟩
13:    Store tuple (s, at, s′, r) in RM
14:    Uniformly sample M tuples from RM
15:    for i = 1 to M do
16:      Set y(i) ← r(i) + γQµ(s′(i), µ(s′(i), W′a), W′c)
17:    end for
18:    Set Wc ← Wc + (αc/M) Σ_{i=1}^{M} (y(i) − Qµ(s(i), a(i), Wc)) ∇Wc Qµ(s(i), a(i), Wc)
19:    for i = 1 to M do
20:      Calculate ∇aQµ(s(i), a, Wc)|a=a(i)
21:      Clip ∇aQµ(s(i), a, Wc)|a=a(i) using (33)
22:    end for
23:    Set Wa ← Wa + (αa/M) Σ_{i=1}^{M} ∇Wa µ(s(i), Wa) ∇aQµ(s(i), a, Wc)|a=a(i)
24:    Set W′a ← τWa + (1 − τ)W′a
25:    Set W′c ← τWc + (1 − τ)W′c
26:   end for
27: end for

To ensure that the proposed method adequately explores the state and action spaces, the behavior policy for the

agent is constructed by adding noise sampled from a noise process, N , to our actor policy, µ (see Step 10 in

Algorithm 5). Generally, N is to be chosen to suit the process. We use a zero-mean Ornstein-Uhlenbeck (OU)

process61 to generate temporally correlated exploration samples; however, the user can define their own noise

process. We also allow for random initialization of system outputs at the start of an episode to ensure that the

policy is not stuck in a local optimum. Finally, it is important to highlight that Algorithm 5 is a fully automatic

algorithm that learns the control policy in real-time by continuously interacting with the process.
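
The exploration noise in Step 10 of Algorithm 5 can be sketched with a discretized zero-mean OU process. The Euler-Maruyama discretization, the unit time step and the class interface below are assumptions of this example; only the values θ = 0.15 and σ = 0.30 come from Table 1. Note that θ here is the OU mean-reversion rate, not the policy parameter.

```python
import math
import random


class OUNoise:
    """Euler-Maruyama discretization of a scalar Ornstein-Uhlenbeck process."""

    def __init__(self, theta=0.15, sigma=0.30, mu=0.0, dt=1.0, x0=0.0, seed=None):
        self.theta = theta   # mean-reversion rate
        self.sigma = sigma   # noise scale
        self.mu = mu         # long-run mean (zero for a zero-mean process)
        self.dt = dt         # assumed sampling interval
        self.x = x0
        self.rng = random.Random(seed)

    def sample(self):
        drift = self.theta * (self.mu - self.x) * self.dt
        diffusion = self.sigma * math.sqrt(self.dt) * self.rng.gauss(0.0, 1.0)
        self.x += drift + diffusion
        return self.x
```

Because each sample depends on the previous one, consecutive exploration perturbations are temporally correlated, unlike independent Gaussian noise.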

Network Structure and Implementation

The actor and critic neural networks each had 2 hidden layers with 400 and 300 units, respectively. We initialized the actor and critic networks using uniform Xavier initialization62. Each unit was modeled using a rectified linear (ReLU) activation function. Example b differs slightly: the output layers were initialized from a uniform distribution over [−0.003, 0.003], and we elected to use a tanh activation for the second hidden layer. For all examples, the network hidden layers were batch-normalized using the method in Ioffe and Szegedy 63 to ensure that the training is effective in processes where variables have different physical units and scales. Finally, we implemented the Adam optimizer in Kingma and Ba 64 to train the networks and regularized the weights and biases of the second hidden layer with an L2 weight decay of 0.0001.

Algorithm 5 was scripted in Python and implemented on an iMac (3.2 GHz Intel Core i5, 8 GB RAM). The

deep networks for the actor and critic were built using Tensorflow65. For Example b, we trained the networks

on Amazon g2.2xlarge EC2 instances; the matrix multiplications were performed on a Graphics Processing

Unit (GPU) available in the Amazon g2.2xlarge EC2 instances. The hyper-parameters for Algorithm 5 are

process-specific and need to be selected carefully. We list in Table 1 the nominal values for hyper-parameters used

in all of our numerical examples (see Section b); additional specifications are listed at the end of each example.

An optimal selection of hyper-parameters is crucial, but is beyond the scope of the current work.

Table 1: The hyper-parameters used in Algorithm 5.

Hyper-parameter              Symbol   Nominal value
Actor learning rate          αa       10^−4
Critic learning rate         αc       10^−4
Target network update rate   τ        10^−3
OU process parameters        N        θ = 0.15, σ = 0.30

DRL Controller Tuning

In this section, we provide some general guidelines and recommendations for implementing the DRL controller

(see Algorithm 5) for set-point tracking problems.

(a) Episodes: To begin an episode, we randomly sample an action from the action space and let the system settle under that action before starting the time steps in the main algorithm. For the examples considered in Section b, each episode is terminated after 200 time steps or when |yi,t − yi,sp| ≤ ε for all 1 ≤ i ≤ ny for 5 consecutive time steps, where ε is a user-defined tolerance. For processes with large time constants, it is recommended to run the episodes longer to ensure that the transient and steady-state dynamics are effectively captured in the input-output data.

(b) Rewards: We consider the reward hypotheses (27) and (28) in our examples. A possible variant of (27) is given as follows

$$r(s_t, a_t, s_{t+1}) = \begin{cases} c, & \text{if } |y_{i,t} - y_{i,sp}| \le \varepsilon \ \ \forall i \in \{1, 2, \ldots, n_y\}, \\ -\sum_{i=1}^{n_y} |y_{i,t} - y_{i,sp}|, & \text{otherwise,} \end{cases} \tag{34}$$

where yi,t ∈ Y is the i-th output, yi,sp ∈ Y is the set-point for the i-th output, c ∈ R+ is a constant reward, and ε ∈ R+ is a user-defined tolerance. According to (34), the agent receives the maximum reward, c, only if all the outputs are within the tolerance limit set by the user. The potential advantage of (34) over (27) is that it can lead to faster tracking. On the other hand, the control strategy learned by the RL agent is generally more aggressive. Ultimately, we prefer the ℓ1-reward function due to the gradually increasing behavior of the function as the RL agent learns. Note that under this reward hypothesis the tolerance ε in (34) should be the same as the early termination tolerance described previously. These hypotheses are well-suited for set-point tracking as they use tracking error to generate rewards; however, in other problems, it is imperative to explore other reward hypotheses. Some examples include: (i) the negative of the 2-norm of the control error instead of the 1-norm in (27); (ii) weighted rewards for different outputs in a MIMO system; and (iii) an economic reward function.

(c) Gradient-clipping: The gradient-clipping in (33) ensures that the DRL controller outputs are feasible and within the operating range. Also, defining tight lower and upper limits on the inputs has a significant effect on the rate of learning. In general, setting tight limits on the inputs leads to faster convergence of the DRL controller.

(d) RL state: The definition of an RL state is critical as it affects the performance of the DRL controller. As

shown in (25), we use a tuple of current and past outputs, past actions and current deviation from the set-point

as the RL state. While (25) is well-suited for set-point tracking problems, it is instructive to highlight that the

choice of an RL state is not unique, as there may be other RL states for which the DRL controller could have

a similar or better performance. In our experience, including additional relevant information in the RL states

improves the overall performance and convergence-rate of the DRL controller. The optimal selection of an RL

state is not within the scope of the paper; however, it certainly is an important area of research that warrants

additional investigation.

(e) Replay memory: We initialize the RM with M tuples (s, a, s′, r) generated by simulating the system

response under random actions, where M is the batch size used for training the networks. During initial learning,

having a large RM is essential as it gives the DRL controller access to the data generated from the old policy. Once

the actor and critic networks are trained, and the system attains steady-state, the RM need not be updated as

the input-output data no longer contributes new information to the RM.

(f) Learning: The DRL controller learns the policy in real-time and continues to refine it as new experiences are

collected. To reduce the computational burden of updating the actor and critic networks at each sampling time,

it is recommended to ‘turn-off’ learning once the set-point tracking is complete. For example, if the difference

between the output and the set-point is within a certain predefined threshold, the networks need not be updated.

We simply terminate the episode once the DRL controller has tracked the set-point for 5 consecutive time steps.

(g) Exploration: Once the actor and critic networks are trained, it is recommended that the agent stops

exploring the state and action spaces. This is because adding exploration noise to a trained network adds

unnecessary noise to the control actions as well as the system outputs.
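
Guidelines (a) and (b) above can be sketched in code as follows: the first function implements the reward variant in (34), and the small class tracks the early-termination rule (200 time steps, or 5 consecutive steps within the tolerance ε). The default value c = 1.0, the default tolerance and all names are assumptions made for this example.

```python
def tolerance_reward(y, y_sp, c=1.0, eps=0.01):
    """Reward variant (34): constant reward c inside the tolerance band, else -l1 error."""
    errors = [abs(yi - spi) for yi, spi in zip(y, y_sp)]
    if all(e <= eps for e in errors):
        return c
    return -sum(errors)


class EpisodeMonitor:
    """Tracks the termination rule described in guideline (a)."""

    def __init__(self, eps=0.01, patience=5, max_steps=200):
        self.eps = eps
        self.patience = patience
        self.max_steps = max_steps
        self.steps = 0
        self.within = 0

    def done(self, y, y_sp):
        """Record one time step and report whether the episode should end."""
        self.steps += 1
        inside = all(abs(yi - spi) <= self.eps for yi, spi in zip(y, y_sp))
        # Count consecutive steps inside the tolerance band; reset on any violation.
        self.within = self.within + 1 if inside else 0
        return self.within >= self.patience or self.steps >= self.max_steps
```

Using the same `eps` in both places follows the recommendation in guideline (b) that the reward tolerance match the early-termination tolerance.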

Simulation Results

The efficacy of the DRL controller outlined in Algorithm 5 is demonstrated on four simulation examples. The first three examples are: a single-input-single-output (SISO) paper-making machine; a multi-input-multi-output (MIMO) high-purity distillation column; and a nonlinear MIMO heating, ventilation, and air conditioning (HVAC) system. The fourth example evaluates the robustness of Algorithm 5 to process changes.

Example 1: Paper Machine

In the pulp and paper industry, a paper machine is used to form the paper sheet and then remove water from

the sheet by various means, such as vacuuming, pressing, or evaporating. The paper machine is divided into two

main sections: wet-end and dry-end. The sheet is formed in the wet-end on a continuous synthetic fabric, and

the dry-end of the machine removes the water in the sheet through evaporation. The paper passes over rotating

steel cylinders, which are heated by super-heated pressurized steam. Effective control of the moisture content in

the sheets is an active area of research.

Figure 5: Simulation results for Example 1 – (a) moving average with window size 21 (the number of distinct set-points used in training) for the total reward per episode; (b) tracking performance of the DRL controller on a sinusoidal reference signal; and (c) the input signal generated by the DRL controller.

In this section, we design a DRL controller to control the moisture content in the sheet, denoted by yt (in

percentage), to a desired set-point, ysp. There are several variables that affect the drying time of the sheet, such

as machine speed, steam pressure, drying fabric, etc. For simplicity, we only consider the effect of steam flow-rate,

denoted by at (in m³/hr), on yt. For simulation purposes, we assume that at and yt are related according to the following discrete-time transfer function

$$G(z) = \frac{0.05 z^{-1}}{1 - 0.6 z^{-1}}. \tag{35}$$

Note that (35) is strictly used for generating episodes, and is not used in the design of the DRL controller. In

fact, the user has complete freedom to substitute (35) with any linear or nonlinear process model, in case of

simulations; or with the actual data from the paper machine, in case of industrial implementation. Next, we

implement Algorithm 5 with the hyper-parameters listed in Tables 1–2 and additional specifications discussed at

the end of the example.
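
For generating episodes, the transfer function in (35) is equivalent to the difference equation yt = 0.6 yt−1 + 0.05 at−1. The sketch below simulates this noise-free recursion; the function name and the zero initial output are assumptions of this example.

```python
def simulate_paper_machine(actions, y0=0.0):
    """Simulate (35) as y_t = 0.6 * y_{t-1} + 0.05 * a_{t-1}, without measurement noise."""
    y = y0
    outputs = []
    for a in actions:
        y = 0.6 * y + 0.05 * a
        outputs.append(y)
    return outputs
```

The steady-state gain of this model is 0.05/(1 − 0.6) = 0.125, so a constant steam flow-rate of 40 settles at a moisture content of 5.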

Figure 6: Simulation results for Example 1 – (a) tracking performance of the DRL controller on randomly generated set-points outside of the training interval [0, 10]; (b) the input signal generated by the DRL controller.

The generated data is then used for real-time training of the actor and critic networks in the DRL controller.

Figure 5(a) captures the increasing trend of the cumulative reward per episode by showing the moving average

across the number of set-points used in training. From Figure 5(a), it is clear that as the DRL controller interacts

with the process, it learns a progressively better policy, leading to higher rewards per episode. In less than 100

episodes, the DRL controller is able to learn a policy suitable for basic set-point tracking; it was able to generalize

effectively in the remaining 400 episodes, as illustrated in Figures 5(b)-(c) and 6.

Figures 5(b)-(c) showcase the DRL controller, putting aside the physical interpretation of our system (35).

We highlight the smooth tracking performance of the DRL controller on a sinusoidal reference signal in Figure

5(b), as this exemplifies the responsiveness and robustness of the DRL controller to tracking problems far different

from the fixed collection of set-points in the interval [0, 10] seen during training.

To illustrate this further, observe in Figure 6(a) that the DRL controller is able to track randomly generated

set-points with little overshoot, both inside and outside the training output space Y. Finally, noting that our

action space for training is [0, 100], Figure 6(b) shows the DRL controller is operating outside the action space

to track the largest set-point.

Table 2: Specifications for Example 1.

Episodes                          500
Mini-batch size          M        128
Replay memory size       K        5 × 10^4
Reward discount factor   γ        0.99
Action space             A        [0, 100]
Output space             Y        [0, 10]

Further implementation details: We define the states according to (26) and select the ℓ1-reward function in (27). Concretely,

$$r(s_t, a_t, s_{t+1}) = -|y_t - y_{sp}|, \tag{36}$$

where ysp is uniformly sampled from {0, 0.5, 1, . . . , 9.5, 10} at the beginning of each episode. We added zero-mean Gaussian noise with variance 0.01 to the outputs during training. We defined the early-termination tolerance for the episodes to be ε = 0.01.

Example 2: High-purity Distillation Column

In this section, we consider a two-product distillation column as shown in Figure 7. The objective of the distillation column is to split the feed, which is a mixture of a light and a heavy component, into a distillate product, y1,t, and a bottom product, y2,t. The distillate contains most of the light component and the bottom product contains most of the heavy component. The column in Figure 7 has multiple manipulated variables; however, for simplicity, we only consider two inputs: the boilup rate, u1,t, and the reflux rate, u2,t. Note that the boilup and reflux rates have an immediate effect on the product compositions. The distillation column shown in Figure 7 has complex nonlinear dynamics; however, for simplicity, we linearize the model around the steady-state value. The transfer function for the distillation column is given as follows66

G(s) =

0.878

τs+ 1− 0.864

τs+ 1

1.0819

τs+ 1−1.0958

τs+ 1

, where τ > 0. (37)

Further, to include measurement uncertainties, we add zero-mean Gaussian noise with variance 0.01 to (37). The authors in [66] showed that a distillation column with a model representation in (37) is ill-conditioned. In other words, the plant gain strongly depends on the input direction, such that inputs in directions corresponding to high plant gains are strongly amplified by the plant, while inputs in directions corresponding to low plant gains are not. The issues around controlling ill-conditioned processes are well known to the community [67,68].
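The directional dependence of the gain can be made concrete with a singular value decomposition of the steady-state gain matrix G(0) of (37). The following NumPy sketch is an illustration only; the variable names are ours. Since every entry of (37) shares the denominator τs + 1, the ratio of singular values is the same at every frequency.

```python
import numpy as np

# Steady-state gain matrix G(0) of (37): at s = 0 each denominator equals 1.
K = np.array([[0.878, -0.864],
              [1.0819, -1.0958]])

# Singular values are returned in descending order.
U, sigma, Vt = np.linalg.svd(K)

# Ratio of the largest to the smallest gain over all input directions.
condition_number = sigma[0] / sigma[1]

# Unit input directions with the highest and lowest plant gain.
high_gain_input = Vt[0]
low_gain_input = Vt[1]

print("singular values:", sigma)
print("condition number:", condition_number)
print("high-gain input direction:", high_gain_input)
print("low-gain input direction:", low_gain_input)
```

For these gains the condition number is on the order of 10^2, which is what makes simultaneous control of both compositions difficult.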

Next, we implement Algorithm 5 with the reward hypothesis in (27). Figures 8(a) and (b) show the performance of the DRL controller in tracking the distillate and the bottom product to desired set-points for a variety of random initial starting points. Observe that, starting from some initial condition, the DRL controller successfully tracks the set-points within t = 75 minutes. Figures 8(c) and (d) show the boilup and reflux rates selected by the DRL controller, respectively. This example clearly highlights the efficacy of the DRL controller in set-point tracking for a challenging, ill-conditioned MIMO system. Finally, it is instructive to highlight that the model in (37) is used strictly for generating episodes and is not used by the DRL controller. The user is free to replace (37) either with a complex non-linear model or with actual plant data.

Figure 7: A schematic of a two-product distillation column in Example 2. Adapted from [66].

[Schematic labels: feed, overhead vapor, condenser, condenser holdup, reflux (u2,t), distillate (y1,t), reboiler, reboiler holdup, boilup (u1,t), bottom product (y2,t).]

Further implementation details: First, note that our system (37) needs to be discretized when implementing Algorithm 5. We used the ℓ1-reward hypothesis given in (27) and defined the RL state as in (26). To speed up learning, it is important to determine which pairs of set-points make sense in practice. To address this, we refer to Figure 9. We uniformly sample constant input signals from [0, 50] × [0, 50] and let the MIMO system (37) settle for 50 time steps. Figure 9(a) shows these settled outputs. Clearly, we have very little flexibility when initializing feasible set-points in Algorithm 5. Therefore, we only select set-point pairs (y1,sp, y2,sp) where y1,sp, y2,sp ∈ {0, 0.5, 1, . . . , 4.5, 5} and |y1,sp − y2,sp| ≤ 0.5. Finally, Figure 9(a) shows the outputs in a restricted subset of the plane, [−5, 10] × [−5, 10]; the outputs can far exceed this region even when sampling action pairs within the action space. A quick way to initialize an episode such that the outputs begin around the desired output space, [0, 5] × [0, 5], is to initialize action pairs according to a linear regression such as the one shown in Figure 9(b); we also added zero-mean Gaussian noise with variance 1 to this line. We reiterate that these steps are implemented purely to speed up learning, as the DRL controller will saturate and accumulate a large amount of negative reward when tasked with set-points outside the scope of Figure 9(a). Further, this step does not solely rely on the process model (37), as similar physical insights could be inferred from real plant data or by studying the reward signals associated with each possible set-point pair as Algorithm 5 progresses. Figure 11 at the end of Example 3 demonstrates a possible approach to determining feasible set-points and effective hyper-parameters.
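The set-point screening procedure described above can be sketched as follows. This is a minimal illustration, not the implementation used to produce Figure 9: the time constant `TAU` and sampling interval `DT` are assumed values (the text only states τ > 0), the zero-order-hold discretization is one common choice, and all function names are hypothetical.

```python
import numpy as np

K = np.array([[0.878, -0.864],
              [1.0819, -1.0958]])   # gains of (37)
TAU = 5.0   # ASSUMED time constant; the source only states tau > 0
DT = 1.0    # ASSUMED sampling interval


def settle(u, steps=50, tau=TAU, dt=DT):
    """Hold the input pair u constant and return the outputs after `steps` samples.

    Every entry of (37) shares the pole -1/tau, so G(s) = K / (tau s + 1) can be
    realized with two first-order states x (one per input) and y = K x.
    """
    a = np.exp(-dt / tau)              # exact zero-order-hold discretization
    x = np.zeros(2)
    for _ in range(steps):
        x = a * x + (1.0 - a) * u
    return K @ x


def feasible_setpoints(grid=np.arange(0.0, 5.5, 0.5), max_gap=0.5):
    """Set-point pairs on the grid {0, 0.5, ..., 5} with |y1 - y2| <= max_gap."""
    return [(y1, y2) for y1 in grid for y2 in grid if abs(y1 - y2) <= max_gap]


# Sample constant action pairs uniformly from [0, 50] x [0, 50] and settle.
rng = np.random.default_rng(0)
actions = rng.uniform(0.0, 50.0, size=(1000, 2))
outputs = np.array([settle(u) for u in actions])

# Keep the action pairs whose settled outputs both lie in [0, 5].
inside = np.all((outputs >= 0.0) & (outputs <= 5.0), axis=1)

# Least-squares line u2 = slope * u1 + intercept through those action pairs.
slope, intercept = np.polyfit(actions[inside, 0], actions[inside, 1], 1)

print(len(feasible_setpoints()), int(inside.sum()), slope, intercept)
```

Only a small fraction of the sampled action pairs settles inside [0, 5] × [0, 5], which is consistent with the limited flexibility noted above; initial actions for an episode could then be drawn near the fitted line.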



Table 3: Specifications for Example 2.

Hyper-parameter          Symbol   Value
Episodes                          5,000
Mini-batch size          M        128
Replay memory size       K        5 × 10^5



Figure 8: Simulation results for Example 2 – (a) and (b) set-point tracking for distillate and bottom product, respectively, where the different colors correspond to different starting points; (c) and (d) boilup and reflux rates selected by the DRL controller, respectively.

Figure 9: Set-point selection for Example 2 – (a) the distribution of settled outputs after simulating the MIMO system (37) with uniformly sampled actions in [0, 50] × [0, 50]; (b) the corresponding action pairs such that the settled outputs are both within [0, 5], along with a linear regression of these samples.

Example 3: HVAC System

Despite advances in research on HVAC control algorithms, most field equipment is still controlled using classical methods, such as hysteresis (on/off) and proportional-integral-derivative (PID) controllers. Although popular, these classical methods do not perform optimally. The high thermal inertia of buildings induces large time delays in the building dynamics, which simple on/off controllers cannot handle efficiently. Furthermore, because building dynamics are highly non-linear and subject to uncertainties such as weather and energy pricing, PID controllers require extensive re-tuning or auto-tuning capabilities [69], which increases the difficulty and complexity of the control problem [70].

Due to these challenging aspects of HVAC control, various advanced control methods have been investigated, ranging from non-linear model predictive control [71] to optimal control [72]. For all of these methods, the quality of control performance relies heavily on accurate process identification and modelling; however, large variations exist in building design, zone layout, long-term dynamics and operating conditions. In addition, large disturbance effects from external weather, occupancy schedule changes and varying energy prices make process identification a very challenging problem [70].

Figure 10: Simulation results for Example 3 – (a) the response of the discharge air temperature to the set-point change and (b) the control signal output calculated by the DRL controller.

In this section, we demonstrate the efficacy of the proposed model-free DRL controller in controlling the HVAC system. First, to generate episodes, we assume that the HVAC system can be modelled as follows [73]:

$$
\begin{aligned}
T_{wo,t} &= T_{wo,t-1} + \theta_1 f_{w,t-1}\,(T_{wi,t-1} - T_{wo,t-1}) + (\theta_2 + \theta_3 f_{w,t-1} + \theta_4 f_{a,t-1})\,[T_{ai,t-1} - T_{w,t-1}], && \text{(38a)} \\
T_{ao,t} &= T_{ao,t-1} + \theta_5 f_{a,t-1}\,(T_{ai,t-1} - T_{ao,t-1}) + (\theta_6 + \theta_7 f_{w,t-1} + \theta_8 f_{a,t-1})\,(T_{w,t-1} - T_{ai,t-1}) + \theta_9\,(T_{ai,t} - T_{ai,t-1}), && \text{(38b)} \\
T_{w,t} &= 0.5\,[T_{wi,t} + T_{wo,t}], && \text{(38c)}
\end{aligned}
$$

where Two ∈ R+ and Tao ∈ R+ are the outlet water and discharge air temperatures (in °C), respectively; Twi ∈ R+ and Tai ∈ R+ are the inlet water and air temperatures (in °C), respectively; fw ∈ R+ and fa ∈ R+ are the water and air mass flow rates (in kg/s), respectively; and θi ∈ R for i = 1, . . . , 9 are various physical parameters, with nominal values given in [73]. The water flow rate, fw, is assumed to be related to the controller action as follows:

$$
f_{w,t} = \theta_{10} + \theta_{11} a_t + \theta_{11} a_t^2 + \theta_{12} a_t^3, \tag{39}
$$

where at is the 12-bit control signal and θj, for j = 10, 11, 12, are valve model parameters given in [73].

In (38a)–(38c), Twi, Tai and fa are disturbance variables that account for the set-point changes and other external disturbances. Furthermore, Twi, Tai and fa are assumed to follow constrained random-walk models, such that

$$
0.6 \le f_{a,t} \le 0.9; \quad 73 \le T_{wi,t} \le 81; \quad 4 \le T_{ai,t} \le 10, \tag{40}
$$

for all t ∈ N. For simplicity, we assume that the controlled variable is Tao and the manipulated variable is fw. The objective is to design a DRL controller that achieves a desired discharge air temperature, yt ≡ Tao,t, by manipulating at.
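To make the episode-generation model concrete, the sketch below advances (38a)–(38c) by one sample and draws the disturbances from a bounded random walk as in (40). It is an illustration only: the nominal parameter values from [73] are not reproduced here, so `theta` must be supplied by the user; enforcing the bounds by clipping and the step size `step_std` are our assumptions; and all function names are hypothetical.

```python
import random

# Disturbance bounds from (40).
BOUNDS = {"f_a": (0.6, 0.9), "T_wi": (73.0, 81.0), "T_ai": (4.0, 10.0)}


def random_walk_step(value, low, high, step_std, rng):
    """One random-walk step, clipped to [low, high] (clipping is an assumption)."""
    return min(high, max(low, value + rng.gauss(0.0, step_std)))


def valve_flow(a, th10, th11, th12):
    """Water flow rate (39); theta_11 multiplies both a and a**2, as printed."""
    return th10 + th11 * a + th11 * a ** 2 + th12 * a ** 3


def hvac_step(T_wo, T_ao, f_w, f_a, T_wi, T_ai, T_ai_next, theta):
    """Advance (38a)-(38b) by one sample.

    All arguments except T_ai_next are values at time t-1; theta[1..9] hold
    theta_1..theta_9 (theta[0] is unused so indices match the equations).
    """
    T_w = 0.5 * (T_wi + T_wo)                                   # (38c) at t-1
    T_wo_next = (T_wo
                 + theta[1] * f_w * (T_wi - T_wo)
                 + (theta[2] + theta[3] * f_w + theta[4] * f_a) * (T_ai - T_w))
    T_ao_next = (T_ao
                 + theta[5] * f_a * (T_ai - T_ao)
                 + (theta[6] + theta[7] * f_w + theta[8] * f_a) * (T_w - T_ai)
                 + theta[9] * (T_ai_next - T_ai))
    return T_wo_next, T_ao_next


# Example: evolve the inlet air temperature disturbance for a few steps.
rng = random.Random(0)
T_ai = 7.0
for _ in range(5):
    T_ai = random_walk_step(T_ai, *BOUNDS["T_ai"], step_std=0.2, rng=rng)
print(T_ai)
```

In a training loop, the controller action at would first be mapped to fw through `valve_flow`, and the resulting Tao would be returned to the agent as the measured output yt.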

Next, we implement Algorithm 5 to train the DRL controller for the HVAC system. Using the ℓ1-reward hypothesis in (27), we observed a persistent offset in the response of Tao to set-point changes. Despite increasing the episode length, the DRL controller was still unable to eliminate the offset with (27). We then implemented Algorithm 5 with the polar reward hypothesis in (28). With (28), the response to control actions was found to be less noisy and offset-free; that is, Algorithm 5 performed much better with (28) than with (27) in terms of noise and offset reduction. Figure 10(a) shows the response of the discharge air temperature to a time-varying set-point change. Observe that the DRL controller is able to track the set-point changes in less than 50 seconds with small offset. Moreover, the response to the change in control action with (28) is smooth. Finally, the control signal generated by the DRL controller is shown in Figure 10(b). This example demonstrates the efficacy of Algorithm 5 in learning the control policy of complex nonlinear processes simply by interacting with the process in real time.



Table 4: Specifications for Example 3.

Hyper-parameter          Symbol   Value
Episodes                          100,000
Mini-batch size          M        64
Replay memory size       K        10^6


Further implementation details: The training run listed in Table 4, which was used to generate Figure 10, is exceedingly long. By contrast, Figure 11 shows that the DRL controller was able to learn to track a fixed set-point, ysp = 30, in fewer than 1000 episodes. In general, training the DRL controller with a single set-point is a reasonable starting point for narrowing the parameter search when implementing Algorithm 5.

Example 4: Robustness to Process Changes

In this section, we evaluate the robustness of the DRL controller to abrupt process changes. We revisit the example from the pulp and paper industry in Example 1.

Figure 11: Snapshots during training of the input and output signals – (a) progression of set-point tracking performance of the DRL controller; (b) the respective controller actions taken by the DRL controller.

We run Algorithm 5 continuously with the same specifications as in Example 1 for a fixed set-point ysp = 2. As shown in Figures 12(a) and (b), in the approximate interval 0 ≤ t ≤ 1900, the DRL agent is learning the control policy. Then, at around t = 1900, we turn off the exploration noise and stop training the actor and critic networks because the moving average of the errors |yt − 2| across the past four time steps is sufficiently small (less than 0.0001 in our case). Next, at around t = 2900, a sudden process change is introduced by doubling the process gain in (35). Observe that as soon as the process change is introduced, the policy learned by the DRL controller is no longer valid. Consequently, the system starts to deviate from the set-point. However, since the DRL controller learns in real time, as soon as the process change is introduced, the controller starts re-learning a new policy. Observe that for the same set-point ysp = 2, the DRL controller took about 400 seconds to learn the new policy after the model change was introduced (see Figures 12(a) and (b) for t ≥ 2900). This demonstrates how the DRL controller first learns the process dynamics and then learns the control policy, both in real time and without access to a priori process information.
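The stopping rule used above (a moving average of the absolute tracking error over the past four time steps falling below 0.0001) can be sketched as a small helper. The class and method names are hypothetical, and the decision to report convergence only once the window is full is our assumption.

```python
from collections import deque


class ConvergenceMonitor:
    """Tracks the moving average of |y_t - y_sp| over a fixed window."""

    def __init__(self, setpoint, window=4, threshold=1e-4):
        self.setpoint = setpoint
        self.threshold = threshold
        self.errors = deque(maxlen=window)   # oldest error is dropped automatically

    def update(self, y):
        """Record a new output; return True once the averaged error is small enough."""
        self.errors.append(abs(y - self.setpoint))
        window_full = len(self.errors) == self.errors.maxlen
        mean_error = sum(self.errors) / len(self.errors)
        return window_full and mean_error < self.threshold


monitor = ConvergenceMonitor(setpoint=2.0)
flags = [monitor.update(y) for y in [2.5, 2.0, 2.0, 2.0, 2.0]]
print(flags)
```

When `update` returns True, exploration noise and network updates could be switched off; the same monitor could re-enable them once the averaged error grows again after a process change.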

Figure 12: Simulation results for Example 4 – (a) set-point tracking performance of the DRL controller under abrupt process changes; (b) the controller action taken by the DRL controller.

Comparison with Model Predictive Control

Model Predictive Control (MPC) is well-established and widely deployed. The proposed DRL controller has a

number of noteworthy differences.


(a) Process model: An accurate process model is a key ingredient in any model predictive controller. MPC relies on model-based predictions and model-based optimization to compute the control action at each sampling time [74]. In contrast, the DRL controller does not require any a priori access to a process model: it develops the control policy in real time by interacting with the process.

(b) Constraints: Industrial control systems are subject to operational and environmental constraints that need to be satisfied at all sampling times. MPC handles these constraints by explicitly incorporating them into the optimization problem solved at each step [75,76]. In the DRL controller, constraints can be embedded directly in the reward function or enforced through gradient-clipping (see (33)). The reward-based approach is softer: constraint violation may occur during training, but it is discouraged by triggering an enormous penalty in the reward signal. Robust mechanisms for safe operation of a fully trained system, and indeed, for safe operation during online training, are high priorities for further investigation.

(c) Prediction horizon: MPC refers to a user-specified prediction horizon when optimizing a process’s future

response. The duration of the MPC’s planning horizon determines the influence of future states; it may be either

finite or infinite, depending on the application. The corresponding adjustment in the DRL controller comes

through the discount factor γ ∈ [0, 1] used to determine the present value of future rewards.

(d) State estimation: The performance of MPC relies on an accurate estimate of the hidden process states over the length of the horizon. The state estimation problem is typically formulated as a filtering problem [77,78], with Kalman filtering providing the standard tool for linear processes, while nonlinear systems remain an active area of research [79,80]. In this paper, we consider the DRL controller for cases where the full system state is available for policy improvement. Extending the design to include filtering and/or observer design will be the subject of future work.

(e) Adaptation: Practical control systems change over time. Maintaining acceptable performance requires responding to both changes and unknown disturbances. Traditional MPC-based systems include mechanisms for detecting model-plant mismatch and responding as appropriate, typically by initiating interventions in the process to permit re-identification of the process model. This process, which calls for simultaneous state and parameter estimation, can be both challenging and expensive [81,82,83]. Some recent variants of MPC, such as robust MPC and stochastic MPC, take model uncertainties into account by embedding them directly into the optimization problem [76]. The DRL controller, on the other hand, updates the parameters in the actor and critic networks at each sampling time using the latest experiences. This gives the DRL controller a self-tuning aspect, so that process operations remain nearly optimal with respect to the selected reward criterion.
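Items (b) and (c) above can be illustrated with a short Python sketch: a soft output constraint expressed as a reward penalty, and the discounted return that plays the role of the prediction horizon. This is a sketch under assumptions, not the implementation used in the examples: the penalty magnitude and the function names are hypothetical, and the quantity 1/(1 − γ) is only a common rule of thumb for the number of future steps that contribute appreciably to the return.

```python
def penalized_reward(y, y_sp, y_min, y_max, penalty=1000.0):
    """l1 tracking reward with a large penalty when y leaves [y_min, y_max].

    The penalty value is a hypothetical choice; the constraint is soft, so
    violations remain possible but are strongly discouraged.
    """
    reward = -abs(y - y_sp)
    if y < y_min or y > y_max:
        reward -= penalty
    return reward


def discounted_return(rewards, gamma):
    """Sum of gamma**k * r_k, accumulated backwards for numerical simplicity."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total


def effective_horizon(gamma):
    """Rule of thumb for gamma in [0, 1): roughly 1 / (1 - gamma) steps matter."""
    return 1.0 / (1.0 - gamma)


print(penalized_reward(11.0, 10.0, 0.0, 10.0))
print(discounted_return([1.0, 1.0, 1.0], 0.5))
print(effective_horizon(0.99))
```

With γ = 0.99, the value used in Table 2, this rule of thumb suggests that rewards on the order of a hundred steps ahead still influence the policy, whereas a smaller γ shortens the effective look-ahead.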


Limitations

In the decades since MPC was first proposed, its theoretical foundations have become well established. Strong convergence and stability proofs are available for several MPC formulations, covering both linear and nonlinear systems [84]. At this comparatively early stage, the DRL controller comes with no such optimality guarantees. Indeed, the assumptions of discrete-time dynamics and full state observation are both obvious challenges demanding further investigation and development. (For example, while the temporal difference error for a discrete-time system can be computed in a model-free environment, the TD error formulation for a continuous-time system requires complete knowledge of its dynamics [85].) Other issues of interest include the role of data requirements, computational complexity, non-convexity, over- and under-fitting, performance improvement, hyper-parameter selection and initialization.
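The remark about the discrete-time temporal difference (TD) error can be illustrated with the standard one-step form, δ = r + γQ(s′, a′) − Q(s, a). The sketch below is generic rather than specific to Algorithm 5; the function name and the handling of terminal transitions are our assumptions. It only needs a sampled reward and value estimates, not the process model.

```python
def td_error(reward, q_current, q_next, gamma, terminal=False):
    """One-step TD error computed from a sampled transition.

    reward    : observed reward r for the transition (s, a) -> s'
    q_current : critic estimate Q(s, a)
    q_next    : critic estimate Q(s', a') at the next state and action
    terminal  : if True, no bootstrapped value is added to the target
    """
    target = reward if terminal else reward + gamma * q_next
    return target - q_current


print(td_error(-1.0, -5.0, -4.0, 0.5))
print(td_error(-1.0, -5.0, -4.0, 0.99, terminal=True))
```

Every quantity in this expression is either measured from the process or produced by the critic, which is why no knowledge of the dynamics is required in the discrete-time setting.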

Conclusions

We have developed an adaptive deep reinforcement learning (DRL) controller for set-point tracking problems

in discrete-time nonlinear processes. The DRL controller is a data-based controller that combines reinforcement learning and deep learning. As a model-free controller, this DRL controller learns the control policy

in real time, simply by interacting with the process. The efficacy of the DRL controller has been demonstrated

in set-point tracking problems on a SISO, a MIMO and a non-linear system with external disturbances. The

DRL controller shows significant promise as an alternative to traditional model-based industrial controllers, even

though some practical and theoretical challenges remain to be overcome. As both computing power and data

volumes continue to increase, DRL controllers have the potential to become an important tool in process control.

Acknowledgements

We would like to thank MITACS for financial support through the Globalink Graduate Fellowship. The first

author would like to thank Kevin and Zilun for helpful discussions.

Basics of Neural Networks

An artificial neural network (NN) is a mapping F : U → Y from the input space U ⊆ R^{n_u} to the output space Y ⊆ R^{n_y} whose internal structure is inspired by natural networks of neurons, connected in several “layers”. Each layer defines a simple input-output relation of the form

y = f(Wu), (.1)

where W ∈ R^{n_y × n_u} is a weight matrix carrying the vector space of inputs to the vector space of outputs, and f is a (typically nonlinear) “activation function” acting elementwise on its input. Common activation functions include

$$
f(u) = \sigma(u) := \frac{1}{1 + \exp(-u)}, \quad
f(u) = \tanh(u) := \frac{\exp(2u) - 1}{\exp(2u) + 1}, \quad
f(u) = \mathrm{ReLU}(u) := \max\{0, u\}. \tag{.2}
$$

The weight matrices and activation functions may differ from layer to layer, so that a 3-layer NN model could

have the general form

$$
y = W_4 f_3(W_3 f_2(W_2 f_1(W_1 u))). \tag{.3}
$$

(Choosing the identity for the final activation function is quite common.) Different sparsity patterns in the weight

matrices Wk allow for different connections between neurons in successive layers. Completing the definition of F

requires selecting the elements of the weight matrices Wk: an optimization problem is formulated to minimize the

discrepancy between the input-output pairs produced by the mapping F and a collection of input-output pairs

the network is expected to “learn”. Typically the dimensions ny and nu are very large, so the number of choice

variables in this optimization is enormous. Some form of gradient descent is the only practical scheme for the

iterative solution of the problem, a process evocatively called training the NN.
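The layer composition in (.3) and the activation functions in (.2) can be written out directly in NumPy. The sketch below is illustrative only: the layer sizes and the random weights are hypothetical, and no training is performed.

```python
import numpy as np


def sigmoid(u):
    """Logistic function from (.2)."""
    return 1.0 / (1.0 + np.exp(-u))


def tanh(u):
    """Hyperbolic tangent written as in (.2)."""
    return (np.exp(2.0 * u) - 1.0) / (np.exp(2.0 * u) + 1.0)


def relu(u):
    """Rectified linear unit from (.2), applied elementwise."""
    return np.maximum(0.0, u)


def forward(u, weights, activations):
    """Evaluate y = W4 f3(W3 f2(W2 f1(W1 u))) as in (.3).

    `weights` holds one more matrix than `activations`, because the final
    layer uses the identity activation.
    """
    h = u
    for W, f in zip(weights[:-1], activations):
        h = f(W @ h)
    return weights[-1] @ h


# Hypothetical sizes: 3 inputs, three hidden layers of width 8, 2 outputs.
rng = np.random.default_rng(0)
sizes = [3, 8, 8, 8, 2]
weights = [rng.standard_normal((n_out, n_in))
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]
y = forward(np.array([0.1, -0.2, 0.3]), weights, [relu, tanh, sigmoid])
print(y)
```

Training would then adjust the entries of `weights` by some form of gradient descent so that the outputs of `forward` match the desired input-output pairs.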


List of Figures

Fig. 1: A schematic of the agent-environment interactions in a standard RL problem.

Fig. 2: A schematic of the actor-critic architecture.

Fig. 3: A deep neural network representation of (a) the actor, and (b) the critic. The red circles represent the

input and output layers and the black circles represent the hidden layers of the network.

Fig. 4: A schematic of the proposed DRL controller architecture for set-point tracking problems.

Fig. 5: Simulation results for Example 1 – (a) moving average with window size 21 (the number of distinct set-

points used in training) for the total reward per episode; (b) tracking performance of the DRL controller on a

sinusoidal reference signal; and (c) the input signal generated by the DRL controller.

Fig. 6: Simulation results for Example 1 – (a) tracking performance of the DRL controller on randomly generated

set-points outside of the training interval [0, 10]; (b) the input signal generated by the DRL controller.

Fig. 7: A schematic of a two-product distillation column in Example 2. Adapted from [66].

Fig. 8: Simulation results for Example 2 – (a) and (b) set-point tracking for distillate and bottom product,

respectively, where the different colors correspond to different starting points; (c) and (d) boilup and reflux rates

selected by the DRL controller, respectively.

Fig. 9: Set-point selection for Example 2 – (a) the distribution of settled outputs after simulating the MIMO system (37) with uniformly sampled actions in [0, 50] × [0, 50]; (b) the corresponding action pairs such that the settled outputs are both within [0, 5], along with a linear regression of these samples.

Fig. 10: Simulation results for Example 3 – (a) the response of the discharge air temperature to the set-point

change and (b) the control signal output calculated by the DRL controller.

Fig. 11: Snapshots during training of the input and output signals – (a) progression of set-point tracking performance of the DRL controller; (b) the respective controller actions taken by the DRL controller.

Fig. 12: Simulation results for Example 4 – (a) set-point tracking performance of the DRL controller under abrupt

process changes; (b) the controller action taken by the DRL controller.


References

[1] Tulsyan A, Garvin C, Undey C. Machine-learning for biopharmaceutical batch process monitoring with limited data. IFAC-

PapersOnLine 2018;51(18):126–131.

[2] Tulsyan A, Garvin C, Undey C. Advances in industrial biopharmaceutical batch process monitoring: Machine-learning methods

for small data problems. Biotechnology and bioengineering 2018;115(8):1915–1924.

[3] Tulsyan A, Garvin C, Undey C. Industrial batch process monitoring with limited data. Journal of Process Control 2019;77:114–

133.

[4] Kano M, Ogawa M. The state of the art in advanced chemical process control in Japan. IFAC Proceedings Volumes

2009;42(11):10–25.

[5] Tulsyan A, Khare SR, Huang B, Gopaluni RB, Forbes JF. Bayesian identification of non-linear state-space models: Part I-Input

design. IFAC Proceedings Volumes 2013;46(32):774–779.

[6] Seborg DE, Edgar TF, Shah SL. Adaptive control strategies for process control: a survey. AIChE Journal 1986;32(6):881–913.

[7] Krstic M, Kanellakopoulos I, Kokotovic PV. Nonlinear and Adaptive Control Design. John Wiley & Sons, New York; 1995.

[8] Mailleret L, Bernard O, Steyer JP. Nonlinear adaptive control for bioreactors with unknown kinetics. Automatica

2004;40(8):1379–1385.

[9] Guan C, Pan S. Adaptive sliding mode control of electro-hydraulic system with nonlinear unknown parameters. Control

Engineering Practice 2008;16(11):1275–1284.

[10] Spielberg SPK, Gopaluni RB, Loewen PD. Deep reinforcement learning approaches for process control. In: Proceedings of the

6th IFAC Symposium on Advanced Control of Industrial Processes Taipei, Taiwan; 2017. p. 201–206.

[11] Sutton RS, Barto AG. Reinforcement learning: An introduction. MIT Press, Cambridge; 1998.

[12] Bertsekas DP, Tsitsiklis JN. Neuro-dynamic programming: an overview. In: Proceedings of the 34th IEEE Conference on

Decision and Control, vol. 1 IEEE Publ. Piscataway, NJ; 1995. p. 560–564.

[13] Bertsekas DP. Dynamic Programming and Optimal Control. Athena Scientific, MA; 2005.

[14] Powell WB. Approximate Dynamic Programming: Solving the Curses of Dimensionality. John Wiley & Sons, New York; 2007.

[15] Sugiyama M. Statistical Reinforcement Learning: Modern Machine Learning Approaches. CRC Press, Florida; 2015.

[16] Bellman R. Dynamic Programming. Dover Publications, New York; 2003.

[17] Lehnert L, Precup D. Policy Gradient Methods for Off-policy Control. arXiv Preprint, arXiv:1512.04105 2015.

[18] Lillicrap TP, Hunt JJ, Pritzel A, Heess N, Erez T, Tassa Y, et al. Continuous control with deep reinforcement learning. arXiv Preprint, arXiv:1509.02971 2015.

[19] Mnih V, Kavukcuoglu K, Silver D, Graves A, Antonoglou I, Wierstra D, et al. Playing Atari with deep reinforcement learning. arXiv Preprint, arXiv:1312.5602 2013.

[20] Mnih V, Kavukcuoglu K, Silver D, Rusu AA, Veness J, Bellemare MG, et al. Human-level control through deep reinforcement

learning. Nature 2015;518:529–533.

[21] Pednault E, Abe N, Zadrozny B. Sequential cost-sensitive decision making with reinforcement learning. In: Proceedings of the

8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining Edmonton, Canada; 2002. p. 259–268.

[22] Tesauro G. Temporal difference learning and TD-Gammon. Communications of the ACM 1995;38(3):58–68.

[23] Silver D, Huang A, Maddison CJ, Guez A, Sifre L, Van Den Driessche G, et al. Mastering the game of Go with deep neural

networks and tree search. Nature 2016;529:484–489.


[24] Badgwell TA, Lee JH, Liu KH. Reinforcement Learning–Overview of Recent Progress and Implications for Process Control. In: Computer Aided Chemical Engineering, vol. 44. Elsevier; 2018. p. 71–85.

[25] Lee JH, Lee JM. Approximate dynamic programming based approach to process control and scheduling. Computers & Chemical

Engineering 2006;30(10-12):1603–1618.

[26] Lewis FL, Vrabie D, Vamvoudakis KG. Reinforcement learning and feedback control: using natural decision methods to design

optimal adaptive controllers. IEEE Control Systems 2012;32(6):76–105.

[27] Morinelly JE, Ydstie BE. Dual MPC with reinforcement learning. IFAC-PapersOnLine 2016;49(7):266–271.

[28] Lee JM, Kaisare NS, Lee JH. Choice of approximator and design of penalty function for an approximate dynamic programming

based control approach. Journal of process control 2006;16(2):135–156.

[29] Lee JM, Lee JH. Neuro-dynamic programming method for MPC. IFAC Proceedings Volumes 2001;34(25):143–148.

[30] Kaisare NS, Lee JM, Lee JH. Simulation based strategy for nonlinear optimal control: application to a microbial cell reactor.

International Journal of Robust and Nonlinear Control 2003;13(3-4):347–363.

[31] Lee JM, Lee JH. Simulation-based learning of cost-to-go for control of nonlinear processes. Korean Journal of Chemical

Engineering 2004;21(2):338–344.

[32] Lee JM, Lee JH. Approximate dynamic programming-based approaches for input–output data-driven control of nonlinear

processes. Automatica 2005;41(7):1281–1288.

[33] Al-Tamimi A, Lewis FL, Abu-Khalaf M. Discrete-time nonlinear HJB solution using approximate dynamic programming: convergence proof. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 2008;38(4):943–949.

[34] Si J, Wang YT. Online learning control by association and reinforcement. IEEE Transactions on Neural networks 2001;12(2):264–

276.

[35] Heydari A, Balakrishnan SN. Finite-horizon control-constrained nonlinear optimal control using single network adaptive critics.

IEEE Transactions on Neural Networks and Learning Systems 2013;24(1):145–157.

[36] Wang D, Liu D, Wei Q, Zhao D, Jin N. Optimal control of unknown nonaffine nonlinear discrete-time systems based on adaptive

dynamic programming. Automatica 2012;48(8):1825–1832.

[37] Vrabie D, Lewis F. Neural network approach to continuous-time direct adaptive optimal control for partially unknown nonlinear

systems. Neural Networks 2009;22(3):237–246.

[38] Vamvoudakis KG, Lewis FL. Online actor–critic algorithm to solve the continuous-time infinite horizon optimal control problem.

Automatica 2010;46(5):878–888.

[39] Liu D, Wang D, Li H. Decentralized stabilization for a class of continuous-time nonlinear interconnected systems using online

learning optimal control approach. IEEE Transactions on Neural Networks and Learning Systems 2014;25(2):418–428.

[40] Mu C, Wang D, He H. Novel iterative neural dynamic programming for data-based approximate optimal control design.

Automatica 2017;81:240–252.

[41] Luo B, Wu HN, Huang T, Liu D. Data-based approximate policy iteration for affine nonlinear continuous-time optimal control

design. Automatica 2014;50(12):3281–3290.

[42] Wang D, Liu D, Zhang Q, Zhao D. Data-based adaptive critic designs for nonlinear robust optimal control with uncertain

dynamics. IEEE Transactions on Systems, Man, and Cybernetics: Systems 2016;46(11):1544–1555.

[43] Tang W, Daoutidis P. Distributed adaptive dynamic programming for data-driven optimal control. Systems & Control Letters

2018;120:36–43.

[44] Krizhevsky A, Sutskever I, Hinton GE. Imagenet classification with deep convolutional neural networks. In: Proceedings of the


Advances in Neural Information Processing Systems; 2012. p. 1097–1105.

[45] Silver D, Lever G, Heess N, Degris T, Wierstra D, Riedmiller M. Deterministic policy gradient algorithms. In: Proceedings of the 31st International Conference on Machine Learning Beijing, China; 2014.

[46] Watkins CJCH. Learning from delayed rewards. PhD thesis, King’s College, Cambridge; 1989.

[47] Watkins CJ, Dayan P. Q-learning. Machine learning 1992;8(3-4):279–292.

[48] Sutton RS. Generalization in reinforcement learning: successful examples using sparse coarse coding. In: Proceedings of the

Advances in Neural Information Processing Systems Denver, USA; 1996. p. 1038–1044.

[49] Lazaric A, Ghavamzadeh M, Munos R. Finite-sample analysis of LSTD. In: Proceedings of the 27th International Conference

on Machine Learning Haifa, Israel; 2010. p. 615–622.

[50] Tsitsiklis JN, Van Roy B. Analysis of temporal-difference learning with function approximation. In: Proceedings of the Advances in Neural Information Processing Systems Denver, USA; 1997. p. 1075–1081.

[51] Baird L. Residual algorithms: Reinforcement learning with function approximation. In: Machine Learning Proceedings 1995. Elsevier; 1995. p. 30–37.

[52] Fairbank M, Alonso E. The Divergence of Reinforcement Learning Algorithms with Value-Iteration and Function Approximation. arXiv Preprint, arXiv:1107.4606 2011.

[53] Kober J, Bagnell JA, Peters J. Reinforcement learning in robotics: A survey. The International Journal of Robotics Research

2013;32(11):1238–1274.

[54] Sutton RS, McAllester DA, Singh SP, Mansour Y. Policy gradient methods for reinforcement learning with function approxi-

mation. In: Proceedings of the Advances in Neural Information Processing Systems; 2000. p. 1057–1063.

[55] Degris T, White M, Sutton RS. Off-policy actor-critic. arXiv Preprint, arXiv:1205.4839 2012.

[56] Williams RJ. Simple statistical gradient-following algorithms for connectionist reinforcement learning. In: Reinforcement Learning. Springer; 1992. p. 5–32.

[57] Riedmiller M, Peters J, Schaal S. Evaluation of policy gradient methods and variants on the cart-pole benchmark. In: Proceedings

of the International Symposium on Approximate Dynamic Programming and Reinforcement Learning Honolulu, Hawaii; 2007.

p. 254–261.

[58] Konda VR, Tsitsiklis JN. Actor-critic algorithms. In: Proceedings of the Advances in Neural Information Processing Systems

Denver, USA; 2000. p. 1008–1014.

[59] Bhatnagar S, Ghavamzadeh M, Lee M, Sutton RS. Incremental natural actor-critic algorithms. In: Proceedings of the Advances

in Neural Information Processing Systems Vancouver, Canada; 2008. p. 105–112.

[60] Hausknecht M, Stone P. Deep reinforcement learning in parameterized action space. arXiv Preprint, arXiv:1511.04143 2015.

[61] Uhlenbeck GE, Ornstein LS. On the theory of the Brownian motion. Physical review 1930;36(5):823–841.

[62] Glorot X, Bengio Y. Understanding the difficulty of training deep feedforward neural networks. Proceedings of the thirteenth

international conference on artificial intelligence and statistics 2010;1:249–256.

[63] Ioffe S, Szegedy C. Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv Preprint, arXiv:1502.03167 2015.

[64] Kingma DP, Ba J. Adam: a method for stochastic optimization. arXiv Preprint, arXiv:1412.6980 2014.

[65] Abadi M, Agarwal A, Barham P, Brevdo E, Chen Z, Citro C, et al. Tensorflow: large-scale machine learning on heterogeneous distributed systems. arXiv Preprint, arXiv:1603.04467 2016.

[66] Skogestad S, Morari M, Doyle JC. Robust control of ill-conditioned plants: high-purity distillation. IEEE Transactions on


Automatic Control 1988;33(12):1092–1105.

[67] Waller JB, Boling JM. Multi-variable nonlinear MPC of an ill-conditioned distillation column. Journal of Process Control

2005;15(1):23–29.

[68] Mollov S, Babuska R. Analysis of interactions and multivariate decoupling fuzzy control for a binary distillation column.

International Journal of Fuzzy Systems 2004;6(2):53–62.

[69] Wang YG, Shi ZG, Cai WJ. PID autotuner and its application in HVAC systems. In: Proceedings of the American Control Conference, vol. 3 Arlington, USA; 2001. p. 2192–2196.

[70] Wang Y, Velswamy K, Huang B. A Long-Short Term Memory Recurrent Neural Network Based Reinforcement Learning Controller for Office Heating Ventilation and Air Conditioning Systems. Processes 2017;5(3):1–18.

[71] Moradi H, Saffar-Avval M, Bakhtiari-Nejad F. Nonlinear multivariable control and performance analysis of an air-handling unit. Energy and Buildings 2011;43(4):805–813.

[72] Greensfelder EM, Henze GP, Felsmann C. An investigation of optimal control of passive building thermal storage with real time pricing. Journal of Building Performance Simulation 2011;4(2):91–104.

[73] Underwood DM, Crawford RR. Dynamic Nonlinear Modeling of a Hot-Water-to-Air Heat Exchanger for Control Applications. American Society of Heating, Refrigerating and Air-Conditioning Engineers; 1991.

[74] Pon Kumar SS, Tulsyan A, Gopaluni B, Loewen P. A Deep Learning Architecture for Predictive Control. In: Proceedings of the 10th IFAC Conference on the Advanced Control of Chemical Processes Shenyang, China; 2018.

[75] Wallace M, Pon Kumar SS, Mhaskar P. Offset-free model predictive control with explicit performance specification. Industrial & Engineering Chemistry Research 2016;55(4):995–1003.

[76] Lee JH, Wong W. Approximate dynamic programming approach for process control. Journal of Process Control 2010;20(9):1038–1048.

[77] Tulsyan A, Tsai Y, Gopaluni RB, Braatz RD. State-of-charge estimation in lithium-ion batteries: A particle filter approach. Journal of Power Sources 2016;331:208–223.

[78] Tulsyan A, Huang B, Gopaluni RB, Forbes JF. A particle filter approach to approximate posterior Cramér-Rao lower bound: The case of hidden states. IEEE Transactions on Aerospace and Electronic Systems 2013;49(4):2478–2495.

[79] Tulsyan A, Gopaluni RB, Khare SR. Particle filtering without tears: a primer for beginners. Computers & Chemical Engineering 2016;95(12):130–145.

[80] Tulsyan A, Huang B, Gopaluni RB, Forbes JF. Performance assessment, diagnosis, and optimal selection of non-linear state filters. Journal of Process Control 2014;24(2):460–478.

[81] Tulsyan A, Khare S, Huang B, Gopaluni B, Forbes F. A switching strategy for adaptive state estimation. Signal Processing 2018;143:371–380.

[82] Tulsyan A, Huang B, Gopaluni RB, Forbes JF. On simultaneous on-line state and parameter estimation in non-linear state-space models. Journal of Process Control 2013;23(4):516–526.

[83] Tulsyan A, Huang B, Gopaluni RB, Forbes JF. Bayesian identification of non-linear state-space models: Part II-Error Analysis. IFAC Proceedings Volumes 2013;46(32):631–636.

[84] Mayne DQ. Model predictive control: Recent developments and future promise. Automatica 2014;50(12):2967–2986.

[85] Bhasin S. Reinforcement learning and optimal control methods for uncertain nonlinear systems. University of Florida; 2011.
