Towards Self-Driving Processes: A Deep Reinforcement Learning Approach to Control · 2021. 1....
Transcript of Towards Self-Driving Processes: A Deep Reinforcement Learning Approach to Control · 2021. 1....
Towards Self-Driving Processes: A Deep Reinforcement Learning Approach toControl
Steven Spielberga, Aditya Tulsyana, Nathan P. Lawrenceb, Philip D Loewenb, R. Bhushan Gopalunia,∗
aDepartment of Chemical and Biological Engineering, University of British Columbia, Vancouver, BC V6T 1Z3, Canada.bDepartment of Mathematics, University of British Columbia, Vancouver, BC V6T 1Z2, Canada.
Abstract
Advanced model-based controllers are well established in process industries. However, such controllers require
regular maintenance to maintain acceptable performance. It is a common practice to monitor controller per-
formance continuously and to initiate a remedial model re-identification procedure in the event of performance
degradation. Such procedures are typically complicated and resource-intensive, and they often cause costly inter-
ruptions to normal operations. In this paper, we exploit recent developments in reinforcement learning and deep
learning to develop a novel adaptive, model-free controller for general discrete-time processes. The DRL controller
we propose is a data-based controller that learns the control policy in real time by merely interacting with the
process. The effectiveness and benefits of the DRL controller are demonstrated through many simulations.
Keywords: process control; model-free learning; reinforcement learning; deep learning; actor-critic networks
Introduction
Industrial process control is a large and diverse field; its broad range of applications calls for a correspondingly
wide range of controllers—including single and multi-loop PID controllers, model predictive controllers (MPCs),
and a variety of nonlinear controllers. Many deployed controllers achieve robustness at the expense of performance.
The overall performance of a controlled process depends on the characteristics of the process itself, the controller’s
overall architecture, and the tuning parameters that are employed. Even if a controller is well-tuned at the time
of installation, drift in process characteristics or deliberate set-point changes can cause performance to deteriorate
over time1,2,3. Maintaining system performance over the long-term is essential. Unfortunately, it is typically also
both complicated and expensive.
Most modern industrial controllers are model-based, so good performance calls for a high-quality process
model. It is standard practice to continuously monitor system performance and initiate a remedial model re-
identification exercise in the event of performance degradation. Model re-identification can require two weeks or
more4, and typically involves the injection of external excitations5, which introduce an expensive interruption
∗Corresponding author. Tel:+1 604 827 5668Email address: [email protected] (R. Bhushan Gopaluni)
Preprint submitted to AIChE Journal June 1, 2019
to the normal operation of the process. Re-identification is particularly complicated for multi-variable processes,
which require a model for every input-output combination.
Most classical controllers in industry are linear and non-adaptive. While extensive work has been done in
nonlinear adaptive control6,7, it has not yet established a significant position in the process industry, beyond
several niche applications8,9. The difficulty of learning (or estimating) reliable multi-variable models in an online
fashion is partly responsible for this situation. Other contributing factors include the presence of hidden states,
process dimensionality, and in some cases computational complexity.
Given the limitations of existing industrial controllers, we seek a new design that can learn the control policy
for discrete-time nonlinear stochastic processes in real time, in a model-free and adaptive environment. This paper
continues our recent investigations10 on the same topic. The idea of RL has been around for several decades;
however, its application to process control has been somewhat recent. Next, we provide a short introduction to
RL and its methods and also highlight existing RL-based approaches for control problems.
Reinforcement Learning and Process Control
Reinforcement Learning (RL) is an active area of research in artificial intelligence. It originated in computer sci-
ence and operations research to solve complex sequential decision-making problems11,12,13,14. The RL framework
comprises an agent (e.g., a controller) interacting with a stochastic environment (e.g., a plant) modelled as a
Markov decision process (MDP). The goal in an RL problem is to find a policy (or feedback controller) that is
optimal in a certain sense15.
Over the last three decades, several methods, including dynamic programming (DP), Monte Carlo (MC) and
temporal-difference learning (TD) have been proposed to solve the RL problem11. Most of these compute the
optimal policy using policy iteration. This is an iterative approach in which every step involves both policy
estimation and policy improvement. The policy estimation step aims at making the value function consistent
with the current policy; the policy improvement step makes the policy greedy with respect to the current estimate
of the value function. Alternating between the two steps produces a sequence of value function estimates and sub-
optimal policies which, in the limit, converge to the optimal value function and the optimal policy, respectively11.
The choice of a solution strategy for an RL problem is primarily driven by the assumptions on the environ-
ment and the agent. For example, under the perfect model assumption for the environment, classical dynamic
programming methods for Markov Decision Problems with finite state and action spaces) provide a closed-form
solution to the optimal value function, and are known to converge to the optimal policy in polynomial time13,16.
Despite their strong convergence properties, however, classical DP methods have limited practical applications be-
2
cause of their stringent requirement for a perfect model of the environment, which is seldom available in practical
problems.
Monte Carlo algorithms belong to a class of approximate RL methods that can be used to estimate the value
function using experiences (i.e., sample sequences of states, actions, and rewards accumulated through the agent’s
interaction with an environment). MC algorithms offer several advantages over DP. First, MC methods allow an
agent to learn the optimal behaviour directly by interacting with the environment, thereby eliminating the need
for an exact model. Second, MC methods can be focused, to estimate value functions for a small subset of the
states of interest rather than evaluating them for the entire state space, as with DP methods. This significantly
reduces the computational burden, since value function estimates for less relevant states need not be updated.
Third, MC methods may be less sensitive to the violations of the Markov property of the system. Despite the
advantages mentioned above, MC methods are difficult to implement in real time, as the value function estimates
can only be updated at the end of an experiment. Further, MC methods are also known to exhibit slow
convergence as they do not bootstrap, i.e., they do not update their value function from other value estimates11.
Finally, the efficacy of MC methods in RL remains unsettled and is a subject of ongoing research11.
TD learning is another class of approximate methods for solving RL problems that combine ideas from DP
and MC. Like MC algorithms, TD methods can learn directly from raw experiences without requiring a model;
and like DP, TD methods update the value function in real time without having to wait until the end of the
experiment. For a detailed exposition on RL solutions, the reader is referred to11 and the references therein.
Reinforcement Learning has achieved remarkable success in robotics17,18, computer games19,20, online adver-
tising21, and board games22,23; however, its adaptation to process control has been limited (see Badgwell et al. 24
for recent survey of RL methods in process control) —even though many optimal scheduling and control problems
can be formulated as MDPs25. This is primarily due to lack of efficient RL algorithms to deal with infinite MDPs
(i.e., MDPs with continuous state and action spaces) that define most modern control systems. While existing
RL methods apply to infinite MDPs, exact solutions are possible only in special cases, such as in linear quadratic
(Gaussian) control problem, where DP provides a closed-form solution26,27. It is plausible to discretize infinite
MDPs and use DP to estimate the value function; however, this leads to an exponential growth in the computa-
tional complexity with respect to the states and actions, which is referred to as curse of dimensionality13. The
computational and storage requirements for discretization methods applied to most problems of practical interest
in process control remain unwieldy even with todays computing hardware28.
The first successful implementation of RL in process control appeared in the series of papers published in the
early 2000s25,28,29,30,31,32, where the authors proposed approximate dynamic programming (ADP) for optimal
3
control of discrete-time nonlinear systems. The idea of ADP is rooted in the formalism of DP but uses simulations
and function approximators (FAs) to alleviate the curse of dimensionality. Other RL methods based on heuristic
dynamic programming (HDP)33, direct HDP34, dual heuristic programming35, and globalized DHP36 have also
been proposed for optimal control of discrete-time nonlinear systems. RL methods have also been proposed for
optimal control of continuous-time nonlinear systems37,38,39. However, unlike discrete-time systems, controlling
continuous-time systems with RL has proven to be considerably more difficult and fewer results are available26.
While the contributions mentioned above establish the feasibility and adaptability of RL in controlling discrete-
time and continuous-time nonlinear processes, most of these methods assume complete or partial access to process
models25,28,29,30,31,38,39. This limits existing RL methods to processes for which high-accuracy models are either
available or can be derived through system identification.
Recently, several data-based approaches have been proposed to address the limitations of model-based RL in
control. In Lee and Lee 32 , Mu et al. 40 , a data-based learning algorithm was proposed to derive an improved
control policy for discrete-time nonlinear systems using ADP with an identified process model, as opposed to an
exact model. Similarly, Lee and Lee 32 proposed a Q-learning algorithm to learn an improved control policy in
a model-free manner using only input-output data. While these methods remove the requirement for having an
exact model (as in RL), they still present several issues. For example, the learning method proposed in Lee and
Lee 32 , Mu et al. 40 is still based on ADP, so its performance relies on the accuracy of the identified model. For
complex, nonlinear, stochastic systems, identifying a reliable process model may be nontrivial as it often requires
running multiple carefully designed experiments. Similarly, the policy derived from Q-learning in Lee and Lee 32
may converge to a sub-optimal policy as it avoids adequate exploration of the state and action spaces. Further,
calculating the policy with Q-learning for infinite MDPs requires solving a non-convex optimization problem over
the continuous action space at each sampling time, which may render it unsuitable for online deployment for
processes with small time constants or modest computational resources. Note that data-based RL methods have
also been proposed for continuous-time nonlinear systems41,42. Most of these data-based methods approximate
the solution to the Hamilton-Jacobi-Bellman (HJB) equation derived for a class of continuous-time affine nonlinear
systems using policy iteration. For more information on RL-based optimal control of continuous-time nonlinear
systems, the reader is referred to Luo et al. 41 , Wang et al. 42 , Tang and Daoutidis 43 and the references therein.
The promise of RL to deliver a real-time, self-learning controller in a model-free and adaptive environment
has long motivated the process systems community to explore novel approaches to apply RL in process control
applications. While the plethora of studies from over last two decades has provided significant insights to help
connect the RL paradigm with process control, they also highlight the nontrivial nature of this connection and
4
the limitations of existing RL methods in control applications. Motivated by recent developments in the area of
deep reinforcement learning (DRL), we revisit the problem of RL-based process control and explore the feasibility
of using DRL to bring these goals one step closer.
Deep Reinforcement Learning (DRL)
The recent resurgence of interest in RL results from the successful combination of RL with deep learning that
allows for effective generalization of RL to MDPs with continuous state spaces. The Deep-Q-Network (DQN)
proposed recently in Mnih et al. 19 combines deep learning for sensory processing44 with RL to achieve human-
level performance on many Atari video games. Using unprocessed pixels as inputs, simply through interactions,
the DQN can learn in real time the optimal strategy to play Atari. This was made possible using a deep neural
network function approximator (FA) to estimate the action-value function over the continuous state space, which
was then maximized over the action space to find the optimal policy. Before DQN, learning the action-value
function using FAs was widely considered difficult and unstable. Two innovations account for this latest gain
in stability and robustness: (a) the network is trained off-policy with samples from a replay buffer to minimize
correlations between samples, and (b) the network is trained with a target Q-network to give consistent targets
during temporal difference backups.
In this paper, we propose an off-policy actor-critic algorithm, referred to as a DRL controller, for controlling of
discrete-time nonlinear processes. The proposed DRL controller is a model-free controller designed based on TD
learning. As a data-based controller, the DRL controller uses two independent deep neural networks to generalize
the actor and critic to continuous state and action spaces. The DRL controller is based on the deterministic policy
gradient (DPG) algorithm proposed by Silver et al. 45 that combines an actor-critic method with insights from
DQN. The DRL controller uses ideas similar to those in Lillicrap et al. 18 , modified to make learning suitable for
process control applications. Several simulation examples of different complexities are presented to demonstrate
the efficacy of the DRL controller in set-point tracking problems.
The rest of the paper is organized as follows: in Section b, we introduce the basics of MDP and derive a
control policy based on value functions. Motivated by the intractability of optimal policy calculations for infinite
MDPs, we introduce Q-learning in Section b for solving RL problem over continuous state space. A policy
gradient algorithm is discussed in Section b to extend the RL solution to continuous action space. Combining
the developments in Sections b and b, a novel actor-critic algorithm is discussed in Section b. In Section b, a
DRL framework is proposed for data-based control of discrete-time nonlinear processes. The efficacy of a DRL
controller is demonstrated on several examples in Section b. Finally, Section b compares a DRL controller to an
5
MPC.
This paper follows a tutorial-style presentation to assist readers unfamiliar with the theory of RL. The material
is systematically introduced to highlight the challenges with existing RL solutions in process control applications
and to motivate the development of the proposed DRL controller. The background material presented here is
only introductory, and readers are encouraged to refer to the cited references for a detailed exposition on these
topics.
The Reinforcement Learning (RL) Problem
The RL framework consists of a learning agent (e.g., a controller) interacting with a stochastic environment (e.g.,
a plant or process), denoted by E , in discrete time steps. The objective in an RL problem is to identify a policy
(or control actions) to maximize the expected cumulative reward the agent receives in the long run11,15. The
RL problem is a sequential decision-making problem, in which the agent incrementally learns how to optimally
interact with the environment by maximizing the expected reward it receives. Intuitively, the agent-environment
interaction is to be understood as follows. Given a state space S and an action space A, the agent at time
step t ∈ N observes some representation of the environment’s state st ∈ S and on that basis selects an action
at ∈ A. One time step later, in part as a consequence of its action, the agent finds itself in a new state st+1 ∈ S
and receives a scalar reward rt ∈ R from the environment indicating how well the agent performed at t ∈ N.
This procedure repeats for all t ∈ N, as in the case of a continuing task or until the end of an episode. The
agent-environment interactions are illustrated in Figure 1.
Agent
Environment
Act
ion
at
Rew
ard
r t
Sta
tes t
st+1
rt+1
Figure 1: A schematic of the agent-environment interactions in a standard RL problem.
Markov Decision Process (MDP)
A Markov Decision Process (MDP) consists of the following: a state space S; an action space A; an initial state
distribution p(s1); and a transition distribution p(st+1|st, at)1 satisfying the following Markov property
p(st+1|st, at, . . . , s1, a1) = p(st+1|st, at), (1)
1To simplify notation, we drop the random variable in the conditional density and write p(st+1|st, at) = p(st+1|St = st, At = at).
6
for any trajectory s1, a1 . . . , sT , aT generated in the state-action space S ×A (for continuing tasks, T →∞, while
for episodic tasks, T ∈ N is the terminal time); and a reward function rt : S ×A → R. In an MDP, the transition
function (1) is the likelihood of observing a state st+1 ∈ S after the agent takes an action at ∈ A in state st ∈ S.
Thus∫S p(s|st, at) ds = 1 for all (st, at) ∈ S ×A.
A policy is used to describe the actions of an agent in the MDP. A stochastic policy is denoted by π : S → P(A),
where P(A) is the set of probability measures on A. Thus π(at|st) is the probability of taking action at when
in state st; we have∫A π(a|st) da = 1 for each st ∈ S. This general formulation allows any deterministic policy
µ : S → A, through the definition
π(a|st) =
1 if a = µ(st),
0 otherwise.
(2)
The agent uses π to sequentially interact with the environment to generate a sequence of states, actions and
rewards in S × A × R, denoted generically as h = (s1, a1, r1, . . . , sT , aT , rT ), where (H = h) ∼ pπ(h) is an
arbitrary history generated under policy π and distributed according to the probability density function (PDF)
pπ(·). Note that under the Markov assumption of the MDP, the PDF for the history can be decomposed as
follows:
pπ(h) = p(s1)
T∏t=1
p(st+1|st, at)π(at|st).
The total reward accumulated by the agent in a task from time t ∈ N onward is
Rt(h) =
∞∑k=t
γk−trt(sk, ak), (3)
Here γ ∈ [0, 1] is a (user-specified) discount factor: a unit reward has the value 1 right now, γ after one time step,
and γτ after τ sampling intervals. If we specify γ = 0, only the immediate reward has any influence; if we let
γ → 1, future rewards are considered more strongly—informally, the controller becomes ‘farsighted.’ While (3)
is defined for continuing tasks, i.e., T →∞, the notation can be applied also for episodic tasks by introducing a
special absorbing terminal state that transitions only to itself and generates rewards of 0. These conventions are
used to simplify the notation and to express close parallels between episodic and continuing tasks. See Sutton
and Barto 11 .
Value Function and Optimal Policy
The state-value function, V π : S → R assigns a value to each state, according to the given policy π. In state st,
V π(st) is the value of the future reward an agent is expected to receive by starting at st ∈ S and following policy
7
π thereafter. In detail,
V π(st) = Eh∼pπ(·)[Rt(h)|st]. (4)
The closely-related action-value function Qπ : S × A → R decouples the immediate action from the policy π,
assuming only that π is used for all subsequent steps:
Qπ(st, at) = Eh∼pπ(·)[Rt(h)|st, at]. (5)
The Markov property of the underlying dynamic process gives these two value functions a recurrent structure
illustrated below:
Qπ(st, at) = Est+1∼p(·|st,at) [r(st, at) + γπ(at+1|st+1)Qπ(st+1, at+1)] . (6)
Solving an RL problem amounts to finding a policy π? that outperforms all other policies across all possible
scenarios. Identifying π∗ will yield an optimal Q-function, Q?, such that
Q?(st, at) = maxπ
Qπ(st, at), (st, at) ∈ S ×A. (7)
Conversly, knowing the function Q? is enough to recover an optimal policy by making a “greedy” choice of action:
π?(a|st) =
1, if a = arg maxa∈AQ
?(st, a),
0, otherwise.
(8)
Note that this policy is actually deterministic. While solving the Bellman equation in (6) for the Q-function
provides an approach to finding an optimal policy in (8), and thus solving the RL problem, this solution is
rarely useful in practice. This is because the solution relies on two key assumptions – (a) the dynamics of the
environment is accurately known, i.e., p(st+1|st, at) is exactly known for all (st, at) ∈ S × A; and (b) sufficient
computational resources are available to calculate Qπ(st, at) for all (st, at) ∈ S × A. These assumptions are
a major impediment for solving process control problems, wherein, complex process behaviour might not be
accurately known, or might change over time, and the state and action spaces may be continuous. For an RL
solution to be practical, one typically needs to settle for approximate solutions. In the next section, we introduce
Q-learning that approximates the optimal Q-function (and thus the optimal policy) using the agent’s experiences
(or samples) as opposed to process knowledge. Such class of approximate solutions to the RL problem is called
the model-free RL methods.
Q-learning
Q-learning is one of the most important breakthroughs in RL46,47. The idea is to learn Q? directly, instead of
first learning Qπ and then computing Q? in (7). Q-learning constructs Q? through successive approximations.
8
Similar to (6), using the Bellman equation, Q? satisfies the identity
Q?(st, at) = Est+1∼p(·|st,at)[r(st, at) + γ maxa′∈A
Q?(st+1, a′)]. (9)
There are several ways to use (9) to compute Q?. The standard Q-iteration (QI) is a model-based method
that requires complete knowledge of the states, the transition function, and the reward function to evaluate
the expectation in (9). Alternatively, temporal difference (TD) learning is a model-free method that uses sam-
pling experiences, (st, at, rt, st+1), to approximate Q? 11. This is done as follows. First, the agent explores the
environment by following some stochastic behaviour policy, β : S → P(A), and receiving an experience tuple
(st, at, rt, st+1) at each time step. The generated tuple is then used to improve the current approximation of Q?,
denoted Qi, as follows:
Qi+1(st, at)← Qi(st, at) + αδ, (10)
where α ∈ (0, 1] is the learning rate, and δ is the TD error, defined by
δ = r(st, at) + γ maxa′∈A
Qi(st+1, a′)− Qi(st, at). (11)
The conditional expectation in (9) falls away in TD learning in (10), since st+1 in (st, at, rt, st+1) is distributed
according to the target density, p( · |st, at). Making the greedy policy choice using Qi+1 instead of Q? (recall (8))
produces
πi+1(st) = arg maxa∈A
Qi+1(st, a), (12)
where πi+1 is a greedy policy based on Qi+1. In the RL literature, (10) is referred to as policy evaluation and (12)
as policy improvement. Together, steps (10) and (12) are called Q-learning47. Pseudo-code for implementing this
approach is given in Algorithm 1. Algorithm 1 is a model-free, on-line and off-policy algorithm. It is model-free as
it does not require an explicit model of the environment, and on-line because it only utilizes the latest experience
tuple to implement the policy evaluation and improvement steps. Further, Algorithm 1 is off-policy because the
agent acts in the environment according to its behaviour policy β, but still learns its own policy, π. Observe that
the behaviour policy in Algorithm 1 is ε-greedy, in that it generates greedy actions for the most part but has a
non-zero probability, ε, of generating a random action. Note that off-policy is a critical component in RL as it
ensures a combination of exploration and exploitation. Finally, for Algorithm 1, it can be shown that Qi → Q?
with probability 1 as i→∞47.
Q-learning with Function Approximation
While Algorithm 1 enjoys strong theoretical convergence properties, it requires storing the Q-values for all state-
action pairs (s, a) ∈ S × A. For applications like control, where both S and A are infinite sets, some further
9
Algorithm 1 Q-learning
1: Output: Action-value function Q(s, a)2: Initialize: Arbitrarily set Q, e.g., to 0 for all states, set Q for terminal states as 03: for each episode do4: Initialize state s5: for each step of episode, state s is not terminal do6: a← action for s derived by Q, e.g., ε-greedy7: take action a, observe r and s′
8: δ ← r + γmaxa′ Q(s′, a′)−Q(s, a)9: Q(s, a)← Q(s, a) + αδ
10: s← s′
11: end for12: end for
simplification is required. Several authors have proposed space discretization methods48. While such discretiza-
tion methods may work for simple problems, in general, they are not efficient in capturing the complex dynamics
of industrial processes.
The problem of generalizing Q-learning from finite to continuous spaces has been studied extensively over
the last two decades. The basic idea in continuous spaces is to use a function approximator, or FA. A function
approximator Q(s, a, w) is a parametric function Q(s, a, w), whose parameters w are chosen to make Q(s, a, w) ≈
Q(s, a) for all (s× a) ∈ S ×A. This is achieved by minimizing the following quadratic loss function L:
Lt(w) = Est∼ρβ(·),at∼β(·|st)[(yt −Q(st, at, w))2
], (13)
where yt is the target and ρβ is a discounted state visitation distribution under behaviour policy β. The role of ρβ
is to weight L based on how frequently a particular state is expected to be visited. Further, as in any supervised
learning problem, the target, yt is given by Q?(st, at); however, since Q? is unknown, it can be replaced with its
approximation. A popular choice of an approximate target is a bootstrap target (or a TD target), given as
yt = Est+1∼p(·|st,at)[r(st, at) + γ maxa′∈A
Q(st+1, a′, w)]. (14)
In contrast to supervised learning, where the target is typically independent of model parameters, the target in
(14) depends on the FA parameters. Finally, (13) can be minimized using a stochastic gradient descent (SGD)
algorithm. An SGD is an iterative optimization method that adjusts w in the direction that would most reduce
L(w) for it. The update step for SGD is given as follows
wt+1 ← wt −1
2αc,t∇Lt(wt), (15)
where wt and wt+1 are the old and new parameter values, respectively, and αc,t is a positive step-size parameter.
Given (13), this gradient can be calculated as follows
∇Lt(wt) = −2Est∼ρβ(·),at∼β(·|st)[yt −Q(st, at, wt)]∇wQ(st, at, wt). (16)
10
To derive (16), it is assumed that yt is independent of w. Note that this is a common assumption in Q-learning
with TD targets11. Finally, after updating wt+1 (and computing Q(st+1, a′, wt+1)), the optimal policy can be
computed as follows
π(st+1) = arg maxa′∈A
Q(st+1, a′, wt+1). (17)
The pseudo-code for Q-learning with FA is given in Algorithm 2.
Algorithm 2 Q-learning with FA
1: Output: Action value function Q(s, a, w)2: Initialize: Arbitrarily set action-value function weights w (e.g., w = 0)3: for each episode do4: Initialize state s5: for each step of episode, state s is not terminal do6: a← action for s derived by Q, e.g., ε-greedy7: take action a, observe r and s′
8: y ← r + γmaxa′ Q(s′, a′, w)9: w ← w + αc(y −Q(s, a, w))∇wQ(s, a, w)
10: s← s′
11: end for12: end for
The effectiveness of Algorithm 2 depends on the choice of FA. Over the past decade, various FAs, both
parametric and non-parametric, have been proposed, including linear basis, Gaussian processes, radial basis, and
Fourier basis. For most of these choices, Q-learning with TD targets may be biased49. Further, unlike Algorithm
1, the asymptotic convergence of Q(s, a, w) → Q(s, a) with Algorithm 2 is not guaranteed. This is primarily
due to Algorithm 2 using an off-policy (i.e., behaviour distribution), bootstrapping (i.e., TD target) and FA
approach—a combination known in the RL community as the ‘deadly triad’11. The possibility of divergence
in the presence of the deadly triad are well known. Several examples have been published: see Tsitsiklis and
Van Roy 50 , Baird 51 , Fairbank and Alonso 52 . The root cause for the instability remains unclear—taken one
by one, the factors listed above are not problematic. There are still many open problems in off-policy learning.
Despite the lack of theoretical convergence guarantees, all three elements of the deadly triad are also necessary for
learning to be effective in practical applications. For example, FA is required for scalability and generalization,
bootstrapping for computational and data efficiency, and off-policy learning for decoupling the behaviour policy
from the target policy. Despite the limitations of Algorithm 2, recently, Mnih et al. 20,19 have successfully adapted
Q-learning with FAs to learn to play Atari games from pixels. For details, see Kober et al. 53 .
Policy Gradient
While Algorithm 2 generalizes Q-learning to continuous spaces, the method lacks convergence guarantees except
for with linear FAs, where it has been shown not to diverge. Moreover, target calculations in (14) and greedy
11
action calculations in (17) require maximization of the Q-function over the action space. Such optimization
steps are computationally impractical for large and unconstrained FAs, and for continuous action spaces. Control
applications typically have both these complicating characteristics, making Algorithm 2 is nontrivial to implement.
Algorithm 3 Policy Gradient – REINFORCE
1: Output: Optimal policy π(a|s, θ)2: Initialize: Arbitrarily set policy parameters θ3: for true do4: generate an episode s0, a0, r1, . . . , sT−1, aT−1, rT , following π(·|·, θ)5: for each step t of episode 0, 1, . . . T − 1 do6: Rt ← return from step t7: θ ← θ + αa,tγ
tRt∇θπ(at|st, θ)8: end for9: end for
Instead of approximating Q? and then computing π (see Algorithms 1 and 2), an alternative approach is to
directly compute the optimal policy, π? without consulting the optimal Q-function. The Q-function may still
be used to learn the policy, but is not required for action selection. Policy gradient methods are reinforcement
Learning algorithms that work directly in the policy space. Two separate formulations are available: the average
reward formulation and the start state formulation. In the start state formulation, the goal of an agent is to
obtain a policy πθ, parameterized by θ ∈ Rnθ , that maximizes the value of starting at state s0 ∈ S and following
policy πθ. For a given policy, πθ, the performance of the agent can be evaluated as follows:
J(πθ) = V πθ (s0) = Eh∼pπ(·)[R1(h)|s0], (18)
Observe that the agent’s performance in (18) is completely described by the policy parameters in θ. A policy
gradient algorithm maximizes (18) by computing an optimal value of θ. A stochastic gradient ascent (SGA)
algorithm incrementally adjusts θ in the direction of ∇θJ(πθ), such that
θt+1 ← θt + αa,t∇θJ(πθ)|θ=θt , (19)
where αa,t is the learning rate. Note that calculating ∇θJ(πθ) requires access to the distribution of states under
the current policy, which as noted earlier, is unknown for most processes. The implementation of policy gradient is
made effective by the policy gradient theorem that calculates a closed-form solution for ∇θJ(πθ) without reference
to the state distribution54. The policy gradient theorem establishes that
∇θJ(πθ) = Est∼ρπγ (·), at∼πθ(·|st)[Qπ(st, at)∇θ log πθ(at|st)
], (20)
where ρπγ (s) :=∑∞t=0 γ
tp(st = s|s0, πθ) is the discounted state visitation distribution and p(st|s0, π) is a t-step
ahead state transition density from s0 to st. Equation (20) gives a closed-form solution for the gradient in (19) in
terms of the Q-function and the gradient of the policy being evaluated. Further, (20) assumes a stochastic policy
12
(observe the expectation over the action space). Now, since the expectation is over the policy being evaluated,
i.e.. πθ, (20) is an on-policy gradient. Note that an off-policy gradient theorem can also be derived for a class of
stochastic policies. See Degris et al. 55 for details.
To implement (19) with the policy gradient theorem in (20), we replace the expectation in (19) with its
sample-based estimate, and replace Qπ with the actual returns, Rt. This leads to a policy gradient algorithm,
called REINFORCE56. Pseudo-code for REINFORCE is given in Algorithm 3. In contrast to Algorithms 1
and 2, Algorithm 3 avoids solving complex optimization problems over continuous action spaces and generalizes
effectively in continuous spaces. Further, unlike Q-learning that always learns a deterministic greedy policy,
Algorithm 3 supports both deterministic and stochastic policies. Finally, Algorithm 3 exhibits good convergence
properties, with the estimate in (19) guaranteed to converge to a local optimum if the estimation of ∇θJ(πθ) is
unbiased (see Sutton et al. 54 for a detailed proof).
Despite the advantages of Algorithm 3 over traditional Q-learning, Algorithm 3 is not amenable to online
implementation as it requires access to Rt – the total reward the agent is expected to receive at the end of an
episode. Further, replacing the Q-function by Rt leads to large variance in the estimation of ∇θJ(πθ)54,57, which
in turn leads to slower convergence58. An approach to address the issues mentioned above is to use a low-variance,
bootstrapped estimate of the Q-function, as in the actor-critic architecture.
Figure 2: A schematic of the actor-critic architecture
Actor-Critic Architecture
The actor-critic is a widely used architecture that combines the advantages of policy gradient withQ-learning54,55,59.
Like policy gradient, actor-critic methods generalize to continuous spaces, while the issue of large variance is coun-
tered by bootstrapping, such as Q-learning with TD update. A schematic of the actor-critic architecture is shown
in Figure 2. The actor-critic architecture consists of two eponymous components: an actor that finds an optimal
policy, and a critic that evaluates the current policy prescribed by the actor. The actor implements the pol-
13
icy gradient method by adjusting policy parameters using SGA, as shown in (19). The critic approximates the
Q-function in (20) using an FA. With a critic, (20) can be approximately written as follows:
∇θJ(πθ) = Est∼ρπγ (·),at∼πθ(·|st)[Qπ(st, at, w)∇θ log πθ(at|st)
], (21)
where w ∈ Rnw is recursively estimated by the critic using SGD in (15) and θ ∈ Rnθ is recursively estimated
by the actor using SGA by substituting (21) into (19). Observe that while the actor-critic method combines
the policy gradient with Q-learning, the policy is not directly inferred from Q-learning, as in (17). Instead, the
policy is updated in the policy gradient direction in (19). This avoids the costly optimization in Q-learning,
and also ensures that changes in the Q-function only result in small changes in the policy, leading to less or no
oscillatory behaviour in the policy. Finally, under certain conditions, implementing (19) with (21) guarantees
that θ converges to the local optimal policy54,55.
Deterministic Actor-Critic Method
For a stochastic policy πθ, calculating the gradient in (20) requires integration over the space of states and actions.
As a result, computing the policy gradient for a stochastic policy may require many samples, especially in high-
dimensional action spaces. To allow for efficient calculation of the policy gradient in (20), Silver et al. 45 propose
a deterministic policy gradient (DPG) framework. This assumes that the agent follows a deterministic policy
µθ : S → A. For a deterministic policy, at = µθ(st) with probability 1, where θ ∈ Rnθ is the policy parameter.
The corresonding performance in (18) can be written as follows
J(µθ) = V µθ (s0) = Eh∼pµ(·)[ ∞∑t=1
γk−1r(st, µθ(st))|s0]. (22)
Similar to the policy gradient theorem in (20), Silver et al. 45 proposed a deterministic gradient theorem to
calculate the gradient of (22) with respect to the policy parameter θ:
∇θJ(µθ) = Est∼ρµγ (·)[∇aQµ(st, a)|a=µθ(st)∇θµθ(st)
], (23)
where ρµγ is the discounted state distribution under policy µθ (similar to ρπγ in (20)). Note that unlike (20),
the gradient in (23) only involves expectation over the states generated according to µθ. This makes the DPG
framework computationally more efficient to implement compared to the stochastic policy gradient.
To ensure that the DPG continues to explore the state and action spaces satisfactorily, it is possible to
implement DPG off-policy. For a stochastic behaviour policy β and a deterministic policy µθ, Silver et al. 45
showed that a DPG exists and can be analytically calculated as
∇θJ(µθ) = Est∼ρβγ (·)[∇aQµ(st, a)|a=µθ(st)∇θµθ(st)
], (24)
14
Algorithm 4 Deterministic Off-policy Actor-Critic Method
1: Output: Optimal policy µθ(s)2: Initialize: Arbitrarily set policy parameters θ and Q-function weights w3: for true do4: initialize s, the first state of the episode5: for s is not terminal do6: a ∼ β(·|s)7: take action a, observe s′ and r8: y ← r + γQµ(s′, µθ(s
′), w)9: w ← w + αw(y −Qµ(s, µθ(s), w))∇wQµ(s, a, w)
10: θ ← θ + αθ∇θµθ(s)∇aQµ(s, a, w)|a=µθ(s)11: s← s′
12: end for13: end for
Figure 3: A deep neural network representation of (a) the actor, and (b) the critic. The red circles represent the input and outputlayers and the black circles represent the hidden layers of the network.
where ρβγ is the discounted state distribution under behaviour policy β. Compared to (23), the off-policy DPG in
(24) involves expectation with respect to the states generated by a behaviour policy β. Finally, the off-policy DPG
can be implemented using the actor-critic architecture. The policy parameters, θ, can be recursively updated by
the actor using (19), where ∇θJ(µθ) is given in (24); and the Q-function, Qµ(st, a) in (24) is replaced with a
critic, Qµ(st, a, w), whose parameter vector w is recursively estimated using (15). Pseudo-code for the off-policy
deterministic actor-critic is given in Algorithm 4.
Deep Reinforcement Learning (DRL) Controller
In this section, we connect the terminologies and methods for RL discussed in Section b, and propose a new
controller, referred to as a DRL controller. The proposed DRL controller is a model-free controller based on
DPG. The DRL controller is implemented using the actor-critic architecture and uses two independent deep
neural networks to generalize the actor and critic to continuous state and action spaces.
15
States, Actions and Rewards
We consider discrete dynamical systems with input and output sequences ut and yt, respectively. For simplicity,
we focus on the case where the outputs yt contain full information on the state of the system to be controlled.
Removing this hypothesis is an important practical element of our ongoing research in this area.
The correspondence between the agent’s action in the RL formulation and the plant input from the control
perspective is direct: we identify at = ut. The relationship between the RL state and the state of the plant is
subtler. The RL state, st ∈ S, must capture all the features of the environment on the basis of which the RL
agent acts. To ensure that the agent has access to relevant process information, we define the RL state as a tuple
of the current and past outputs, past actions and current deviation from the set-point ysp, such that
st := 〈yt, . . . , yt−dy , at−1, . . . , at−da , (yt − ysp)〉, (25)
where dy ∈ N and da ∈ N denote the number of past output and input values, respectively. In this paper, it is
assumed that dy and da are known a priori. Note that for da = 0, dy = 0, the RL state is ‘memoryless’, in that
st := 〈yt, (yt − ysp)〉. (26)
Further, we explore only deterministic policies expressed using a single-valued function µ : S → A, so that for
each state st ∈ S, the probability measure on A defining the next action puts weight 1 on the single point µ(st).
The goal for the agent in a set-point tracking problem is to find an optimal policy, µ, that reduces the tracking
error. This objective is incorporated in the RL agent by means of a reward function r : S × A × S → R, whose
aggregate value the agent tries to maximize. In contrast with MPC, where the controller minimizes the tracking
error over the space of control actions, here the agent maximizes the reward it receives over the space of policies.
We consider two reward functions.
The first – an `1-reward function – measures the negative `1-norm of the tracking error. Mathematically, for
a multi-input and multi-output (MIMO) system with ny outputs, the `1-reward is
r(st, at, st+1) = −ny∑i=1
|yi,t − yi,sp|, (27)
where yi,t ∈ Y are the i-th output, yi,sp ∈ Y is the set-point for the i-th output. Variants of the `1-reward function
are presented and discussed in Section b. The second reward – a polar reward function – assigns a 0 reward if the
tracking error is a monotonically decreasing function at each sampling time for all ny outputs or −1 otherwise.
Mathematically, the polar reward function is
r(st, at, st+1) =
0 if |yi,t − yi,sp| > |yi,t+1 − yi,sp| ∀i ∈ {1, . . . , ny}
−1 otherwise
(28)
16
Observe that a polar reward (28) incentivizes gradual improvements in tracking performance, which leads to less
aggressive control strategy and a smoother tracking compared to the `1-reward in (27).
Policy and Q-function Approximations
In this section, we discuss the FAs used by the DRL controller to approximate the policy and the Q-function. We
use neural networks to generalize the policy and the Q-function to continuous state and action spaces (see b for
basics of neural networks). The policy, µ, is represented using a deep feed-forward neural network, parameterized
by weights Wa ∈ Rna , such that given st ∈ S and Wa, the policy network produces an output at = µ(st,Wa).
Similarly, the Q-function is also represented using a deep neural network, parameterized by weights Wc ∈ Rnc ,
such that given (st, at) ∈ S ×A and Wc, the Q-network outputs Qµ(st, at,Wc).
Figure 4: A schematic of the proposed DRL controller architecture for set-point tracking problems.
Learning Control Policies
As noted, the DRL controller uses a deterministic off-policy actor-critic architecture to learn the control policy
in set-point tracking problems. The proposed method is similar to Algorithm 4 or the one proposed by Lillicrap
et al. 18 , but modified for set-point tracking problems. In the proposed DRL controller architecture, the actor is
represented by µ(st,Wa) and a critic is represented by Q(st, at,Wc). A schematic of the proposed DRL controller
architecture is illustrated in Figure 3. The critic network predicts Q-values for each state-action pair, and the
actor network proposes an action for a given state. The goal is then to learn the actor and critic neural network
parameters by interacting with the process plant. Once the networks are trained, the actor network is used to
compute the optimal action for any given state.
For a given actor-critic architecture, the network parameters can be readily estimated using SGD (or SGA);
however, these methods are not effective in applications, where the state, action and reward sequences are tem-
porally correlated. This is because SGD provides a 1-sample network update, assuming that the samples are
independently and identically distributed (iid). For dynamic systems, for which a tuple (st, at, rt, st+1) may be
17
temporally correlated to the past tuples (e.g., (st−1, at−1, rt−1, st)), up to the Markov order, such iid network
updates in presence of correlations are not effective.
To make learning more effective for dynamic systems, we propose to break the inherent temporal correlations
between the tuples by randomizing it. To this effect, we use a batch SGD, rather than a single sample SGD
for network training. As in DQN, we use a replay memory (RM) for batch training of the networks. The RM
is a finite memory that stores a a large number of tuples, denoted as {(s(i), a(i), r(i), s′(i))}Ki=1, where K is the
size of the RM. As a queue data structure, the latest tuples are always stored in the RM, and the old tuples
are discarded to keep the cache size constant. At each time, the network is updated by uniformly sampling M
(M ≤ K) tuples from the RM. The critic update in Algorithm 4 with RM results in a batch stochastic gradient,
with the following update step
Wc ←Wc +αcM
M∑i=1
(y(i) −Q(s(i), µ(s(i),Wa),Wc))∇WcQµ(s(i), µ(s(i),Wa),Wc), (29)
where
y(i) ← r(i) + γQµ(s′(i), µ(s′(i),Wa),Wc), (30)
for all i = 1, . . . ,M . The batch update, similar to (29), can also be derived for the actor network using an RM.
We propose another strategy to further stabilize the actor-critic architecture. First, observe that the parame-
ters of the critic network, Wc, being updated in (29) are also used in calculating the target, y in (30). Recall that,
in supervised learning, the actual target is independent of Wc; however, as discussed in Section b, in absence of
actual targets, we use Wc in (30) to approximate the target values. Now, if the Wc updates in (29) are erratic,
then the target estimates in (30) are also erratic, and may cause the network to diverge, as observed in Lillicrap
et al. 18 . We propose to use a separate network, called target network to estimate the target in (30). Our solution
is similar to the target network used in Lillicrap et al. 18 , Mnih et al. 19 . Using a target network, parameterized
by W ′c ∈ Rnc , (30) can be written as follows
y(i) ← r(i) + γQµ(s′(i), µ(s′(i),W ′a),W ′c), (31)
where
W ′c ← τWc + (1− τ)W ′c, (32)
and 0 < τ < 1 is the target network update rate. Observe that the parameters of the target network in (32) are
updated by having them slowly track the parameters of the critic network, Wc. In fact, (32) ensures that the
target values change slowing, thereby improving the stability of learning. A similar target network can also be
used for stabilizing the actor network.
18
Note that while we consider the action space to be continuous, in practice, it is also often bounded. This is
because the controller actions typically involve changing bounded physical quantities, such as flow rates, pressure
and pH. To ensure that the actor network produces feasible controller actions, it is important to bound the
network over the feasible action space. Note that if the networks are not constrained then the critic network will
continue providing gradients that encourage the actor network to take actions beyond the feasible space.
In this paper, we assume that the action space, A is an interval action space, such that A = [aL, aH ], where
aL < aH (the inequality holds element-wise for systems with multiple inputs). One approach to enforce the
constraints on the network is to bound the output layer of the actor network. This is done by clipping the
gradients used by the actor network in the update step (see Step 10 in Algorithm 4). For example, using the
clipping gradient method proposed in60, the gradient used by the actor, ∇aQµ(s, a, w), can be clipped as follows
∇aQµ(s, a, w)← ∇aQµ(s, a, w)×
(aH − a)/(aH − aL), if ∇aQµ(s, a, w) increases a
(a− aL)/(aH − aL), otherwise
(33)
Note that in (33), the gradients are down-scaled as the controller action approaches the boundaries of its range,
and are inverted if the controller action exceeds the range. With (33) in place, even if the critic continually
recommends increasing the controller action, it will converge to the its upper bound, aH . Similarly, if the critic
decides to decrease the controller action, it will decrease immediately. Using (33) ensures that the controller
actions are within the feasible space. Now, putting all the improvement strategies discussed in this section,
a schematic of the proposed DRL controller architecture for set-point tracking is shown in Figure 4, and the
pseudo-code is provided in Algorithm 5.
As outlined in Algorithm 5, first we randomly initialize the parameters of the actor and critic networks (Step
2). We then create a copy of the network parameters and initialize the parameters of the target networks (Step
3). The final initialization step is to supply the RM with a large enough collection of tuples (s, a, s′, r) on which
to begin training the RL controller (Step 4). Each episode is preceded by two steps. First, we ensure the state
definition st in (25) is defined for the first dy steps by initializing a sequence of actions 〈a−1, . . . , a−n〉 along with
the corresponding sequence of outputs 〈y0, . . . , y−n+1〉 where n ∈ N is sufficiently large. A set-point is then fixed
for the ensuing episode (Steps 6-7).
Next, for a given user-defined set-point, the actor, µ(s,Wa) is queried, and a control action is implemented on
the process (Steps 8–10). Implementing µ(s,Wa) on the process generates a new output yt+1. The controller then
receives a reward r, which is the last piece of information needed to store the updated tuple (s, a, r, s′) in the RM
(Steps 9–12). M uniformly sampled tuples are generated from the RM to update the actor and critic networks
19
Algorithm 5 Deep Reinforcement Learning Controller
1: Output: Optimal policy µ(s,Wa)2: Initialize: Wa,Wc to random initial weights3: Initialize: W ′a ←Wa and W ′c ←Wc
4: Initialize: Replay memory with random policies5: for each episode do6: Initialize an output history 〈y0, . . . , y−n+1〉 from an action history 〈a−1, . . . , a−n〉7: Set ysp ← set-point from the user8: for each step t of episode 0, 1, . . . T − 1 do9: Set s← 〈yt, . . . , yt−dy , at−1, . . . , at−da , (yt − ysp)〉
10: Set at ← µ(s,Wa) +N11: Take action at, observe yt+1 and r12: Set s′ ← 〈yt+1, . . . , yt+1−dy , at, . . . , at+1−da , (yt+1 − ysp)〉13: Store tuple (s, at, s
′, r) in RM14: Uniformly sample M tuples from RM15: for i = 1 to M do16: Set y(i) ← r(i) + γQµ(s′(i), µ(s′(i),W ′a),W ′c)17: end for18: Set Wc ←Wc + αc
M
∑Mi=1(y(i) −Qµ(s(i), a(i),Wc))∇Wc
Qµ(s(i), a(i),Wc)19: for i = 1 to M do20: Calculate ∇aQµ(s(i), a,Wc)|a=a(i)21: Clip ∇aQµ(s(i), a,Wc)|a=a(i) using (33)22: end for23: Set Wa ←Wa + αa
M
∑Mi=1∇Waµ(s(i),Wa)∇aQµ(s(i), a,Wc)|a=a(i)
24: Set W ′a ← τWa + (1− τ)W ′a25: Set W ′c ← τWc + (1− τ)W ′c26: end for27: end for
using batch gradient methods (Steps 14–23). Finally, for a fixed τ , we update the target actor and target critic
networks in Steps 24–25. The above steps are then repeated until the end of the episode.
To ensure that proposed method adequately explores the state and action spaces, the behavior policy for the
agent is constructed by adding noise sampled from a noise process, N , to our actor policy, µ (see Step 10 in
Algorithm 5). Generally, N is to be chosen to suit the process. We use a zero-mean Ornstein-Uhlenbeck (OU)
process61 to generate temporally correlated exploration samples; however, the user can define their own noise
process. We also allow for random initialization of system outputs at the start of an episode to ensure that the
policy is not stuck in a local optimum. Finally, it is important to highlight that Algorithm 5 is a fully automatic
algorithm that learns the control policy in real-time by continuously interacting with the process.
Network Structure and Implementation
The actor and critic neural networks each had 2 hidden layers with 400 and 300 units, respectively. We initialized
the actor and critic networks using uniform Xavier initialization62. Each unit was modeled using a Rectified
non-linearity activation function. Example b differs slightly: the output layers were initialized from a uniform
distribution over [−0.003, 0.003], and we elected to use a tanh activation for the second hidden layer. For all
examples, the network hidden layers were batch-normalized using the method in Ioffe and Szegedy 63 to ensure
20
that the training is effective in processes where variables have different physical units and scales. Finally, we
implemented the Adam optimizer in Kingma and Ba 64 to train the networks and regularized the weights and
biases of the second hidden layer with an L2 weight decay of 0.0001.
Algorithm 5 was scripted in Python and implemented on an iMac (3.2 GHz Intel Core i5, 8 GB RAM). The
deep networks for the actor and critic were built using Tensorflow65. For Example b, we trained the networks
on Amazon g2.2xlarge EC2 instances; the matrix multiplications were performed on a Graphics Processing
Unit (GPU) available in the Amazon g2.2xlarge EC2 instances. The hyper-parameters for Algorithm 5 are
process-specific and need to be selected carefully. We list in Table 1 the nominal values for hyper-parameters used
in all of our numerical examples (see Section b); additional specifications are listed at the end of each example.
An optimal selection of hyper-parameters is crucial, but is beyond the scope of the current work.
Table 1: The hyper-parameters used in Algorithm 5.
Hyper-parameter Symbol Nominal value
Actor learning rate αa 10−4
Critic learning rate αc 10−4
Target network update rate τ 10−3
OU process parameters N θ = .15, σ = .30
DRL Controller Tuning
In this section, we provide some general guidelines and recommendations for implementing the DRL controller
(see Algorithm 5) for set-point tracking problems.
(a) Episodes: To begin an episode, we randomly sample an action from the the action space and let the system
settle under that action before starting the time steps in the main algorithm. For the examples considered in
Section b, each episode is terminated after 200 time steps or when |yi,t − yi,sp| ≤ ε for all 1 ≤ i ≤ ny for
5 consecutive time steps, where ε is a user-defined tolerance. For processes with large time constants, it is
recommended to run the episodes longer to ensure that the transient and steady-state dynamics are effectively
captured in the input-output data.
(b) Rewards: We consider the reward hypotheses (27) and (28) in our examples. A possible variant of (27) is
given as follows
r(st, at, st+1) =
c if |yi,t − yi,sp| ≤ ε ∀i ∈ {1, 2, . . . , ny}
−∑nyi=1 |yi,t − yi,sp| otherwise.
(34)
where yi,t ∈ Y are the i-th output, yi,sp ∈ Y is the set-point for the i-th output, c ∈ R+ is a constant reward,
and ε ∈ R+ is a user-defined tolerance. According to (34), the agent receives the maximum reward, c, only if
21
all the outputs are within the tolerance limit set by the user. The potential advantage of (34) over (27) is that
it can lead to faster tracking. On the other hand, the control strategy learned by the RL agent is generally
more aggressive. Ultimately, we prefer the `1-reward function due the gradual increasing behavior of the function
as the RL agent learns. Note that under this reward hypothesis the tolerance ε in (34) should be the same as
the early termination tolerance described previously. These hypotheses are well-suited for set-point tracking as
they use tracking error to generate rewards; however, in other problems, it is imperative to explore other reward
hypothesis. Some examples, include – (i) negative of 2-norm of the control error instead of 1-norm in (27); (ii)
weighted rewards for different outputs in a MIMO system; (iii) economic reward function.
(c) Gradient-clipping: The gradient-clipping in (33) ensures that the DRL controller outputs are feasible and
within the operating range. Also, defining tight lower and upper limits on the inputs has a significant effect on
the learning rate. In general, setting tight limits on the inputs lead to faster convergence of the DRL controller.
(d) RL state: The definition of an RL state is critical as it affects the performance of the DRL controller. As
shown in (25), we use a tuple of current and past outputs, past actions and current deviation from the set-point
as the RL state. While (25) is well-suited for set-point tracking problems, it is instructive to highlight that the
choice of an RL state is not unique, as there may be other RL states for which the DRL controller could have
a similar or better performance. In our experience, including additional relevant information in the RL states
improves the overall performance and convergence-rate of the DRL controller. The optimal selection of an RL
state is not within the scope of the paper; however, it certainty is an important area of research that warrants
additional investigation.
(e) Replay memory: We initialize the RM with M tuples (s, a, s′, r) generated by simulating the system
response under random actions, where M is the batch size used for training the networks. During initial learning,
having a large RM is essential as it gives DRL controller access to the data generated from the old policy. Once
the actor and critic networks are trained, and the system attains steady-state, the RM need not be updated as
the input-output data no longer contributes new information to the RM.
(f) Learning: The DRL controller learns the policy in real-time and continues to refine it as new experiences are
collected. To reduce the computational burden of updating the actor and critic networks at each sampling time,
it is recommended to ‘turn-off’ learning once the set-point tracking is complete. For example, if the difference
between the output and the set-point is within certain predefined threshold, the networks need not be updated.
We simply terminate the episode once the DRL controller has tracked the set-point for 5 consecutive time steps.
(g) Exploration: Once the actor and critic networks are trained, it is recommended that the agent stops
exploring the state and action spaces. This is because adding exploration noise to a trained network adds
22
unnecessary noise to the control actions as well as the system outputs.
Simulation Results
The efficacy of the DRL controller outlined in Algorithm 5 is demonstrated on four simulation examples. The
first three examples, include: a single-input-single-output (SISO) paper-making machine; a multi-input-multi-
output (MIMO) high purity distillation column; and a nonlinear MIMO heating, ventilation, and air conditioning
(HVAC) system. The fourth example evaluates the robustness of Algorithm 5 to process changes.
Example 1: Paper Machine
In the pulp and paper industry, a paper machine is used to form the paper sheet and then remove water from
the sheet by various means, such as vacuuming, pressing, or evaporating. The paper machine is divided into two
main sections: wet-end and dry-end. The sheet is formed in the wet-end on a continuous synthetic fabric, and
the dry-end of the machine removes the water in the sheet through evaporation. The paper passes over rotating
steel cylinders, which are heated by super-heated pressurized steam. Effective control of the moisture content in
the sheets is an active area of research.
Figure 5: Simulation results for Example 1 – (a) moving average with window size 21 (the number of distinct set-points used intraining) for the total reward per episode; (b) tracking performance of the DRL controller on a sinusoidal reference signal; and (c)the input signal generated by the DRL controller.
In this section, we design a DRL controller to control the moisture content in the sheet, denoted by yt (in
percentage), to a desired set-point, ysp. There are several variables that affect the drying time of the sheet, such
as machine speed, steam pressure, drying fabric, etc. For simplicity, we only consider the effect of steam flow-rate,
denoted by at (in m3/hr) on yt. For simulation purposes, we assume that at and yt are related according to the
following discrete-time transfer function
G(z) =0.05z−1
1− 0.6z−1. (35)
Note that (35) is strictly used for generating episodes, and is not used in the design of the DRL controller. In
fact, the user has complete freedom to substitute (35) with any linear or nonlinear process model, in case of
simulations; or with the actual data from the paper machine, in case of industrial implementation. Next, we
23
implement Algorithm 5 with the hyper-parameters listed in Tables 1–2 and additional specifications discussed at
the end of the example.
Figure 6: Simulation results for Example 1 – (a) tracking performance of the DRL controller on randomly generated set-points outsideof the training interval [0, 10]; (b) the input signal generated by the DRL controller.
The generated data is then used for real-time training of the actor and critic networks in the DRL controller.
Figure 5(a) captures the increasing trend of the cumulative reward per episode by showing the moving average
across the number of set-points used in training. From Figure 5(a), it is clear that as the DRL controller interacts
with the process, it learns a progressively better policy, leading to higher rewards per episode. In less than 100
episodes, the DRL controller is able to learn a policy suitable for basic set-point tracking; it was able to generalize
effectively in the remaining 400 episodes, as illustrated in Figures 5(b)-(c) and 6.
Figures 5(b)-(c) showcase the DRL controller, putting aside the physical interpretation of our system (35).
We highlight the smooth tracking performance of the DRL controller on a sinusoidal reference signal in Figure
5(b), as this exemplifies the responsiveness and robustness of the DRL controller to tracking problems far different
from the fixed collection of set-points in the interval [0, 10] seen during training.
To illustrate this further, observe in Figure 6(a) that the DRL controller is able to track randomly generated
set-points with little overshoot, both inside and outside the training output space Y. Finally, noting that our
action space for training is [0, 100], Figure 6(b) shows the DRL controller is operating outside the action space
to track the largest set-point.
Table 2: Specifications for Example 1.
Hyper-parameter Symbol Nominal value
Episodes 500Mini-batch size M 128Replay memory size K 5× 104
Reward discount factor γ 0.99Action space A [0, 100]Output space Y [0, 10]
Further implementation details: We define the states according to (26) and select the `1-reward function in
24
(27). Concretely,
r(st, at, st+1) = −|yt − ysp|, (36)
where ysp is uniformly sampled from {0, 0.5, 1, . . . , 9.5, 10} at the beginning of each episode. We added zero-mean
Gaussian noise with variance .01 to the outputs during training. We defined the early-termination tolerance for
the episodes to be ε = .01.
Example 2: High-purity Distillation Column
In this section, we consider a two-product distillation column as shown in Figure 7. The objective of the distillation
column is to split the feed, which is a mixture of a light and a heavy component, into a distillate product, y1,t, and
a bottom product, y2,t. The distillate contains most of the light component and the bottom product contain most
of the heavy component. Figure 7 has multiple manipulated variables; however, for simplicity, we only consider
two inputs: boilup rate, u1,t and the reflux rate, u2,t. Note that the boilup and reflux rates have immediate
effect on the product compositions. The distillation column shown in Figure 7 has a complex nonlinear dynamics;
however, for simplicity, we linearize the model around the steady-sate value. The transfer function for the
distillation column is given as follows66
G(s) =
0.878
τs+ 1− 0.864
τs+ 1
1.0819
τs+ 1−1.0958
τs+ 1
, where τ > 0. (37)
Further, to include measurement uncertainties, we add a zero-mean Gaussian noise to (37) with variance 0.01.
The authors in66 showed that a distillation column with a model representation in (37) is ill-conditioned. In other
words, the plant gain strongly depends on the input direction, such that inputs in directions corresponding to
high plant gains are strongly amplified by the plant, while inputs in directions corresponding to low plant gains
are not. The issues around controlling ill-conditioned processes are well known to the community67,68.
Next, we implement Algorithm 5 with the reward hypothesis in (27). Figures 8(a) and (b) show the perfor-
mance of the DRL controller in tracking the distillate and the bottom product to desired set-points for a variety of
random initial starting points. Observe that starting from some initial condition, the DRL controller successfully
tracks the set-points within t = 75 minutes. Figures 8(c) and (d) show the boilup and reflux rates selected by the
DRL controller, respectively. This example, clearly highlights the efficacy of the DRL controller in successfully
tracking a challenging, ill-conditioned MIMO system. Finally, it is instructive to highlight that the model in (37)
is strictly for generating episodes, and is not used by the DRL controller. The user has freedom to replace (37)
either with a complex non-linear model, or with the actual plant data.
25
Condenser
holdupCondenser
DistillateReflux
ReboilerBoilup
Bottom Product
Reboilerholdup
Feed
Overheadvapor
y1,t
y2,t
u2,t
u1,t
Figure 7: A schematic of a two-product distillation column in Example 2. Adapted from66.
Further implementation details: First, note our system (37) needs to be discretized when implementing Al-
gorithm 5. We used the `1-reward hypothesis given in (27) and defined the RL state as in (26). To speed up
learning, it is important to determine which pairs of set-points make sense in practice. To address this, we
refer to Figure 9. We uniformly sample constant input signals from [0, 50] × [0, 50] and let the MIMO system
(37) settle for 50 time steps. Figure 9(a) shows these settled outputs. Clearly, we have very little flexibility
when initializing feasible set-points in Algorithm 5. Therefore, we only select set-points pairs (y1,sp, y2,sp) where
y1,sp, y2,sp ∈ {0, .5, 1, . . . , 4.5, 5} and |y1,sp − y2,sp| ≤ .5 Finally, Figure 9(a) shows the outputs in a restricted
subset of the plane, [−5, 10] × [−5, 10]; the outputs can far exceed this even when sampling action pairs within
the action space. A quick way to initialize an episode such that the outputs begin around the desired output
space, [0, 5] × [0, 5], is to initialize action pairs according to a linear regression such as the one shown in Figure
9(b); we also added zero-mean Gaussian noise with variance 1 to this line. We reiterate that these steps are
implemented purely to speed up learning, as the DRL controller will saturate and accumulate a large amount of
negative reward when tasked with set-points outside the scope of Figure 9(a). Further, this step does not solely
rely on the process model (37), as similar physical insights could be inferred from real plant data or by studying
the reward signals associated with each possible set-point pair as Algorithm 5 progresses. Figure 11 at the end of
Example b demonstrates a possible approach to determining feasible set-points and effective hyper-parameters.
Table 3: Specifications for Example 2.
Hyper-parameter Symbol Nominal value
Episodes 5,000Mini-batch size M 128Replay memory size K 5× 105
Reward discount factor γ 0.95Action space A [0, 50]Output space Y [0, 5]
26
Figure 8: Simulation results for Example 2 – (a) and (b) set-point tracking for distillate and bottom product, respectively, where thedifferent colors correspond to different starting points; (c) and (d) boilup and reflux rates selected by the DRL controller, respectively.
Figure 9: Set-point selection for Example b – (a) the distribution of settled outputs after simulating the MIMO system (37) withuniformly sampled actions in [0, 50]× [0, 50]; (b) the corresponding action pairs such that the settled outputs are both within [0, 5],along with a linear regression of these samples.
Example 3: HVAC System
Despite the advances in research on HVAC control algorithms, most field equipment is controlled using classical
methods, such as hysteresis/on/off and Proportional Integral and Derivative (PID) controllers. Despite their
popularity, these classical methods do not perform optimally. The high thermal inertia of buildings induces
large time delays in the building dynamics, which cannot be handled efficiently by the simple on/off controllers.
Furthermore, due to the high non-linearity in building dynamics coupled with uncertainties such as weather, energy
pricing, etc., these PID controllers require extensive re-tuning or auto-tuning capabilities69, which increases the
difficulty and complexity of the control problem70.
Due to these challenging aspects of HVAC control, various advanced control methods have been investigated,
27
Figure 10: Simulation results for Example 3 – (a) the response of the discharge air temperature to the set-point change and (b) thecontrol signal output calculated by the DRL controller.
ranging from non-linear model predictive control71 to optimal control72. Between these methods, the quality of
control performance relies heavily on accurate process identification and modelling; however, large variations exist
with building design, zone layout, long-term dynamics and wide ranging operating conditions. In addition, large
disturbance effects from external weather, occupancy schedule changes and varying energy prices make process
identification a very challenging problem70.
In this section we demonstrate the efficacy of the proposed model-free DRL controller to control the HVAC
system. First, to generate episodes, we assume that the HVAC system can be modelled as follows73:
Two,t = Two,t−1 + θ1fw,t−1(Twi,t−1 − Two,t−1) +(θ2 + θ3fw,t−1 + θ4fa,t−1
)[Tai,t−1 − Tw,t−1], (38a)
Tao,t = Tao,t−1 + θ5fa,t−1(Tai,t−1 − Tao,t−1) + (θ6 + θ7fw,t−1 + θ8fa,t−1)(Tw,t−1 − Tai,t−1) + θ9(Tai,t − Tai,t−1),
(38b)
Tw,t = 0.5[Twi,t + Two,t], (38c)
where Two ∈ R+ and Tao ∈ R+ are the outlet water and discharge air temperatures (in ◦ C), respectively;
Twi ∈ R+ and Tai ∈ R+ are the inlet water and air temperatures (in ◦ C), respectively; fw ∈ R+ and fa ∈ R+
are the water and air mass flow rates (in kg/s), respectively; and θi ∈ R for i = 1, . . . , 9 are various physical
parameters, with nominal values given in73. The water flow rate, fw, is assumed to be related to the controller
action as follows:
fw,t = θ10 + θ11at + θ11a2t + θ12a
3t , (39)
where at is the control signal in terms of 12-bits and θj , for j = 10, 11, 12 are valve model parameters given in73.
In (38a)–(38c), Twi, Tai and fa are disturbance variables that account for the set-point changes and other
external disturbances. Furthermore, Twi, Tai and fa are assumed to follow constrained random-walk models, such
28
that
0.6 ≤ fa,t ≤ 0.9; 73 ≤ Twi,t ≤ 81; 4 ≤ Tai,t ≤ 10, (40)
for all t ∈ N. For simplicity, we assume that the controlled variable is Tao and the manipulated variable is fw. The
objective is to design a DRL controller to achieve desired discharge air temperature, yt ≡ Tao,t by manipulating
at.
Next, we implement Algorithm 5 to train the DRL controller for the HVAC system. Using the `1-reward
hypothesis in (27) we observed a persistent offset in Tao to set-point changes. Despite increasing the episode
length, the DRL controller was still unable to eliminate the offset with (27). Next, we implement Algorithm 5
with the polar reward hypothesis in (28). With (28), the response to control actions was found to be less noisy and
offset-free. In fact, Algorithm 5 performed much better with (28) in terms of noise and offset reduction, compared
to (27). Figure 10(a) shows the the response of the discharge air temperature to a time-varying set-point change.
Observe that the DRL controller is able to track the set-point changes in less than 50 seconds with small offset.
Moreover, it is clear that the response to the change in control action with (28) is smooth. Finally, the control
signal generated by the DRL controller is shown in Figure 10(b). This example demonstrates the efficacy of
Algorithm 5 in learning the control policy of complex nonlinear processes simply by interacting with the process
in real-time.
Table 4: Specifications for Example 3.
Hyper-parameter Symbol Nominal value
Episodes 100,000Mini-batch size M 64Replay memory size K 106
Reward discount factor γ 0.99Action space A [150, 800]Output space Y [30, 35]
Further implementation details: We contrast the exceedingly long training time listed in Table 4 to generate
Figure 10 by noting from Figure 11 that the DRL controller was able to learn to track a fixed set-point, ysp = 30,
in fewer than 1000 episodes. In general, training the DRL controller with a single set-point is a reasonable starting
point for narrowing the parameter search when implementing Algorithm 5.
Example 4: Robustness to Process Changes
In this section, we evaluate the robustness of DRL controller to adapt to abrupt process changes. We revisit the
example from the pulp and paper industry in Section b.
We run Algorithm 5 continuously with the same specifications as in Example b for a fixed set-point ysp = 2.
As shown in Figures 12(a) and (b), in the approximate interval 0 ≤ t ≤ 1900, the DRL agent is learning the
29
Figure 11: Snapshots during training of the input and output signals – (a) progression of set-point tracking performance of the DRLcontroller; (b) the respective controller actions taken by the DRL controller.
control policy. Then at around t = 1900 we turn off the exploration noise and stop training the actor and critic
networks because the moving average of the errors |yt − 2| across the past four time steps was sufficiently small
(less than 0.0001 in our case). Next, at around t = 2900, a sudden process change is introduced by doubling the
process gain in (35). Observe that as soon as the process change is introduced, the policy learned by the DRL
controller is no longer valid. Consequently, the system starts to deviate from the set-point. However, since the
DRL controller learns in real-time, as soon as the process change is introduced, the controller starts re-learning a
new policy. Observe that for the same set-point ysp = 2, the DRL controller took about 400 seconds to learn the
new policy after the model change was introduced (see Figures 12(a) and (b) for t ≥ 2900). This demonstrates
how the DRL controller first learns the process dynamics and then learns the control policy – both in real-time,
without access to a priori process information.
Figure 12: Simulation results for Example 4 – (a) set-point tracking performance of the DRL controller under abrupt process changes;(b) the controller action taken by the DRL controller.
Comparison with Model Predictive Control
Model Predictive Control (MPC) is well-established and widely deployed. The proposed DRL controller has a
number of noteworthy differences.
30
(a) Process model: An accurate process model is a key ingredient in any model predictive controller. MPC
relies on model-based predictions and model-based optimization to compute the control action at each sampling
time74. In contrast, the DRL controller does not require any a priori access to a process model: it develops the
control policy in real time by interacting with the process.
(b) Constraints: Industrial control systems are subject to operational and environmental constraints that need
to be satisfied at all sampling times. MPC handles these constraints by explicitly incorporating them into the
optimization problem solved at each step75,76. In the DRL controller, constraints can be embedded directly in
the reward function or enforced through gradient-clipping—see (33). The latter approach is softer: constraint
violation may occur during training, but it is discouraged by triggering an enormous penalty in the reward signal.
Robust mechanisms for safe operation of a fully trained system, and indeed, for safe operation during online
training, are high priorities for further investigation.
(c) Prediction horizon: MPC refers to a user-specified prediction horizon when optimizing a process’s future
response. The duration of the MPC’s planning horizon determines the influence of future states; it may be either
finite or infinite, depending on the application. The corresponding adjustment in the DRL controller comes
through the discount factor γ ∈ [0, 1] used to determine the present value of future rewards.
(d) State estimation: The performance of MPC relies on an accurate estimate of the hidden process states over
the length of the horizon. The state estimation problem is typically formulated as a filtering problem77,78, with
Kalman filtering providing the standard tool for linear processes, while nonlinear systems remain an active area
of research79,80. In this paper, we consider the DRL controller for cases where the full system state is available
for policy improvement. Extending the design to include filtering and/or observer design will be the subject of
future work.
(e) Adaptation: Practical control systems change over time. Maintaining acceptable performance requires
responding to both changes and unknown disturbances. Traditional MPC-based systems include mechanisms for
detecting model-plant mismatch and responding as appropriate . . . typically by initiating interventions in the
process to permit re-identification of the process model. This process, which calls for simultaneous state and
parameter estimation, can be both challenging and expensive81,82,83. Some recent variants of MPC, such as
robust MPC and stochastic MPC, take model uncertainties into account by embedding them directly into the
optimization problem76. The DRL controller, on the other hand, updates the parameters in the actor and critic
networks at each sampling time using the latest experiences. This gives the DRL controller a self-tuning aspect,
so that process operations remain nearly optimal with respect to the selected reward criterion.
31
Limitations
In the decades since MPC was first proposed its theoretical foundations have become well established. Strong
convergence and stability proofs are available for several MPC formulations, covering both linear and nonlin-
ear systems84. At this comparatively early stage, the DRL controller comes with no such optimality guarantees.
Indeed, the assumptions of discrete-time dynamics and full state observation are both obvious challenges demand-
ing further investigation and development. (For example, while the temporal difference error for a discrete-time
system can be computed in a model-free environment, the TD error formulation for a continuous-time system
requires complete knowledge of its dynamics85.) Other issues of interest include the role of data requirements,
computational complexity, non-convexity, over- and under-fitting, performance improvement, hyper-parameter
selection and initialization.
Conclusions
We have developed an adaptive deep reinforcement learning (DRL) controller for set-point tracking problems
in discrete-time nonlinear processes. The DRL controller is a data-based controller based on a combination of
reinforcement learning and deep-learning. As a model-free controller, this DRL controller learns the control policy
in real time, simply by interacting with the process. The efficacy of the DRL controller has been demonstrated
in set-point tracking problems on a SISO, a MIMO and a non-linear system with external disturbances. The
DRL controller shows significant promise as an alternative to traditional model-based industrial controllers, even
though some practical and theoretical challenges remain to be overcome. As both computing power and data
volumes continue to increase, DRL controllers have the potential to become an important tool in process control.
Acknowledgements
We would like to thank MITACS for financial support through the Globalink Graduate Fellowship. The first
author would like to thank Kevin and Zilun for helpful discussions.
Basics of Neural Networks
An artificial neural network (NN) is a mapping F : U → Y from the input space U ⊆ Rnu to the output space
Y ⊆ Rny whose internal structure is inspired by natural networks of neurons, connected in several “layers”. Each
layer defines a simple input-output relation of the form
y = f(Wu), (.1)
32
where W ∈ Rny×nu is a weight matrix carrying the vector space of inputs to the vector space of outputs, and
f is a (typically nonlinear) “activation function” acting elementwise on its input. Common activation functions
include
f(u) = σ(u) :=1
1 + exp−u, f(u) = tanh(u) :=
exp2u−1
exp2u +1, f(u) = ReLU(u) := max {0, u} . (.2)
The weight matrices and activation functions may differ from layer to layer, so that a 3-layer NN model could
have the general form
y = W4f3(W3(f2(W2(f1(W1u))))). (.3)
(Choosing the identity for the final activation function is quite common.) Different sparsity patterns in the weight
matrices Wk allow for different connections between neurons in successive layers. Completing the definition of F
requires selecting the elements of the weight matrices Wk: an optimization problem is formulated to minimize the
discrepancy between the input-output pairs produced by the mapping F and a collection of input-output pairs
the network is expected to “learn”. Typically the dimensions ny and nu are very large, so the number of choice
variables in this optimization is enormous. Some form of gradient descent is the only practical scheme for the
iterative solution of the problem, a process evocatively called training the NN.
33
List of Figures
Fig. 1: A schematic of the agent-environment interactions in a standard RL problem.
Fig. 2: A schematic of the actor-critic architecture
Fig. 3: A deep neural network representation of (a) the actor, and (b) the critic. The red circles represent the
input and output layers and the black circles represent the hidden layers of the network.
Fig. 4: A schematic of the proposed DRL controller architecture for set-point tracking problems.
Fig. 5: Simulation results for Example 1 – (a) moving average with window size 21 (the number of distinct set-
points used in training) for the total reward per episode; (b) tracking performance of the DRL controller on a
sinusoidal reference signal; and (c) the input signal generated by the DRL controller.
Fig. 6: Simulation results for Example 1 – (a) tracking performance of the DRL controller on randomly generated
set-points outside of the training interval [0, 10]; (b) the input signal generated by the DRL controller.
Fig. 7: A schematic of a two-product distillation column in Example 2. Adapted from66.
Fig. 8: Simulation results for Example 2 – (a) and (b) set-point tracking for distillate and bottom product,
respectively, where the different colors correspond to different starting points; (c) and (d) boilup and reflux rates
selected by the DRL controller, respectively.
Fig. 9: Set-point selection for Example b – (a) the distribution of settled outputs after simulating the MIMO
system (37) with uniformly sampled actions in [0, 50] × [0, 50]; (b) the corresponding action pairs such that the
settled outputs are both within [0, 5], along with a linear regression of these samples.
Fig. 10: Simulation results for Example 3 – (a) the response of the discharge air temperature to the set-point
change and (b) the control signal output calculated by the DRL controller.
Fig. 11: Snapshots during training of the input and output signals – (a) progression of set-point tracking perfor-
mance of the DRL controller; (b) the respective controller actions taken by the DRL controller.
Fig. 12: Simulation results for Example 4 – (a) set-point tracking performance of the DRL controller under abrupt
process changes; (b) the controller action taken by the DRL controller.
34
References
[1] Tulsyan A, Garvin C, Undey C. Machine-learning for biopharmaceutical batch process monitoring with limited data. IFAC-
PapersOnLine 2018;51(18):126–131.
[2] Tulsyan A, Garvin C, Undey C. Advances in industrial biopharmaceutical batch process monitoring: Machine-learning methods
for small data problems. Biotechnology and bioengineering 2018;115(8):1915–1924.
[3] Tulsyan A, Garvin C, Undey C. Industrial batch process monitoring with limited data. Journal of Process Control 2019;77:114–
133.
[4] Kano M, Ogawa M. The state of the art in advanced chemical process control in Japan. IFAC Proceedings Volumes
2009;42(11):10–25.
[5] Tulsyan A, Khare SR, Huang B, Gopaluni RB, Forbes JF. Bayesian identification of non-linear state-space models: Part I-Input
design. IFAC Proceedings Volumes 2013;46(32):774–779.
[6] Seborg DE, Edgar TF, Shah SL. Adaptive control strategies for process control: a survey. AIChE Journal 1986;32(6):881–913.
[7] Krstic M, Kanellakopoulos I, Kokotovic PV. Nonlinear and Adaptive Control Design. ohn Wiley & Sons, New York; 1995.
[8] Mailleret L, Bernard O, Steyer JP. Nonlinear adaptive control for bioreactors with unknown kinetics. Automatica
2004;40(8):1379–1385.
[9] Guan C, Pan S. Adaptive sliding mode control of electro-hydraulic system with nonlinear unknown parameters. Control
Engineering Practice 2008;16(11):1275–1284.
[10] Spielberg SPK, Gopaluni RB, Loewen PD. Deep reinforcement learning approaches for process control. In: Proceedings of the
6th IFAC Symposium on Advanced Control of Industrial Processes Taipei, Taiwan; 2017. p. 201–206.
[11] Sutton RS, Barto AG. Reinforcement learning: An introduction. MIT Press, Cambridge; 1998.
[12] Bertsekas DP, Tsitsiklis JN. Neuro-dynamic programming: an overview. In: Proceedings of the 34th IEEE Conference on
Decision and Control, vol. 1 IEEE Publ. Piscataway, NJ; 1995. p. 560–564.
[13] Bertsekas DP. Dynamic Programming and Optimal Control. Athena Scientific, MA; 2005.
[14] Powell WB. Approximate Dynamic Programming: Solving the Curses of Dimensionality. John Wiley & Sons, New York; 2007.
[15] Sugiyama M. Statistical Reinforcement Learning: Modern Machine Learning Approaches. CRC Press, Florida; 2015.
[16] Bellman R. Dynamic Programming. Dover Publications, New York; 2003.
[17] Lehnert L, Precup D. Policy Gradient Methods for Off-policy Control. arXiv Preprint, arXiv:151204105 2015;.
[18] Lillicrap TP, Hunt JJ, Pritzel A, Heess N, Erez T, Tassa Y, et al. Continuous control with deep reinforcement learning. arXiv
Preprint, arXiv:150902971 2015;.
[19] Mnih V, Kavukcuoglu K, Silver D, Graves A, Antonoglou I, Wierstra D, et al. Playing Atari with deep reinforcement learning.
arXiv Preprint, arXiv:13125602 2013;.
[20] Mnih V, Kavukcuoglu K, Silver D, Rusu AA, Veness J, Bellemare MG, et al. Human-level control through deep reinforcement
learning. Nature 2015;518:529–533.
[21] Pednault E, Abe N, Zadrozny B. Sequential cost-sensitive decision making with reinforcement learning. In: Proceedings of the
8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining Edmonton, Canada; 2002. p. 259–268.
[22] Tesauro G. Temporal difference learning and TD-Gammon. Communications of the ACM 1995;38(3):58–68.
[23] Silver D, Huang A, Maddison CJ, Guez A, Sifre L, Van Den Driessche G, et al. Mastering the game of Go with deep neural
networks and tree search. Nature 2016;529:484–489.
35
[24] Badgwell TA, Lee JH, Liu KH. Reinforcement Learning–Overview of Recent Progress and Implications for Process Control. In:
Computer Aided Chemical Engineering, vol. 44 Elsevier; 2018.p. 71–85.
[25] Lee JH, Lee JM. Approximate dynamic programming based approach to process control and scheduling. Computers & Chemical
Engineering 2006;30(10-12):1603–1618.
[26] Lewis FL, Vrabie D, Vamvoudakis KG. Reinforcement learning and feedback control: using natural decision methods to design
optimal adaptive controllers. IEEE Control Systems 2012;32(6):76–105.
[27] Morinelly JE, Ydstie BE. Dual mpc with reinforcement learning. IFAC-PapersOnLine 2016;49(7):266–271.
[28] Lee JM, Kaisare NS, Lee JH. Choice of approximator and design of penalty function for an approximate dynamic programming
based control approach. Journal of process control 2006;16(2):135–156.
[29] Lee JM, Lee JH. Neuro-dynamic programming method for MPC1. IFAC Proceedings Volumes 2001;34(25):143–148.
[30] Kaisare NS, Lee JM, Lee JH. Simulation based strategy for nonlinear optimal control: application to a microbial cell reactor.
International Journal of Robust and Nonlinear Control 2003;13(3-4):347–363.
[31] Lee JM, Lee JH. Simulation-based learning of cost-to-go for control of nonlinear processes. Korean Journal of Chemical
Engineering 2004;21(2):338–344.
[32] Lee JM, Lee JH. Approximate dynamic programming-based approaches for input–output data-driven control of nonlinear
processes. Automatica 2005;41(7):1281–1288.
[33] Al-Tamimi A, Lewis FL, Abu-Khalaf M. Discrete-time nonlinear HJB solution using approximate dynamic programming:
convergence proof. IEEE Transactions on Systems, Man, and Cybernetics, Part B) 2008;38(4):943–949.
[34] Si J, Wang YT. Online learning control by association and reinforcement. IEEE Transactions on Neural networks 2001;12(2):264–
276.
[35] Heydari A, Balakrishnan SN. Finite-horizon control-constrained nonlinear optimal control using single network adaptive critics.
IEEE Transactions on Neural Networks and Learning Systems 2013;24(1):145–157.
[36] Wang D, Liu D, Wei Q, Zhao D, Jin N. Optimal control of unknown nonaffine nonlinear discrete-time systems based on adaptive
dynamic programming. Automatica 2012;48(8):1825–1832.
[37] Vrabie D, Lewis F. Neural network approach to continuous-time direct adaptive optimal control for partially unknown nonlinear
systems. Neural Networks 2009;22(3):237–246.
[38] Vamvoudakis KG, Lewis FL. Online actor–critic algorithm to solve the continuous-time infinite horizon optimal control problem.
Automatica 2010;46(5):878–888.
[39] Liu D, Wang D, Li H. Decentralized stabilization for a class of continuous-time nonlinear interconnected systems using online
learning optimal control approach. IEEE Transactions on Neural Networks and Learning Systems 2014;25(2):418–428.
[40] Mu C, Wang D, He H. Novel iterative neural dynamic programming for data-based approximate optimal control design.
Automatica 2017;81:240–252.
[41] Luo B, Wu HN, Huang T, Liu D. Data-based approximate policy iteration for affine nonlinear continuous-time optimal control
design. Automatica 2014;50(12):3281–3290.
[42] Wang D, Liu D, Zhang Q, Zhao D. Data-based adaptive critic designs for nonlinear robust optimal control with uncertain
dynamics. IEEE Transactions on Systems, Man, and Cybernetics: Systems 2016;46(11):1544–1555.
[43] Tang W, Daoutidis P. Distributed adaptive dynamic programming for data-driven optimal control. Systems & Control Letters
2018;120:36–43.
[44] Krizhevsky A, Sutskever I, Hinton GE. Imagenet classification with deep convolutional neural networks. In: Proceedings of the
36
Advances in Neural Information Processing Systems; 2012. p. 1097–1105.
[45] Silver D, Lever G, Heess N, Degris T, Wierstra D, Riedmiller M. Deterministic policy gradient algorithms. In: Proceedings of
the 31st International Conference on Machine Learning Beijing, China; 2014. .
[46] Watkins CJCH. Learning from delayed rewards. PhD thesis, King’s College, Cambridge; 1989.
[47] Watkins CJ, Dayan P. Q-learning. Machine learning 1992;8(3-4):279–292.
[48] Sutton RS. Generalization in reinforcement learning: successful examples using sparse coarse coding. In: Proceedings of the
Advances in Neural Information Processing Systems Denver, USA; 1996. p. 1038–1044.
[49] Lazaric A, Ghavamzadeh M, Munos R. Finite-sample analysis of LSTD. In: Proceedings of the 27th International Conference
on Machine Learning Haifa, Israel; 2010. p. 615–622.
[50] Tsitsiklis JN, Van Roy B. Analysis of temporal-diffference learning with function approximation. In: Proceedings of the Advances
in Neural Information Processing Systems Denver, USA; 1997. p. 1075–1081.
[51] Baird L. Residual algorithms: Reinforcement learning with function approximation. In: Machine Learning Proceedings 1995
Elsevier; 1995.p. 30–37.
[52] Fairbank M, Alonso E. The Divergence of Reinforcement Learning Al-gorithms with Value-Iteration and Function Ap-
proximation. arXiv preprint arXiv:11074606 2011;.
[53] Kober J, Bagnell JA, Peters J. Reinforcement learning in robotics: A survey. The International Journal of Robotics Research
2013;32(11):1238–1274.
[54] Sutton RS, McAllester DA, Singh SP, Mansour Y. Policy gradient methods for reinforcement learning with function approxi-
mation. In: Proceedings of the Advances in Neural Information Processing Systems; 2000. p. 1057–1063.
[55] Degris T, White M, Sutton RS. Off-policy actor-critic. arXiv Preprint, arXiv:12054839 2012;.
[56] Williams RJ. Simple statistical gradient-following algorithms for connectionist reinforcement learning. In: Reinforcement
Learning Springer; 1992.p. 5–32.
[57] Riedmiller M, Peters J, Schaal S. Evaluation of policy gradient methods and variants on the cart-pole benchmark. In: Proceedings
of the International Symposium on Approximate Dynamic Programming and Reinforcement Learning Honolulu, Hawaii; 2007.
p. 254–261.
[58] Konda VR, Tsitsiklis JN. Actor-critic algorithms. In: Proceedings of the Advances in Neural Information Processing Systems
Denver, USA; 2000. p. 1008–1014.
[59] Bhatnagar S, Ghavamzadeh M, Lee M, Sutton RS. Incremental natural actor-critic algorithms. In: Proceedings of the Advances
in Neural Information Processing Systems Vancouver, Canada; 2008. p. 105–112.
[60] Hausknecht M, Stone P. Deep reinforcement learning in parameterized action space. arXiv Preprint, arXiv:151104143 2015;.
[61] Uhlenbeck GE, Ornstein LS. On the theory of the Brownian motion. Physical review 1930;36(5):823–841.
[62] Glorot X, Bengio Y. Understanding the difficulty of training deep feedforward neural networks. Proceedings of the thirteenth
international conference on artificial intelligence and statistics 2010;1:249–256.
[63] Ioffe S, Szegedy C. Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv Preprint,
arXiv:150203167 2015;.
[64] Kingma DP, Ba J. Adam: a method for stochastic optimization. arXiv Preprint, arXiv:14126980 2014;.
[65] Abadi M, Agarwal A, Barham P, Brevdo E, Chen Z, Citro C, et al. Tensorflow: large-scale machine learning on heterogeneous
distributed systems. arXiv Preprint, arXiv:160304467 2016;.
[66] Skogestad S, Morari M, Doyle JC. Robust control of ill-conditioned plants: high-purity distillation. IEEE Transactions on
37
Automatic Control 1988;33(12):1092–1105.
[67] Waller JB, Boling JM. Multi-variable nonlinear MPC of an ill-conditioned distillation column. Journal of Process Control
2005;15(1):23–29.
[68] Mollov S, Babuska R. Analysis of interactions and multivariate decoupling fuzzy control for a binary distillation column.
International Journal of Fuzzy Systems 2004;6(2):53–62.
[69] Wang YG, Shi ZG, Cai WJ. PID autotuner and its application in HVAC systems. In: Proceedings of the American Control
Conference, vol. 3 Arlington, USA; 2001. p. 2192–2196.
[70] Wang Y, Velswamy K, Huang B. A Long-Short Term Memory Recurrent Neural Network Based Reinforcement Learning
Controller for Office Heating Ventilation and Air Conditioning Systems. Processes 2017;5(3):1–18.
[71] Moradi H, Saffar-Avval M, Bakhtiari-Nejad F. Nonlinear multivariable control and performance analysis of an air-handling unit.
Energy and Buildings 2011;43(4):805–813.
[72] Greensfelder EM, Henze GP, Felsmann C. An investigation of optimal control of passive building thermal storage with real time
pricing. Journal of Building Performance Simulation 2011;4(2):91–104.
[73] Underwood DM, Crawford RR. Dynamic Nonlinear Modeling of a Hot-Water-to-Air Heat Exchanger for Control Applications.
American Society of Heating, Refrigerating and Air-Conditioning Engineers; 1991.
[74] Pon Kumar SS, Tulsyan A, Gopaluni B, Loewen P. A Deep Learning Architecture for Predictive Control. In: Proceedings of
the 10th IFAC Conference on the Advanced Control of Chemical Processes Shenyang, China; 2018. .
[75] Wallace M, Pon Kumar SS, Mhaskar P. Offset-free model predictive control with explicit performance specification. Industrial
& Engineering Chemistry Research 2016;55(4):995–1003.
[76] Lee JH, Wong W. Approximate dynamic programming approach for process control. Journal of Process Control 2010;20(9):1038–
1048.
[77] Tulsyan A, Tsai Y, Gopaluni RB, Braatz RD. State-of-charge estimation in lithium-ion batteries: A particle filter approach.
Journal of Power Sources 2016;331:208–223.
[78] Tulsyan A, Huang B, Gopaluni RB, Forbes JF. A particle filter approach to approximate posterior Cramer-Rao lower bound:
The case of hidden states. IEEE Transactions on Aerospace and Electronic Systems 2013;49(4):2478–2495.
[79] Tulsyan A, Gopaluni RB, Khare SR. Particle filtering without tears: a primer for beginners. Computers & Chemical Engineering
2016;95(12):130–145.
[80] Tulsyan A, Huang B, Gopaluni RB, Forbes JF. Performance assessment, diagnosis, and optimal selection of non-linear state
filters. Journal of Process Control 2014;24(2):460–478.
[81] Tulsyan A, Khare S, Huang B, Gopaluni B, Forbes F. A switching strategy for adaptive state estimation. Signal Processing
2018;143:371–380.
[82] Tulsyan A, Huang B, Gopaluni RB, Forbes JF. On simultaneous on-line state and parameter estimation in non-linear state-space
models. Journal of Process Control 2013;23(4):516–526.
[83] Tulsyan A, Huang B, Gopaluni RB, Forbes JF. Bayesian identification of non-linear state-space models: Part II-Error Analysis.
IFAC Proceedings Volumes 2013;46(32):631–636.
[84] Mayne DQ. Model predictive control: Recent developments and future promise. Automatica 2014;50(12):2967–2986.
[85] Bhasin S. Reinforcement learning and optimal control methods for uncertain nonlinear systems. University of Florida; 2011.
38