Sudan Engineering Society Journal, March 2013, Volume 59, No. 1
A DETAILED APPROACH TO REINFORCEMENT LEARNING: A SEMI-BATCH
REACTOR CASE STUDY
Mustafa Abbas Mustafa (1) and Tony Wilson (2)
(1) Department of Chemical Engineering, Faculty of Engineering, University of Khartoum.
(2) Chemical and Environmental Engineering, Faculty of Engineering, The University of Nottingham, United Kingdom
Received Sep. 2012, accepted after revision Jan 2013
Abstract
The transient nature of semi-batch reactors, coupled with the unavailability of
accurate mathematical models and online measurements, restricts achieving optimal
operation. Nevertheless, one finds that operators have managed, through experience, to improve on previous performance. Reinforcement Learning (RL) has already been identified as an approach to mimic this interactive learning process; however, its core elements have not been presented in detail for direct application. This work aims to provide a
blueprint of the RL approach and a validation, through MATLAB implementation,
against a published case study. Moreover, the initial training data set is modified to
confirm the convergence of the algorithm.
Keywords: Reinforcement Learning, ANN, Optimization, Control, Semi-Batch Reactor
1. Introduction
Batch processing is an important sector of
the chemical process industry. In
comparison to continuous processes, batch
processes are typically used for the production of fine or
specialty chemicals, pharmaceuticals,
biochemicals, and polymers. There has been
an increasing interest in multi-product
batch production, so as to adjust better to
changing market conditions [1]. Although different degrees of instrumentation and
automation could be found in industry,
many batch reactors are still operated
manually [2]. Despite the existence of an
important amount of literature on batch
unit optimization using an exact process
model and optimal control methods, these
methodologies are rarely part of everyday
industrial practice [3]. Conventional optimal
control methods based on perfect process
models and continuous measurements are difficult to apply in the industrial
environment. The main reasons are the scarcity (and sometimes delay) of online measurements; the unavailability of accurate mathematical models; batch-to-batch variations (including raw material variability); and the inherent unsteady-state nature of batch processes [3].
Despite all this, many industrial processes
are operated with acceptable levels of
performance by human operators. The
operators use a combination of good
engineering insight, judgment and ability to
learn incrementally to define, implement,
and improve control policies for a great
variety of process tasks. On the other hand,
maintaining consistent quality becomes
difficult, due to changes in operations from
shift to shift and differences in skill level among operators. This results in the need to
develop methodologies and software tools
that can provide automation in batch
process units.
Martinez et al. [4,5,6] recognized the potential of applying the Reinforcement Learning (RL) algorithm to batch chemical processes, and applied the algorithm to a semi-batch reactor case study. However, no detailed explanation was provided, which gave the main impetus for this work. The RL
algorithm is implemented using MATLAB
and compared against the same case study
reported by Martinez et al. [4], primarily to
validate the RL algorithm, but also to extend
the results obtained.
2. Reinforcement Learning
In the late 1980s, Reinforcement Learning
emerged as an integration of three threads
that had been pursued independently. The
threads are: Trial and error learning;
Optimal control; Temporal-difference
learning methods.
The first thread started in the psychology of
animal learning, and revolved around the
trial and error nature of their learning.
Sutton and Barto's [7] review identifies Edward Thorndike [8] as the first to express the essence of trial and error learning, in the following words:
"of several responses made to the same
situation, those which are accompanied or
closely followed by satisfaction to the
animal will, other things being equal, be
more firmly connected with the situation, so
that, when it recurs, they will be more likely
to recur; those which are accompanied or
closely followed by discomfort to the animal
will, other things being equal, have their
connections with that situation weakened,
so that, when it recurs, they will be less
likely to occur. The greater the satisfaction
or discomfort, the greater the strengthening
or weakening of the bond ".
In essence, actions followed by good or bad outcomes have their tendency to be reselected strengthened or weakened accordingly. Thorndike called
this the "Law of Effect", since it describes
the effect of reinforcing events on the tendency to select actions. The two most
important aspects of trial and error learning
are: It is selectional, in other words, it
involves trying alternatives and selecting
among them by comparing their
consequences; it is associative, in the sense
that alternatives found by selection are
associated with particular situations.
Hence, the combination of those two aspects is essential to trial and error
learning, as it is to the Law of Effect. In
other words, the Law of Effect is an
elementary way of combining search of
many actions in a given situation, and
memory of the best actions in the specific
situations. This thread makes up a big part
of the work in Artificial Intelligence, and led
to renewed interest in Reinforcement
Learning in the early 1980s.
The second thread deals with the optimal
control problem and the use of value
functions by Dynamic Programming
solutions. The term "optimal control" was
first used in the 1950s to describe the
problem of designing a controller, so as to
minimize a measure of a dynamical system's
behavior over time. In the mid-1950s,
Richard Bellman developed one solution to
this problem. His approach uses the concept
of a dynamical system's state, and a Value
Function to define a set of functional
equations, referred to now as the Bellman
equations. Later, the class of methods for
solving optimal control problems by solving the Bellman equations came to be known as
Dynamic Programming [9].
Markovian Decision Processes (MDPs), a
discrete stochastic version of the optimal
control problem, were introduced by
Bellman [10]. Howard [11] later devised the
Policy Iteration Method for MDPs. All of
these make up the basic elements
underlying the theory and algorithms of Reinforcement Learning. The literature
contains many developments relating to
Dynamic Programming e.g. Bertsekas [12],
Puterman [13], Ross [14]. In addition,
Bryson [15] provides a history of optimal
control.
The final thread is a smaller and less distinct
thread concerning Temporal-Difference
methods of learning. Temporal Difference
(TD) methods are a general framework for
solving sequential prediction and control
problems, whereby an agent learns by
comparing temporally successive
predictions. The important part is that the
agent can learn before seeing the final
outcome. Nowadays, the field of Temporal
Difference covers more general methods for
learning to make long-term predictions
about dynamical systems (Sutton and Barto
[7], Watkins [16], Werbos [17]). This may
be particularly relevant in predicting
financial data, life spans, and weather
patterns.
2.1 RL Optimization Problem
Following the book on the subject by Sutton and Barto [7], one could define Reinforcement Learning simply as the mapping of situations to actions so as to maximize a numerical reward. An important point to add is that the learner (e.g. an operator) is not told which actions to take, but must explore different options, and exploit what is already known about the process, to discover the actions that yield the highest reward.
The main elements of Reinforcement
Learning comprise an agent (e.g.
operator, software) and an environment.
The agent is simply the controller, which
interacts with the environment by selecting certain actions. The environment then
responds to those actions and presents new
situations to the agent. The agent’s
decisions are based on signals from the
environment, called the environment's
state. Figure 1 shows the main framework
of Reinforcement Learning. This is a typical
Reinforcement Learning problem,
characterized by:
1. Setting of explicit goals.
2. Breaking of the problem into decision steps.
3. Interaction with the environment.
4. Sense of uncertainty.
5. Sense of cause and effect.
the approximations of Q* and a* become
closer and closer to the actual values. After
completion of learning of the Value
Function, the Reinforcement Learning
algorithm is used to compute the optimal
action at every state.
An overview of the RL algorithm identifies
the following components and concepts:
1. Value Function: objective function reflecting how good/bad it is to be at a certain state and take a given action.
2. Bellman Optimality Equations: convergence criteria.
3. Wire Fitting: approximating the Value Function.
4. Neural Network: part of the Wire Fitting approximation.
5. Predictive Models: used at each stage, instead of the actual model, to provide a one-step-ahead prediction of states.
An initial training data set is provided and
the RL algorithm is executed offline. Following the completion of the learning
phase, the RL algorithm is implemented
online. The control policy is then to
calculate the optimal action a* for every
state encountered during progress of the
batch run. At the end of the batch run, the
training data set is updated, followed by
update of the predictive models and testing
of convergence criteria.
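As a structural summary of the procedure just described, the following sketch shows how offline learning alternates with online application from one batch to the next. It is a minimal outline in Python rather than the authors' MATLAB implementation, and every name in it (rl_campaign, fit_models, learn_value_fn, run_batch) is a hypothetical placeholder.

def rl_campaign(initial_runs, n_new_batches, fit_models, learn_value_fn, run_batch):
    # Training data set: all batch runs seen so far (states, actions, final PI).
    data = list(initial_runs)
    for _ in range(n_new_batches):
        models = fit_models(data)            # update the stage-wise predictive models
        q = learn_value_fn(data, models)     # offline: drive the Bellman error down
        data.append(run_batch(q, models))    # online: apply a* at every stage, log the run
    return data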
2.3 RL Methodology: A Detailed Description

The criterion used for convergence is the Bellman Optimality Equation (Equation 3):

$$Q^*(s_t, a_t) = E\left\{\, r_{t+1} + \gamma \max_{a_{t+1}} Q^*(s_{t+1}, a_{t+1}) \right\} \qquad (3)$$

Since the rewards (rt) are not known in advance until the run has been completed, r(st, a) = 0 for t < T was imposed. Also, γ is set to 1, since the problem breaks down into episodes. Hence, the Bellman Optimality Equation could be rewritten as follows:

$$Q^*(s_t, a_t) = \max_{a_{t+1}} Q^*(s_{t+1}, a_{t+1}) \qquad (4)$$

The Value Function is calculated in general using the following relationship:

$$Q(s_t, a_t) = \begin{cases} PI & \text{if the process goal is achieved} \\ -1 & \text{otherwise} \end{cases} \qquad (5)$$

where PI is the Performance Index (a function of the final conditions at time T). The penalty of -1 is a nominal value, and it may be appropriate to use other values in particular problems.

Since the main aim of the algorithm is defining the optimal actions which result in the optimal value function, Equation 5 could be rewritten as follows (since the goal is always achieved with an optimal policy (*), the Value Function never equals -1):

$$Q^*(s_t, a_t^*) = PI^* \qquad (6)$$

Equation 6 is true only when the RL algorithm converges to the actual optimal value function. During incremental learning of the optimal value function, differences occur which define the Bellman error. The mean squared Bellman error, EB, is then used in the approach to drive the learning process to the true optimal value function; Equation 7 defines EB for a given state-action pair (st, at):

$$E_B = E\left\{ \left[ \max_{a_{t+1}} Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right]^2 \right\} \qquad (7)$$
2.3.1 Implementation of Predictive Model
Martinez et al. [4] proposed the use of a
predictive model as part of the
Reinforcement Learning algorithm. The
most important requirement of a predictive
model is the ability to generalise properly
during learning. Otherwise the Value
Function may provide a poor control policy.
This problem is even worse for batch
processes showing significant batch-to-
batch variations in their behaviour (e.g.
polymerisation reactors). In general, the use
of pure inductive models (e.g. regression
models) is quite prone to poor predictions,
depending on the richness of experimental
conditions contained in the training data
set. The idea of hybrid modelling (Psichogios
and Ungar [18]; Thompson and Kramer
[19]; Tsen et al. [3]) gives a different approach to improving the prediction
capability of an inductive model. Two
important requirements for any hybrid
prediction model to be used in batch
process optimization are:
1. The strong support of prediction in
experimental data points.
2. The ability to exploit qualitative
information from an imperfect process model based on first principles or other
sources of knowledge.
Process models based on first principles,
although numerically imprecise, are able to
capture the qualitative trends of process
variables quite well. Martinez et al. [4]
recognized the potential of using hybrid
predictive models in Reinforcement
Learning, and adapted a hybrid scheme proposed by Tsen et al. [3]. The proposed predictive model makes use of an imperfect process model f_mod to introduce correction terms based on local process trends. The state change from s_t^a to s_{t+1}^a caused by action a_t^a is extrapolated from an experimental pair (s_t^e, a_t^e) as follows:

$$s_{t+1}^a = f_{ind}(s_t^e, a_t^e) + \left.\frac{\partial f_{mod}}{\partial s_t}\right|_{(s_t^e, a_t^e)} (s_t^a - s_t^e) + \left.\frac{\partial f_{mod}}{\partial a_t}\right|_{(s_t^e, a_t^e)} (a_t^a - a_t^e) \qquad (8)$$

where f_ind stands for the inductive model (local regression model). The main advantage of Equation 8 is that partial
derivatives with respect to states and
actions are calculated using first principles
incorporated into the imperfect process
model f_mod. Hence, for example, a process model based on nominal parameters could
be used to approximately calculate the
expected change in the Performance Index
when action at is taken and process state is
st + Δst in place of st. This would significantly
improve the prediction capability of the
regression model. If state-action data pairs
are assumed to be stratified with regard to
the stage-wise decision procedure, there
will exist several pairs with the same index
“t ” that are able to provide a prediction for
the state at “t +1”. Furthermore, the average
from extrapolations corresponding to all
those experimental pairs having the sameindex t is used. This average is defined by
weighting each individual extrapolation
(Equation 9) with a factor that measures the
proximity from the corresponding
experimental condition to (s_t^e, a_t^e). The next state prediction is then calculated using:

$$\hat{s}_{t+1} = \sum_{j \in \Omega_t} w_j \, \hat{s}_{t+1}^{\,(j)} \qquad (9)$$

where Ω_t is the set of all experimental pairs associated with the t-th decision stage, each extrapolation in the sum is obtained from Equation 8, and each w_j is a weighting factor defined as

$$w_j = \frac{1/d_j}{\sum_{k \in \Omega_t} 1/d_k}, \qquad d_j = \sqrt{\left(s_t - s_t^{e,j}\right)^2 + \left(a_t - a_t^{e,j}\right)^2} \qquad (10)$$

so that extrapolations from experimental conditions closer to (s_t, a_t) receive larger weights.
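To make Equations 8 to 10 concrete, the following Python sketch assembles a one-step-ahead prediction from a set of experimental pairs. It assumes a first-order correction form for Equation 8 and a normalized inverse-distance weight for Equation 10, and every name in it (hybrid_one_step_prediction, f_ind, dfmod_ds, dfmod_da) is a hypothetical placeholder rather than the authors' code.

import numpy as np

def hybrid_one_step_prediction(s, a, experimental_pairs, f_ind, dfmod_ds, dfmod_da):
    # experimental_pairs: list of (s_e, a_e) conditions recorded at this decision stage
    # (scalar state and action are assumed here purely to keep the sketch short).
    preds, weights = [], []
    for s_e, a_e in experimental_pairs:
        # Assumed form of Equation 8: inductive prediction at the experimental point
        # plus sensitivity corrections from the imperfect first-principles model f_mod.
        pred = (f_ind(s_e, a_e)
                + dfmod_ds(s_e, a_e) * (s - s_e)
                + dfmod_da(s_e, a_e) * (a - a_e))
        preds.append(pred)
        # Assumed form of Equation 10: weight by proximity to (s_e, a_e).
        dist = np.sqrt((s - s_e) ** 2 + (a - a_e) ** 2)
        weights.append(1.0 / (dist + 1e-8))
    weights = np.asarray(weights) / np.sum(weights)
    return float(np.dot(weights, preds))     # Equation 9: weighted average prediction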
2.3.2 Wire fitting function
Wire Fitting (Baird and Klopf [20]) is a
function approximation method, which is
specifically designed for self-learning control
problems where simultaneous learning and
fitting of a function takes place. It is
particularly useful to Reinforcement
Learning systems, due to the following
reasons: It provides a way for approximating
the Value Function; It allows the maximum
value function to be calculated quickly and
easily.
Wire Fitting is an efficient way of
representing a function because it fits
surfaces using wires as shown in Figure 3. In
addition, it is even more attractive to use in Reinforcement Learning algorithms because it reduces the computational requirements even further by providing the Value Function directly.
Figure 3: Wire Fitting using 3 wires
The function Q (s,a) is supported by 3 wires,
although any number of wires could be
specified. Taking a 2D slice of Figure 3
shows the 3 wires, represented by the 3
black circles, which are called support
points, as shown in Figure 4.

Figure 4: Two-dimensional slice of a Wire Fitting surface, showing the three support points (Q1, a1), (Q2, a2), (Q3, a3)
Baird and Klopf [20] suggested the use of
any general function approximation system
for the "Learned Function" shown in Figure
5, to learn the relation between different
states and control points (a1(s), Q1(s), a2(s), Q2(s), a3(s), Q3(s)).
Figure 5: Wire Fitting Architecture
The Interpolated Function is then defined by
a weighted-nearest-neighbour interpolation
of the control points as follows:
$$Q(s,a) = \frac{\displaystyle\sum_{i=1}^{m} \frac{Q_i(s)}{\left|a - a_i(s)\right|^2 + c\left[Q_{\max}(s) - Q_i(s)\right] + \varepsilon}}{\displaystyle\sum_{i=1}^{m} \frac{1}{\left|a - a_i(s)\right|^2 + c\left[Q_{\max}(s) - Q_i(s)\right] + \varepsilon}} \qquad (11)$$

where Q_max(s) is the largest of the Q_i(s), the constant c determines the amount of smoothing for the approximation, m defines the number of control wires, and ε is a small constant that avoids division by zero.
Baird and Klopf [20] then suggested the use
of arbitrary values for the control points at
the beginning of the learning process. This
results in a long duration for the learning
process, which is impractical in industrial
operations where quick achievement and optimization of process goals is required.
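A small Python sketch of the wire-fitting interpolation of Equation 11 is given below; the smoothing constant c, the epsilon term, and the example wire values are illustrative assumptions, not values from the paper.

import numpy as np

def wire_fit_q(a, wires, c=0.01, eps=1e-6):
    # wires: control points [(a_1, Q_1), ..., (a_m, Q_m)] produced by the learned
    # function (e.g. a neural network) for the current state; c controls smoothing.
    a_i = np.array([w[0] for w in wires])
    q_i = np.array([w[1] for w in wires])
    d = (a - a_i) ** 2 + c * (q_i.max() - q_i) + eps   # weighted-nearest-neighbour terms
    return float(np.sum(q_i / d) / np.sum(1.0 / d))

# Illustrative values only: with three wires, the maximum of Q(s, a) over a lies at
# the support point with the highest Q_i, so the optimal action is cheap to read off.
wires = [(0.2, 1.0), (0.8, 3.5), (1.4, 2.0)]
print(wire_fit_q(0.8, wires))   # approximately 3.5, the value at the best support point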
2.3.3 Calculations of Mean Squared
Bellman Error
To calculate the mean squared Bellman
error for states at this stage, both the values
of PI* and Q*(s, a*) are required (Equation
6) for each state at T-1 in the current
training data set. The value PI* represents
the optimal Value Function for a state at T-1
and is calculated as shown in Figure 6 using
the predictive model (PM1) and running an
optimization routine (given the constraints)
between the maximum and minimum
values of the action (a min to a max).
Figure 6: Calculation of the optimal value function (PI*) and optimal action (a*) given a state at T-1, where PM1 is the predictive model for states at stage T-1 to T. PM1 is used to search between the minimum and maximum values of the action (a_min to a_max).
The optimization results in identification of
the best action a* which would lead to the
optimal value function PI* for each state at
T-1. The next step is to calculate the Value
Function, Q*(s, a*), which represents the
current Wire Fitting approximation of PI*
for states at T-1. This is achieved by
supplying the optimal action (a*), together
with the current state, to the Wire Fitting approximation function as input (Figure 7).
Figure 7: Calculation of the optimal value function approximation, Q*(s, a*), using Wire Fitting (Equation 11) for state s and optimal action a* at T-1. The Neural Network supplies the control points (a_i(s), Q_i(s)), which the Interpolation Function evaluates at a*.
The calculation of the Value Functions Q1(s), Q2(s), and Q3(s) in Figure 7 is further demonstrated in Figure 8. Let us take a state:
1. The Neural Network will supply 3 actions (a1, a2, a3 shown in Figure 8).
2. The predictive model will supply 3 final states (ST1, ST2, ST3).
3. Consequently, given the 3 final states, the 3 value functions (Q1(s), Q2(s), and Q3(s)) can be calculated by evaluating the performance indices.
Figure 8: Calculation of the support points (Q1(s), Q2(s), and Q3(s)) for use in the Wire Fitting approximation of the optimal value function for states at T-1
A similar procedure is used to calculate the
mean squared Bellman error for all states
at stages T-2 and T-3.
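The procedure of Figures 6 to 8 for states at T-1 can be sketched as follows; a simple grid search stands in for whatever optimization routine the authors used, and all function names here are hypothetical placeholders.

import numpy as np

def stage_targets(states_T1, pm1, performance_index, a_min, a_max, n_grid=101):
    # Figure 6 idea: for each state at T-1, search the admissible actions with the
    # predictive model PM1 and keep the best action a* and optimal value PI*.
    actions = np.linspace(a_min, a_max, n_grid)       # grid search stand-in
    targets = []
    for s in states_T1:
        values = [performance_index(pm1(s, a)) for a in actions]
        i_best = int(np.argmax(values))
        targets.append((s, float(actions[i_best]), float(values[i_best])))
    return targets

def mean_squared_bellman_error(targets, q_approx):
    # Figure 7 idea: compare PI* with the current wire-fitting estimate Q(s, a*),
    # averaged over the states in the training data set (Equation 7 at stage T-1).
    return float(np.mean([(pi - q_approx(s, a)) ** 2 for s, a, pi in targets]))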
2.3.4 Back Propagation of Mean Squared
Bellman Error
After the calculation of the mean squared
Bellman errors, the weights and biases in
the Neural Network are modified as follows:
$$\Delta w = -\eta \, \frac{\partial E_B}{\partial w} \qquad (12)$$

Using the chain rule,

$$\frac{\partial E_B}{\partial w} = \frac{\partial E_B}{\partial Q} \cdot \frac{\partial Q}{\partial y} \cdot \frac{\partial y}{\partial w} \qquad (13)$$

where w denotes a weight (or bias) of the Neural Network, η is the learning rate, and y denotes the Neural Network outputs (the control points). Furthermore, the derivative of the Value Function with respect to the output of the Neural Network is obtained analytically by differentiation of Equation 11.
Back propagation of mean squared Bellman
error is performed initially for states at T-1
only. Once training is achieved for this
subset of state-action pairs, states at T-2 are
added to the training data set. This is then
finally followed by inclusion of the initial
state. During this stage-wise training
procedure, the back propagation of mean
squared Bellman error (EB) is continued until
it is below a certain tolerance for all the
states in the training data set. Furthermore,
each time a new experimental run is
available, the whole procedure is repeated.
MATLAB [21] is used to implement the
algorithm.
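The weight update of Equations 12 and 13 amounts to gradient descent on E_B. The sketch below illustrates one such step, substituting a central finite difference for the analytical chain-rule derivatives described above; the function name, learning rate and step size are illustrative assumptions.

import numpy as np

def bellman_gradient_step(weights, bellman_error_fn, lr=0.01, h=1e-5):
    # One step of w <- w - lr * dE_B/dw (Equation 12). The paper obtains the
    # derivatives via the chain rule (Equation 13) and analytical differentiation
    # of Equation 11; a central finite difference stands in for them here.
    w = np.asarray(weights, dtype=float).copy()
    grad = np.zeros_like(w)
    for i in range(w.size):
        w_plus, w_minus = w.copy(), w.copy()
        w_plus[i] += h
        w_minus[i] -= h
        grad[i] = (bellman_error_fn(w_plus) - bellman_error_fn(w_minus)) / (2 * h)
    return w - lr * grad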
2.3.5 On-line Application of RL Algorithm
Following the learning of the optimal value
function for the current training data set,
the RL algorithm calculates the optimal
actions for states at different time intervals
and the corresponding optimal value
function. Given a state at a certain stage,
the algorithm will perform a forward sweep
and select the path that will lead to the
highest resultant value function as shown in
Figures 9 to 11 (The black dots show
examples of the optimal route for sample
states at the different time intervals). It should be noted that the use of only the
Neural Network and Predictive Models
(PM1, PM2, and PM3) is required for on-line
applications.
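A minimal sketch of the on-line decision step is shown below. It implements a greedy one-step lookahead using the neural network's candidate actions and the stage's predictive model, which is a simplification of the forward sweep of Figures 9 to 11; all callables are hypothetical placeholders.

def online_action(state, pm_stage, nn_control_points):
    # Candidate actions and their wire values for the current state, from the
    # trained neural network (placeholder callable returning [(a_i, Q_i), ...]).
    wires = nn_control_points(state)

    def successor_value(a):
        s_next = pm_stage(state, a)                            # one-step-ahead prediction
        return max(q for _, q in nn_control_points(s_next))    # best value at the successor

    # Greedy one-step lookahead: keep the candidate whose predicted successor
    # state has the highest (wire-fitted) value.
    return max((a for a, _ in wires), key=successor_value)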
Figure 9: Search for optimal path for states
at T-1
Figure 10: Search for optimal path for
states at T-2
Figure 11: Search for optimal path for
states at T-3
3. Case Study
The case study considered by Martinez et al.
[4] is a semi-batch reactor. The main
product is formed according to an
autocatalytic reaction scheme, with a
kinetic mechanism for the main reaction as follows:

A + B → 2B   (14)

Meanwhile, a slower irreversible degradation of the product simultaneously takes place as follows:

B → C   (15)

The concentration of Product (B) can be
measured fast enough and hence is used for
the on-line optimization. As for the
assessment of the impurity level, it is
analyzed in the final product only, because it is costly and time-consuming. The quality of the product is on-specification if less than 2% of B is lost to impurity. Otherwise, it is a
wasted batch and a penalty of -1 is applied.
In addition, there is a preference for
achieving the maximum possible conversion of A.
In order to guarantee the quality of the
batch, the feed flow rate is changed
according to intermediate measurement
samples of product B formation. Three
samples corresponding to accumulated
total liquid volumes of V=0.2 Vf , V=0.4 Vf ,
and V=0.6 Vf are taken to measure the
concentration of B. The result of each
sample is only available with a delay of 30
minutes. Other relevant data for the case
study is provided in Table 1. The process
goal is then defined as achieving the
product within specifications in less than 5
hours, with a conversion higher than 90%
for the reactant A fed. Once the goal is
achieved, the Performance Index (PI) is
defined to have 3 units for each additional
percent conversion over 90% obtained and
1 unit for each hour reduction with respect
to the maximum reaction time. Hence, if an
on-specification product is obtained in 3
hours and with 92 % of reactant conversion,
then the goal is achieved and PI=8.
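As a check on the definition, the worked figure quoted above follows directly from the two bonus terms:

$$PI = 3 \times (92 - 90) + 1 \times (5 - 3) = 6 + 2 = 8$$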
Table 1: Semi-batch reactor case study

Initial conditions: Vol. = 0.5 m3; CA = 1.92 kmol m-3; CB = 0.550 kmol m-3
Reactor feed: CA = 1.42 kmol m-3; CB = 0.75 kmol m-3
Kinetic parameters: k1 = k11 exp(-E1/T) [m6 h-1 kmol-2]; k2 = k22 exp(-E2/T) [h-1]
Nominal parameters: k1 = 10 exp(-1,000/T) [m6 h-1 kmol-2]; k2 = 2.2×10^15 exp(-13,400/T) [h-1]
Operating constraints: Feed rate ≤ 1.5 m3 h-1; Temp. = 50 °C; Max. volume = 5 m3
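For orientation only, the following Python sketch integrates plausible component balances for the semi-batch reactor using the nominal parameters of Table 1. The rate expressions (r1 = k1·CA·CB², r2 = k2·CB), the negative Arrhenius exponents, the constant feed policy, and operation at a constant 50 °C are assumptions for illustration, not the authors' model.

import numpy as np

def simulate_semibatch(feed_rate=1.0, t_end=5.0, dt=1e-3, T=323.15):
    # Nominal parameters from Table 1; negative Arrhenius exponents are assumed.
    k1 = 10.0 * np.exp(-1000.0 / T)           # [m^6 h^-1 kmol^-2]
    k2 = 2.2e15 * np.exp(-13400.0 / T)        # [h^-1]
    V, CA, CB, CC = 0.5, 1.92, 0.550, 0.0     # initial charge; impurity assumed zero
    CA_f, CB_f = 1.42, 0.75                   # feed concentrations
    for _ in range(int(t_end / dt)):
        F = feed_rate if V < 5.0 else 0.0     # respect the maximum-volume constraint
        r1 = k1 * CA * CB ** 2                # assumed autocatalytic rate form
        r2 = k2 * CB                          # assumed first-order degradation
        # Fed-batch component balances, explicit Euler step:
        dCA = F / V * (CA_f - CA) - r1
        dCB = F / V * (CB_f - CB) + r1 - r2
        dCC = -F / V * CC + r2
        CA, CB, CC, V = CA + dCA * dt, CB + dCB * dt, CC + dCC * dt, V + F * dt
    return V, CA, CB, CC

print(simulate_semibatch())   # volume and concentrations after 5 h of operation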
4. Results and Discussion
Using the same initial training data set of
six batch runs used by Martinez et al. [4],
the RL algorithm was executed for an
additional 26 batch runs. The results
obtained from implementing the predictive
models are shown in Figure 12, together
with the results obtained by Martinez et al.
[4]. Approximately the same final PI value is achieved by both RL applications, with similar trends, though the RL algorithm developed here manages to converge more steadily.
The initial training data set used by
Martinez et al. [4] contained a very good
batch run with a Performance Index equal
to 15.09. Hence it could be argued that the
RL algorithm was not actually incrementally
improving (or learning), but was rather
interpolating between previous good batch
runs. So the best batch run in the training
data set was replaced with another batch
run, leading to an initial training data set with a maximum Performance Index value equal to 9.15.
The implementation of the RL algorithm was then repeated for the new training data set. The RL algorithm now manages to achieve a much higher value of the Performance Index, 11.36, than the previous best batch run in the training data set (Figure 13). This is significant in that it demonstrates the capability of the RL algorithm to improve beyond the level observed in previous batches.

Figure 12: Validation of the RL methodology (Performance Index against number of batches: results produced by Martinez et al. [4] compared with the results obtained from implementing the RL methodology)

Figure 13: Performance Index as a function of the number of additional batch runs added to the initial training data set (best batch run in the initial training data set results in a PI equal to 11.36)
5. Conclusion
Reinforcement Learning provides a new
approach towards the full automation of
stochastic discrete systems such as batch
chemical processes. Those systems provide
clear challenges due to their inherent transient behavior and the unavailability of (and delays in) online measurements, which degrade the performance of any process control system.
This work presents an in-depth description
of RL for direct application of the algorithm.
Furthermore, a MATLAB implementation of
the RL approach was validated against a
published case study [4]. The results have shown good agreement with the literature, and provide further assurance that performance can improve beyond the best run already present in the initial training data set.
References
1. D. Bonvin, Optimal Operation of Batch
Reactors: A Personal View, J. Process
Control. 8 (1998) 355-368.
2. P. Terwiesch, M. Agarwal and D.W.T. Rippin, Batch Unit Optimization with Imperfect Modeling: a Survey, J. Process Control. 4 (1994) 238-258.
3. A.Y. Tsen, S.S. Jhang, D.S. Wong, and B.
Joseph, Predictive Control of Quality in
Batch Polymerization using ANN Models,
AIChE Journal. 42 (1996) 455-465.
4. E.C. Martinez, J.A. Wilson and M.A.
Mustafa, An Incremental Learning
Approach to Batch Unit Optimisation,
The 1998 IChemE Research Event,
Newcastle, UK.
5. E.C. Martinez, R.A. Pulley and J.A.
Wilson, Learning to Control the
Performance of Batch Processes, Chem.
Eng. Res. Des., 76 (1998) 711-722.
6.
E.C. Martinez and J.A. Wilson, A Hybrid
Neural Network First Principles Approach
to Batch Unit Optimisation, Comput.
Chem. Eng., Suppl. 22 (1998) S893-S896.
7. R.S. Sutton and A.G. Barto,
Reinforcement Learning: An
Introduction, The MIT Press, Cambridge,
Massachusetts, London, England, 1998.
8. E.L. Thorndike, Animal Intelligence, Hafner, Darien, Conn., 1911.
9.
R. Bellman, Dynamic Programming, Princeton University Press, Princeton, New Jersey, 1957.
10. R.E. Bellman, A Markov Decision Process,
J. Math. Mech. 6 (1957) 679-684.
11. R. Howard, Dynamic Programming and
Markov Processes, MIT Press,
Cambridge, MA, USA, 1960.
12. D.P. Bertsekas, Dynamic Programming
and Optimal Control, Athena Scientific,
Belmont, MA, USA, 1995.
13.
M.L. Puterman, Markov Decision
Problems, Wiley, New York, USA, 1994.
14. S. Ross, Introduction to Stochastic
Dynamic Programming, Academic Press,
New York, USA, 1983.
15. A.E. Bryson, Optimal control-1950 to
1985, IEEE Control Systems Magazine. 16
(1996) 26-33.
16.
C.J.C.H. Watkins, Learning from Delayed
Rewards, PhD thesis, Cambridge
University, UK, 1989.
17.
P.J. Werbos, Building and
Understanding Adaptive Systems: A
Statistical/Numerical Approach to
Factory Automation and Brain Research,
IEEE Transactions on Systems, Man, and
Cybernetics. 17 (1987) 7-20.
18. D.C. Psichogios and L.H. Ungar, A Hybrid
Neural Network First Principles Approach
to Process Modelling, AIChE Journal. 38
(1992) 1499-1511.
19.
M.L. Thompson and M.A. Kramer,
Modeling Chemical Processes using Prior
Knowledge and Neural Networks, AIChE
Journal. 40 (1994) 1328-1340.
20. L.C. Baird and A.H. Klopf, Reinforcement Learning with High-dimensional Continuous Actions, Technical Report WL-TR-93-1147, Wright Laboratory, Wright-Patterson Air Force Base, 1993.
21. MATLAB: Version 4 User's Guide. The
Math Works Inc, Natick, Massachusetts,
USA, 1995.