
ASIA-PACIFIC JOURNAL OF CHEMICAL ENGINEERING
Asia-Pac. J. Chem. Eng. 2011; 6: 138–146
Published online 23 August 2010 in Wiley Online Library (wileyonlinelibrary.com). DOI: 10.1002/apj.502

Special Theme Research Article

Reinforcement learning framework for adaptive control of nonlinear chemical processes

Hitesh Shah* and Madan Gopal

Electrical Engineering Department, IIT-Delhi, New Delhi-110016, India

Received 1 September 2009; Revised 23 June 2010; Accepted 6 July 2010

ABSTRACT: This article addresses the problem of adaptive control of nonlinear chemical processes with time-varying dynamics. The Q-learning and policy-iteration algorithms have been considered in the reinforcement learning (RL) framework. The performance of the two algorithms has been tested on a highly nonlinear simulated continuous stirred tank reactor (CSTR). Comparison with conventional methods shows that the RL techniques are able to achieve better performance and robustness against uncertainties. The policy-iteration algorithm is relatively faster in convergence and better in robustness. © 2010 Curtin University of Technology and John Wiley & Sons, Ltd.

KEYWORDS: reinforcement learning control; Q-learning; policy iteration; CSTR

INTRODUCTION

Model predictive control (MPC) is the most popular advanced control technique in the process industry.[1]

The MPC approach determines a sequence of actions based on predictions using the system model that maximizes the performance (reward) of the system in terms of the desired behavior. A model is, however, always an approximation of the system under consideration. Predictions about the behavior of the system become increasingly inaccurate the further they extend into the future. To deal with this, MPC techniques use a rolling horizon to increase robustness. The rolling-horizon principle consists of synchronizing the state of the model with the state of the true system at every decision step. At every decision step, the MPC agent observes the state of the true system, synchronizes its estimate of the system state with this observation, and tries to find the best sequence of actions given the updated state. Typically, the agent only executes the first action of this sequence. It then observes the system state again and finds a new sequence of actions.

Because of the rolling horizon, the MPC agent has to find a sequence of actions at each decision step. This can be an intractable procedure when the horizon over which the agent has to determine actions is infinite. A finite horizon is therefore assumed. However, because of the limited horizon over which actions are considered, the resulting policy may be suboptimal. The smaller the control horizon used to reduce on-line computations, the more suboptimal the resulting solution may become. The MPC algorithm therefore might suffer from the dilemma of very high computational requirements vs suboptimality.

*Correspondence to: Hitesh Shah, Electrical Engineering Department, IIT-Delhi, New Delhi-110016, India. E-mail: [email protected]

The reinforcement learning (RL) framework[2] is a means to deal with the issues arising in conventional MPC – computational requirements and suboptimality of actions. In RL, experience is built up over time through interaction with the system the agent has to control, rather than assumed available a priori (as a system model). The experience is based on performance indicators that give information about how good a certain action was in a certain state of the system. The experience is also based on the state transitions of the system under the actions taken. The performance function (value function) is approximated by keeping track of the performance obtained at each decision step, considering the system state, the performed action, and the resulting system state. At each decision step, the value function of the previous decision step is updated with the experience built up over that previous decision step. By accumulating sufficient experience, the agent may accurately estimate the true value function.

Thus, once the value functions are known well, an RL problem reduces to an MPC problem with a control horizon of only length 1. At the same time, decisions are based on infinite-horizon information. This takes care of both the issues associated with conventional MPC.

In this article, the problem of MPC of chemical processes is considered in an RL framework.


The section Control and Optimization Method gives an overview of the RL framework, the RL algorithms (value and policy iteration), and model learning for the policy-iteration algorithm. The section Model and Data for Simulation gives details of the continuous stirred tank reactor (CSTR) model and the data for simulation. The section Simulation Results and Discussions compares and discusses the empirical performance on the basis of simulation results. Conclusions are presented in the final section.

CONTROL AND OPTIMIZATION METHOD

Learning framework

RL is a kind of simulation-based optimization technique, which can be used to solve sequential decision-making problems that can be modeled as finite Markov decision processes (MDPs).[3] An MDP is defined as a 6-tuple M = ⟨S, A, P, R, D, γ⟩, where S = {s_1, s_2, ..., s_n} and A = {a_1, a_2, ..., a_m} are finite sets of states and actions, respectively. The agent starts in some initial state s_0 ∈ S generated from an initial state distribution D. At each time step t, it perceives the current state s_t ∈ S of the uncertain stochastic environment, takes an action a_t ∈ A, receives an immediate reward (reinforcement) signal r_t ∈ R, and reaches the next state s_{t+1} ∈ S according to the transition probability P^{a_t}_{s_t s_{t+1}} = Pr(s_{t+1} | s_t, a_t); it then repeats the same process from state s_{t+1}. The goal of the agent is to select actions sequentially to maximize the expected sum of discounted return:

R_0 = E[ ∑_{t=0}^{∞} γ^t r_t ]    (1)

where the discount factor γ ∈ [0, 1) has the effect of valuing future rewards less than the current reward. In this infinite-horizon formulation, there is no reason to behave differently in the same state at different times. Hence, the optimal action depends only on the current state, and the optimal policy is stationary. For an agent in a stationary environment, there is at least one optimal policy that is stationary and deterministic. An RL agent tries to learn this optimal policy, not necessarily unique, which maximizes the expected, total, discounted return for any initial state. It is, therefore, sufficient to restrict learning the optimal policy only within the space of deterministic policies.
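To make Eqn (1) concrete, the short sketch below (Python, not from the paper) accumulates the discounted return of a finite reward trajectory; the discount factor of 0.8 matches the value used later in the simulation study, and the reward sequence is purely illustrative.

```python
def discounted_return(rewards, gamma=0.8):
    """Finite-horizon estimate of Eqn (1): sum over t of gamma^t * r_t."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Illustrative reward trajectory: later rewards are worth less than earlier ones.
print(discounted_return([0, 0, 10, 10]))   # 0.8**2 * 10 + 0.8**3 * 10 = 11.52
```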

A deterministic policy π for an MDP is the set of state–action rules adopted by the agent to select actions at each state: π : S → A. For a given policy π, we define V^π(s), the state-value function, as the expected, discounted, total return received by following policy π from the state s:

V^π(s) = E_π[ ∑_{t=0}^{∞} γ^t r_t | s_0 = s ]    (2)

where the expectation is taken over all possible state–action sequences generated by following π. The policy value V^π(s_0) is a measure of a policy's performance, defined as the expected return obtained by following π from the start state s_0. Similarly, the action-value function Q^π(s, a) is defined over all possible combinations of states and actions and indicates the expected, discounted, total return received by taking action a in the state s and following policy π thereafter. The exact Q-values for all state–action pairs can be found by solving the linear system of Bellman equations:

Q^π(s, a) = R(s, a) + γ ∑_{s′ ∈ S} P(s, a, s′) Q^π(s′, π(s′))    (3)

where P(s, a, s′) is the probability of making a transition to state s′ when taking action a in state s; R(s, a, s′) is the reward for that transition; and R(s, a) = ∑_{s′ ∈ S} P(s, a, s′) R(s, a, s′).
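Because Eqn (3) is linear in the |S|·|A| unknowns Q^π(s, a), it can be solved exactly for small, discrete MDPs. The sketch below is a minimal NumPy illustration of that solve, assuming the model is given as dense arrays P[s, a, s′] and R[s, a, s′]; the function name and array layout are ours, not the paper's.

```python
import numpy as np

def evaluate_policy(P, R, policy, gamma):
    """Solve the linear Bellman system of Eqn (3) exactly for Q^pi.

    P, R   : arrays of shape (nS, nA, nS) with transition probabilities and rewards.
    policy : integer array of length nS, policy[s] = action chosen in state s.
    """
    nS, nA, _ = P.shape
    R_sa = (P * R).sum(axis=2)            # R(s, a) = sum_s' P(s, a, s') R(s, a, s')
    # Build the (nS*nA) x (nS*nA) system  Q = R_sa + gamma * P_pi @ Q
    P_pi = np.zeros((nS * nA, nS * nA))
    for s in range(nS):
        for a in range(nA):
            for s2 in range(nS):
                P_pi[s * nA + a, s2 * nA + policy[s2]] = P[s, a, s2]
    Q = np.linalg.solve(np.eye(nS * nA) - gamma * P_pi, R_sa.reshape(-1))
    return Q.reshape(nS, nA)
```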

The state-value function V*(s) and action-value function Q*(s, a) of an optimal policy, π*, are the fixed points of the nonlinear Bellman optimality equations:

V*(s) = max_a ∑_{s′} P(s, a, s′) [ R(s, a, s′) + γ V*(s′) ]    (4)

Q*(s, a) = ∑_{s′} P(s, a, s′) [ R(s, a, s′) + γ max_{a′} Q*(s′, a′) ]    (5)

Given a complete and accurate model of an MDP in the form of knowledge of the state transition probabilities P(s, a, s′) and immediate rewards R(s, a, s′) for all states s and all actions a ∈ A(s), it is possible – at least in principle – to solve the decision problem by applying one of the RL algorithms: value iteration or policy iteration.
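As an illustration of solving the decision problem when the model is known, the sketch below iterates the optimality equation (5) over a tabular Q array until it stops changing (value iteration). The array layout follows the hypothetical P[s, a, s′]/R[s, a, s′] convention used in the policy-evaluation sketch above.

```python
import numpy as np

def value_iteration(P, R, gamma, tol=1e-8):
    """Approximate Q*(s, a) by repeatedly applying Eqn (5) to a tabular Q."""
    nS, nA, _ = P.shape
    Q = np.zeros((nS, nA))
    while True:
        V = Q.max(axis=1)                                      # V*(s) = max_a Q*(s, a), Eqn (4)
        Q_new = (P * (R + gamma * V[None, None, :])).sum(axis=2)
        if np.abs(Q_new - Q).max() < tol:
            return Q_new                                       # greedy policy: Q_new.argmax(axis=1)
        Q = Q_new
```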

Value iteration

Value iteration is a method of approximating the Q*(s, a) values arbitrarily closely by iterating Eqns (4) and (5). The optimal policy can be constructed simply by finding the maximizing action in each state, π*(s) = arg max_a Q*(s, a). This does, however, require that a model be given (or learned). Watkins[4] proposed a procedure called Q-learning to iteratively update Q-values that does not require a system model; it is given by

Q(s, a) ← (1 − α) Q(s, a) + α [ R + γ max_{a′} Q(s′, a′) ]    (6)

where α ∈ (0, 1] is the learning rate. Q-learning guarantees convergence to the optimal Q-values, Q*, as long as every state–action pair is visited infinitely often and the learning-rate parameter 0 ≤ α ≤ 1 is reduced to a small value at a suitable rate.[4]

The guaranteed convergence of value iteration to the optimal Q-values relies heavily upon a tabular (exact) representation of the value function. Tabular representation is impractical for large state and action spaces. In such cases, the tabular representation of the real-valued function Q(s, a) is replaced by a function approximator Q̂(s, a; w), where w are the adjustable parameters of the approximator. The crucial factors for a successful approximate Q-learning algorithm are the choice of the function approximation architecture and the choice of the parameter adjustment method; off-the-shelf solutions are available.
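A minimal tabular Q-learning sketch corresponding to Eqn (6) is given below (Python, ours). The default α = 0.2 and γ = 0.8 are the values quoted later in the Data for Simulation section; the ε-greedy helper is one common way to realize the "visit every state–action pair" requirement and is an assumption, not something specified here.

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.2, gamma=0.8):
    """One tabular Q-learning update, Eqn (6)."""
    target = r + gamma * Q[s_next].max()
    Q[s, a] = (1.0 - alpha) * Q[s, a] + alpha * target
    return Q

def epsilon_greedy(Q, s, epsilon, rng=None):
    """Exploratory action selection used while learning."""
    if rng is None:
        rng = np.random.default_rng()
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))   # random action
    return int(Q[s].argmax())                  # greedy action
```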

Policy iteration

Policy iteration[5] is another method of discovering an optimal policy for any given MDP. Policy iteration is an iterative procedure in the space of deterministic policies; it discovers the optimal policy by generating a sequence of monotonically improving policies. A policy-iteration algorithm operates by alternating between two phases: policy evaluation computes the action-value function Q^{π_t}(s, a) of the current policy π_t by solving the Bellman equations (3), and policy improvement defines the next improved greedy policy π_{t+1} over Q^{π_t}(s, a) as

π_{t+1}(s) = arg max_{a ∈ A} Q^{π_t}(s, a)    (7)

The greedy policy π_{t+1} is a deterministic policy which is at least as good as π_t, if not better. This two-step iteration (policy evaluation and policy improvement) is repeated until there is no change between the policies π_t and π_{t+1}; the iteration has then converged to the optimal policy. Surprisingly, convergence is obtained within a small number of iterations.
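A tabular policy-iteration loop corresponding to Eqns (3) and (7) is sketched below; it reuses the hypothetical evaluate_policy helper from the earlier policy-evaluation sketch and is not the paper's implementation.

```python
import numpy as np

def policy_iteration(P, R, gamma):
    """Alternate exact policy evaluation (Eqn (3)) and greedy improvement (Eqn (7))."""
    nS, nA, _ = P.shape
    policy = np.zeros(nS, dtype=int)              # arbitrary initial deterministic policy
    while True:
        Q = evaluate_policy(P, R, policy, gamma)  # policy evaluation (see earlier sketch)
        new_policy = Q.argmax(axis=1)             # policy improvement, Eqn (7)
        if np.array_equal(new_policy, policy):    # no change: converged to the optimal policy
            return policy, Q
        policy = new_policy
```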

The guaranteed convergence of policy iteration to the optimal policy relies heavily upon tabular representations of the value function and each policy, and exact solution of the Bellman equations (3). Since most real-world applications have large or continuous state spaces, tabular representation is impractical. Approximation in the policy-iteration framework is introduced as follows:

• The tabular representation of the value function Q^π(s, a) is replaced by the function approximator Q̂^π(s, a; w), where w are the adjustable parameters of the approximator. Off-the-shelf solutions to this approximation problem are available.

• The tabular representation of the policy π(s) is replaced by π̂(s; θ), where θ are the adjustable parameters of the representation. Off-the-shelf solutions are not available for the policy approximation architecture, and the parameter adjustment mechanism has to be integrated into the policy-iteration framework.

The solution of the Bellman equations (3) requires the model of the MDP. A model learning algorithm appears in the next section.

Model learning for policy iteration

We propose to learn a model of the MDP while obtaining on-line experience, and then use this model to facilitate policy iteration. In this on-line approach, the agent's interactions with the environment are recorded, and a mapping is learned from these actual state transitions. The learning procedure given below directly follows from Sharma and Gopal.[6]

Model learning algorithm

Starting with an arbitrary policy, we create a trajectory through the state space as per the current policy π_l (l is the trajectory index), starting from the state–action pair (s_0, π_l(s_0)), where s_0 is the randomly chosen starting state and π_l(s_0) is the action specified at state s_0 by π_l.

At the start of each episode, we initialize the following:

One-step rewards: R^{π_l} = 0    (8a)

Transition probabilities: P^{π_l}(s, a, s′) = 0  {MDP model parameters for policy π_l}    (8b)

Values: V^{π_l}(s, a) = 0    (9)

As we start the interaction process, state–action pairs (s_k, a_k) are visited, and we update the system model parameters and values as follows:

1. V^{π_l} updates: We use the Temporal Difference (TD) learning algorithm adapted to optimistic policy iteration: when the kth step in a trajectory l has been simulated, i.e. from (s_k, π_l(s_k)) to (s_{k+1}, π_l(s_{k+1})), we get a temporal difference

δ_k = R^{π_l}(s_k, π_l(s_k)) + γ V^{π_l}(s_{k+1}, π_l(s_{k+1})) − V^{π_l}(s_k, π_l(s_k))    (10)

Then we iteratively update the V^{π_l} values as follows:

V^{π_l}_{k+1}(s_k, π_l(s_k)) ← V^{π_l}_{k}(s_k, π_l(s_k)) + ψ_k(s_k, π_l(s_k)) δ_k    (11)

for all (s_k, π_l(s_k)) pairs on the current trajectory, where ψ_k(s_k, π_l(s_k)) = 1 / n_k(s_k, π_l(s_k)) is the step-size coefficient for the pair (s_k, π_l(s_k)), and n_k(s_k, π_l(s_k)) is the number of visits to the pair (s_k, π_l(s_k)) on the trajectory generated as per policy π_l.

2. One-step reward updates:

R^{π_l}_{k+1}(s_k, π_l(s_k)) = R^{π_l}_{k}(s_k, π_l(s_k)) + ψ_k(s_k, π_l(s_k)) [ R(s_k, π_l(s_k)) − R^{π_l}_{k}(s_k, π_l(s_k)) ]    (12)

where R(s_k, π_l(s_k)) is the random one-step reward.

3. P^π updates: Let n^{π_l}_{α} be the number of visits to α = (s_k, a_k) (where a_k = π_l(s_k)) by the trajectory generated under policy π_l, and n^{π_l}_{αβ} the number of transitions from α = (s_k, a_k) to β = (s_{k+1}, a_{k+1}); then the transition probability is defined as

P^{π_l}_{αβ} ≅ n^{π_l}_{αβ} / n^{π_l}_{α} ≅ n^{a}_{αβ} / n^{a}_{α}    (13)

Next we show that the probability as defined above reduces to that defined for a transition from state s_k to s_{k+1} under action a_k = π_l(s_k), if the policy is held fixed. Under a fixed policy π_l, transitions take place as follows: we move from state s_k to (s_k, a_k) deterministically. Then the next state s_{k+1} is reached with probability P(s_k, a_k, s_{k+1}), and a reward R(s_k, a_k) is obtained. Now if the policy is kept fixed, we move deterministically to (s_{k+1}, a_{k+1}). Thus for a fixed policy, s_k and (s_k, a_k) coincide, and similarly s_{k+1} and (s_{k+1}, a_{k+1}) are the same. We have

P(s_k, a_k, s_{k+1}) ≅ n^{a_k}_{s_k s_{k+1}} / n^{a_k}_{s_k}, which is the same as P^{π_l}_{αβ} ≅ n^{π_l}_{αβ} / n^{π_l}_{α}    (14)

where n^{a_k}_{s_k s_{k+1}} is the number of transitions from s_k to s_{k+1} under a_k = π_l(s_k), and n^{a_k}_{s_k} is the number of visits to state s_k under action a_k = π_l(s_k).

This empirically generated estimate (Eqn (14)) of the true relative frequency of transition will become more accurate with the generation of more trajectories.

A trajectory generated as per a particular policy would provide us with the probability distribution for only a single action in each state. We, therefore, need to include different actions at each state by using some random steps in the initial phase of the procedure, or by using a pseudostochastic policy. This facilitates some amount of exploration in the search for an optimal policy. By intelligently exploring the action space, we can eventually generate enough information to obtain the true probability distribution for each state–action pair in an iterative manner.
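The sketch below is one possible dictionary-based realization of the updates (10)–(14): it keeps visit counts, transition counts, running one-step reward averages, and TD-updated values, all keyed by state–action pairs. Because the counts are simply accumulated across trajectories, the empirical transition probability it returns already aggregates experience in the sense of Eqn (15) in the next subsection. The class name and interface are ours, not the authors'.

```python
from collections import defaultdict

class ModelLearner:
    """Incremental model and value estimates of Eqns (10)-(14), keyed by (state, action)."""

    def __init__(self, gamma=0.8):
        self.gamma = gamma
        self.n_visit = defaultdict(int)      # n(s, a): visits to a state-action pair
        self.n_trans = defaultdict(int)      # n((s, a) -> (s', a')): transition counts
        self.R_hat = defaultdict(float)      # running average of the one-step reward
        self.V = defaultdict(float)          # V(s, a) values

    def update(self, s, a, r, s_next, a_next):
        """Process one simulated step (s, a) -> (s_next, a_next) with observed reward r."""
        sa, sa_next = (s, a), (s_next, a_next)
        self.n_visit[sa] += 1
        self.n_trans[(sa, sa_next)] += 1
        psi = 1.0 / self.n_visit[sa]                            # step size 1/n_k
        self.R_hat[sa] += psi * (r - self.R_hat[sa])            # Eqn (12)
        delta = (self.R_hat[sa] + self.gamma * self.V[sa_next]
                 - self.V[sa])                                  # Eqn (10)
        self.V[sa] += psi * delta                               # Eqn (11)

    def P_hat(self, sa, sa_next):
        """Empirical transition probability of Eqns (13)-(14)."""
        n = self.n_visit[sa]
        return self.n_trans[(sa, sa_next)] / n if n else 0.0
```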

Aggregating information from successive trajectories

We combine the estimates produced by successive policies/trajectories by forming a weighted sum of the contributions of each policy, as per the amount of experience provided by a policy at a state–action pair. Let σ and ξ be two policies and a(σ), a(ξ) be the action a as applied within the policies σ and ξ, respectively. Then

P^{a(σ)}_{αβ} ≅ n^{σ}_{αβ} / n^{σ}_{α}  and  P^{a(ξ)}_{αβ} ≅ n^{ξ}_{αβ} / n^{ξ}_{α};  P^{a(aggregated)}_{αβ} = (n^{σ}_{αβ} + n^{ξ}_{αβ}) / (n^{σ}_{α} + n^{ξ}_{α})    (15)

The data from the current trajectory is combined with the data from all previous trajectories. By forming such a weighted sum, we would be able to capture the system behavior to a greater extent. This combined information is used in the policy update step to generate an improved policy.
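A small numerical illustration of the weighting in Eqn (15), with made-up counts for two trajectories σ and ξ:

```python
# Hypothetical counts for one (alpha, beta) pair from two trajectories
n_sigma_ab, n_sigma_a = 3, 10        # under policy sigma: 3 of 10 visits went to beta
n_xi_ab, n_xi_a = 1, 5               # under policy xi: 1 of 5 visits went to beta

P_sigma = n_sigma_ab / n_sigma_a                                # 0.30
P_xi = n_xi_ab / n_xi_a                                         # 0.20
P_aggregated = (n_sigma_ab + n_xi_ab) / (n_sigma_a + n_xi_a)    # 4/15 ~ 0.267
# The aggregate leans toward the sigma estimate because sigma provided more experience.
```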

Policy update step

In our approach, we generate V^{π_l}(s_k, π_l(s_k)) values from a created trajectory for the visited (s_k, π_l(s_k)) pairs. We then form a weighted sum of these samples from successive trajectories to generate V(s_k, π_l(s_k)) values, which are policy independent. We then perform a policy update based on these V(s_k, π_l(s_k)) values. Thus, the policy update does not solely depend on data generated by the policy being followed during the current trajectory.

Q(s_k, a_k) = ∑_{s_{k+1}} P(s_k, a_k, s_{k+1}) [ R(s_k, a_k, s_{k+1}) + γ V(s_{k+1}, a_{k+1}) ]    (16)

This value is calculated for all the state–action pairs (s_k, a_k) that are visited during the execution of the current trajectory as per π_l. Finally, the policy is updated for all the visited states as follows:

π_{t+1}(s_k) = arg max_{a_k ∈ A(s_k)} Q(s_k, a_k)    (17)
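A sketch of the policy update of Eqns (16) and (17), written against the hypothetical ModelLearner introduced above. Since that learner stores the expected one-step reward R̂(s, a) rather than R(s, a, s′), Eqn (16) is used here in the equivalent form Q(s, a) = R̂(s, a) + γ ∑_{s′} P̂ V(s′, π(s′)).

```python
def policy_update(learner, states, actions, policy, gamma=0.8):
    """Greedy policy update of Eqns (16)-(17) from the learned model estimates."""
    new_policy = dict(policy)
    for s in states:
        q = {}
        for a in actions:
            sa = (s, a)
            q[a] = learner.R_hat[sa] + gamma * sum(
                learner.P_hat(sa, (s2, policy[s2])) * learner.V[(s2, policy[s2])]
                for s2 in states
            )                                          # Eqn (16)
        new_policy[s] = max(q, key=q.get)              # Eqn (17)
    return new_policy
```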


Figure 1. Policy iteration and model learning. (Block diagram: the plant supplies states s and rewards R; the critic performs policy evaluation and the actor performs policy improvement, outputting actions a; a cost-to-go generator and a model generator supply the value estimates V and the model parameters P, R.)

The model learning algorithm is summarized as follows (Fig. 1); a minimal code sketch of this loop is given after the list:

1. Define a model for the problem with discretized states.

2. Choose a starting policy (may be random or based on some heuristics).

3. Choose an arbitrary starting state, and run the plant/system under control of the current policy, generating a trajectory through state space. The trajectory provides estimates of transition probabilities, one-step rewards, and values for the state–action pairs visited.

4. Form an approximated MDP model by suitably incorporating the new information obtained from this trajectory with the earlier information from previous trajectories.

5. Update the policy on the basis of the current approximated MDP model parameters and value estimates.

6. If the policy update is optimal (as per the stopping criteria), terminate; otherwise, go to step 3.
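The following sketch ties steps 1–6 together as a single loop, using the hypothetical ModelLearner and policy_update helpers from the previous section and assuming an environment object with reset() and step(action) methods that return the discretized state (and reward). It is illustrative only; the episode length and number of trajectories are placeholders, not the paper's settings.

```python
import random

def model_based_policy_iteration(env, states, actions, n_trajectories=20, horizon=800):
    """Outer loop of the summarized algorithm (steps 1-6)."""
    policy = {s: random.choice(actions) for s in states}   # step 2: arbitrary starting policy
    learner = ModelLearner(gamma=0.8)                      # step 1: model over discretized states
    for _ in range(n_trajectories):
        s = env.reset()                                    # step 3: arbitrary starting state
        a = policy[s]
        for _ in range(horizon):                           # generate a trajectory under the policy
            s_next, r = env.step(a)
            a_next = policy[s_next]
            learner.update(s, a, r, s_next, a_next)        # step 4: fold new data into the model
            s, a = s_next, a_next
        new_policy = policy_update(learner, states, actions, policy)   # step 5
        if new_policy == policy:                           # step 6: stopping criterion
            return policy
        policy = new_policy
    return policy
```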

MODEL AND DATA FOR SIMULATION

A case study on a highly nonlinear simulated CSTR was selected to test the performance of the RL framework for adaptive nonlinear control of a chemical process.

The CSTR consists of an irreversible exothermic reaction, A → B, in a constant-volume reactor cooled by a single coolant stream. This inherently nonlinear case study requires adaptive controllers to cope with its time-varying plant dynamics. The plant considered here is a single-input–single-output (SISO) one, where the output is the product concentration, C, and the input is the coolant flow rate, qc. The chemical reaction that produces the compound takes place inside the insulated tank. This exothermic process raises the temperature inside the tank and reduces the reaction rate. The objective is to control the measured concentration of the product, C, by manipulating the coolant flow rate, qc.

Modeling the CSTR

The dynamic equations of the system are as follows:

dT/dt = (q_if / V)(T_if − T) + K_1 C exp(−E / (R T)) + K_2 q_c [ 1 − exp(−K_3 / q_c) ] (T_cf − T)    (18)

dC/dt = (q_if / V)(C_if − C) − K_0 C exp(−E / (R T))    (19)

These equations were simulated by the Euler method with a time step of 0.1 s. We assume that the constraints on the variables are |C| ≤ 0.2 mol/L and |qc| ≤ 20 L/min. The plant constants, which include the heat of reaction, specific heats, liquid densities, and heat transfer terms in simplified forms, are summarized in Table 1.
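A minimal simulation sketch of Eqns (18) and (19) under the nominal constants of Table 1 is given below (Python, ours). The rate constants in Table 1 are per minute, so the integration step dt below is expressed in the same time unit; the initial condition is the first operating point of Eqn (22).

```python
import math

# Nominal CSTR constants from Table 1
q_if, C_if, T_if, T_cf = 100.0, 1.0, 350.0, 350.0   # L/min, mol/L, K, K
V, E_over_R = 100.0, 1.0e4                          # L, K
K0, K1, K2, K3 = 7.2e10, 1.44e13, 0.01, 700.0

def cstr_step(T, C, qc, dt):
    """One Euler step of Eqns (18)-(19); dt in the same time unit as the rate constants."""
    arrhenius = math.exp(-E_over_R / T)
    dT = (q_if / V) * (T_if - T) + K1 * C * arrhenius \
         + K2 * qc * (1.0 - math.exp(-K3 / qc)) * (T_cf - T)
    dC = (q_if / V) * (C_if - C) - K0 * C * arrhenius
    return T + dt * dT, C + dt * dC

# Example: hold the coolant flow at the first operating point of Eqn (22)
T, C, qc = 432.95, 0.1298, 110.0
for _ in range(100):
    T, C = cstr_step(T, C, qc, dt=0.01)
print(T, C)   # stays close to the equilibrium of that operating regime
```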

CSTR control

It has been shown that an optimized Proportional-Integral (PI) controller is able to control the CSTR for certain operating regions.[7] To use this PI controller in a simulation study, the equivalent discrete-time form is required. This is given as follows:

u(n) = K_p e(n) + K_i T_s ∑_{t=1}^{n} e(t)    (20)

where Ts is the sampling time.
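A minimal discrete-time PI controller matching Eqn (20) is sketched below; the default gains Kp = 440, Ki = 550 and sampling time Ts = 0.8 s are the values quoted later in the Data for Simulation section, and the class interface is ours.

```python
class DiscretePI:
    """Discrete-time PI controller of Eqn (20): u = Kp*e + Ki*Ts*sum(e)."""

    def __init__(self, Kp=440.0, Ki=550.0, Ts=0.8):
        self.Kp, self.Ki, self.Ts = Kp, Ki, Ts
        self.e_sum = 0.0

    def __call__(self, error):
        """Return the control action for the current error sample."""
        self.e_sum += error
        return self.Kp * error + self.Ki * self.Ts * self.e_sum

# Usage at every sampling instant: u = pi(Cd - C)
pi = DiscretePI()
u = pi(0.12 - 0.10)
```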

Table 1. The CSTR process parameters.[7]

Variables   Description                     Nominal values
qif         Product flow rate               100 L/min
Cif         Input product concentration     1 mol/L
Tif         Input temperature               350 K
Tcf         Coolant temperature             350 K
V           Container volume                100 L
E/R         Activation energy term          10^4 K
K0          Reaction rate constant          7.2 × 10^10 /min
K1          Plant constant                  1.44 × 10^13 K·L/(min·mol)
K2          Plant constant                  0.01 /L
K3          Plant constant                  700 L/min


The objective of the controller proposed in this article is to improve the robustness of the system to a level higher than that provided by the conventional PI control scheme. Since an approximate model of the process is available, we first design a fixed controller based on the PI control scheme. The PI-controlled process becomes the environment (plant) for RL. The RL supplements the conventional PI control scheme, and the composite control structure results in an adaptive control that improves the robustness.

Data for simulation

In the CSTR process control problem, the objective is to control the measured concentration of the product, C, by manipulating the coolant flow rate, qc. The CSTR system has two state variables, namely, the reactor temperature and the reactor product concentration. We define the error as the difference between the desired and actual values of the product concentration, i.e. e(t) = Cd − C, where Cd is the desired value of the product concentration. The reinforcement signal is given as

R = 10,   if |e(t)| < 0.001 mol/L
R = 0,    if 0.001 < |e(t)| < 0.005 mol/L
R = −1,   otherwise    (21)

The parameters for the RL controller are as follows: the discount factor γ is set to 0.8; the learning-rate parameter α is set to 0.2; and the exploration level ε decays from 0.5 → 0.002 over the iterations. We consider a set of PI parameters: KP = 440, Ki = 550, with sampling time Ts = 0.8 s.
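The reinforcement signal of Eqn (21) and the quoted RL settings translate directly into code; the sketch below is ours, and the boundary values |e| = 0.001 and |e| = 0.005 mol/L, which Eqn (21) leaves open, are grouped here with the zero-reward and penalty bands, respectively.

```python
def reinforcement(error):
    """Reward signal of Eqn (21); error e(t) = Cd - C in mol/L."""
    if abs(error) < 0.001:
        return 10
    if abs(error) < 0.005:
        return 0
    return -1

# RL controller settings quoted in the text
GAMMA = 0.8                         # discount factor
ALPHA = 0.2                         # learning rate
EPS_START, EPS_END = 0.5, 0.002     # exploration level decays over the iterations
KP, KI, TS = 440.0, 550.0, 0.8      # PI parameters and sampling time (s)
```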

SIMULATION RESULTS AND DISCUSSIONS

The aim of this control problem is to study the performance of the adaptive controller based on two algorithms, namely, Q-learning and policy iteration in the RL framework, such that the controller can control the product concentration with a steady-state error as small as possible. In order to do so, the CSTR system model has been simulated for a single episode of 80 s using the Euler method, with a fixed time step of 10 ms. MATLAB 7.4.0 (R2007a) has been used as the simulation tool.

The dynamic behavior of the CSTR process is not the same at different operating points, and the process is, indeed, nonlinear. Eigenvalue analysis shows that the stable equilibrium regime of the CSTR lies in C ∈ (0, 0.13566) mol/L and qc ∈ (0, 110.8) L/min.[8] We set the desired value of the product concentration to 0.12 mol/L, and start the simulation with one of the five local operating regimes:

C_1 = 1.2980e−1 mol/L,  T_1 = 432.95 K,  q_c1 = 1.10e2 L/min
C_2 = 8.5060e−2 mol/L,  T_2 = 442 K,     q_c2 = 9.8899e1 L/min
C_3 = 5.8541e−2 mol/L,  T_3 = 450 K,     q_c3 = 8.8291e1 L/min
C_4 = 2.9468e−2 mol/L,  T_4 = 465 K,     q_c4 = 6.8788e1 L/min
C_5 = 1.4630e−2 mol/L,  T_5 = 481 K,     q_c5 = 5.0438e1 L/min    (22)

The learning performance of the controller, as the product concentration vs simulation sample points for the single episode, is shown in Fig. 2, and Table 2 tabulates the number of sample points at which the process response settles to the desired value of the product concentration.

From the results (Fig. 2, Table 2), we observe that the response of the PI with the policy-iteration algorithm settles to the reference faster than the other algorithms. The smallest number of sample points at settling implies the shortest experiment time.

Figure 2. Comparison of controller performances (product concentration vs sample points; curves: PI, PI + Q-learning, PI + Policy iteration). This figure is available in colour online at www.apjChemEng.com.

Table 2. Performance comparison.

Controller algorithm      Number of samples for settling of process response
PI                        46
PI + Q-learning           67
PI + Policy iteration     35


Figure 3. MSE vs change in product flow rate (qif) (mean square error vs product flow rate in L/min; curves: PI, PI + Q-learning, PI + Policy iteration). This figure is available in colour online at www.apjChemEng.com.

Figure 4. Comparison of controller performances with disturbance in qif (product concentration vs sample points; curves: PI, PI + Q-learning, PI + Policy iteration). This figure is available in colour online at www.apjChemEng.com.

In the following, we compare the performance of all three controller algorithms under uncertainties. For this study, we trained the controller for 20 episodes at steady-state operating data with one of the local operating points.

1. Disturbance in product flow rate (qif): We consider a variation in input product flow rate from 70 L/min to 130 L/min (nominal value 100 L/min), and make it occur at sample point 500. Figure 3 shows the mean square error vs change in product flow rate for the PI, PI + Q-learning, and PI + policy-iteration controller algorithms, and the comparison of controller learning performances is shown in Fig. 4. Table 3 tabulates the values of mean square error for ±20% variation with respect to the nominal value of product flow rate.

2. Disturbance in input product concentration (Cif): We consider a variation in input product concentration from 0.5 to 1.5 mol/L (nominal value 1 mol/L), and make it occur at sample point 500.

Table 3. MSE comparison of controller algorithms (mol/L) (MSE values shown ×10−5).

Controller algorithm      qif = 80 L/min    qif = 100 L/min    qif = 120 L/min
PI                        5.187             3.266              4.984
PI + Q-learning           4.620             2.742              5.388
PI + Policy iteration     5.372             2.742              4.523

Figure 5. MSE vs change in input product concentration (Cif) (mean square error vs input product concentration in mol/L; curves: PI, PI + Q-learning, PI + Policy iteration). This figure is available in colour online at www.apjChemEng.com.

Figure 6. Comparison of controller performances with disturbance in Cif (product concentration vs sample points; curves: PI, PI + Q-learning, PI + Policy iteration). This figure is available in colour online at www.apjChemEng.com.

Figures 5 and 6 show the mean square error vs change in input product concentration and the controller learning performances, respectively, for the PI, PI + Q-learning, and PI + policy-iteration controller algorithms. Table 4 tabulates the values of mean square error for ±40% variation with respect to the nominal value of product concentration.

3. Disturbance in input temperature (Tif): We consider a variation in input temperature from 300 to 400 K (nominal value 350 K), and make it occur at sample point 500.


Table 4. MSE comparison of controller algorithms (mol/L) (MSE values shown ×10−5).

Controller algorithm      Cif = 0.6 mol/L    Cif = 1.0 mol/L    Cif = 1.4 mol/L
PI                        11.781             3.108              6.041
PI + Q-learning           9.725              3.103              6.246
PI + Policy iteration     8.777              2.742              5.295

Figure 7. MSE vs change in input temperature (Tif) (mean square error vs input temperature in K; curves: PI, PI + Q-learning, PI + Policy iteration). This figure is available in colour online at www.apjChemEng.com.

Figure 8. Comparison of controller performances with disturbance in Tif (product concentration vs sample points; curves: PI, PI + Q-learning, PI + Policy iteration). This figure is available in colour online at www.apjChemEng.com.

Figures 7 and 8 show the mean square error vs change in input temperature and the controller learning performances, respectively, for the PI, PI + Q-learning, and PI + policy-iteration controller algorithms. Table 5 tabulates the mean square error for ±30% temperature variation with respect to the nominal value.

Table 5. MSE comparison of controller algorithms (mol/L) (MSE values shown ×10−5).

Controller algorithm      Tif = 320 K    Tif = 350 K    Tif = 380 K
PI                        22.44          3.111          11.19
PI + Q-learning           30.08          3.111          7.639
PI + Policy iteration     24.28          2.742          10.09

Simulation results (Figs 3–8, Tables 3–5) indicate that, in the presence of disturbances around the nominal operating conditions, the PI with policy-iteration controller algorithm controls the concentration of the product with a lower steady-state error than the PI with Q-learning and the plain PI controller. In most cases, the controllers using RL showed good robustness against disturbances.

CONCLUSIONS

This article has critically examined the RL framework for the control of chemical processes. In particular, the Q-learning and policy-iteration algorithms have been evaluated for control of a highly nonlinear simulated CSTR. This inherently nonlinear case study demonstrated that (1) the policy-iteration algorithm converges very fast compared to Q-learning, and the settling time in adaptive control based on policy iteration is therefore relatively smaller; and (2) the robustness achieved through policy iteration is comparable to, or better than, that achieved through Q-learning.

Since settling time and robustness are very important issues in control, policy iteration is a potential candidate for adaptive control of nonlinear chemical processes.

NOMENCLATURE

S            State space
A            Action space
D            Initial state distribution
R            Single-step reward
π            Deterministic policy
π*           Optimal policy
P            State transition probability
V^π(s)       State-value function
Q^π(s, a)    Action-value function
γ            Discount factor in the range [0, 1)
α            Learning rate in the range [0, 1]
δ            Temporal difference
n^π_{αβ}     Number of transitions from α to β by the trajectory generated under policy π

REFERENCES

[1] M. Morari, J.H. Lee. Comput. Chem. Eng., 1999; 23, 667–682.
[2] R. Sutton, A. Barto. An Introduction to Reinforcement Learning, MIT Press: Cambridge, Massachusetts, 1998.
[3] M.L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming, John Wiley & Sons, Inc.: New York, 1994.
[4] C.J.C.H. Watkins. Learning from Delayed Rewards, PhD Dissertation, University of Cambridge, Cambridge, England, 1989.
[5] R. Howard. Dynamic Programming and Markov Processes, MIT Press: Cambridge, 1960.
[6] R. Sharma, M. Gopal. Int. J. Comput. Intell., 2006; 3, 169–178.
[7] J. Govindhasamy, S.F. McLoone, G.W. Irwin. Reinforcement learning for process identification, control and optimisation. IEEE International Conference on Intelligent Systems, 2004; pp. 316–321.
[8] R. Gao, A. O'Dwyer, E. Coyle. A nonlinear PID control for CSTR using local model networks. In Proceedings of the 4th World Congress on Intelligent Control and Automation, 2002; pp. 3278–3282.
