
MODELING AND CONTROL OF DISCRETE EVENT DYNAMIC SYSTEMS: A SIMULATOR-BASED REINFORCEMENT-LEARNING PARADIGM

PAOLO DADONE
HUGH VANLANDINGHAM

The Bradley Department of Electrical and Computer Engineering, Virginia Polytechnic Institute and State University, 340 Whittemore Hall, Blacksburg, Virginia 24061-0111, U.S.A.

BRUNO MAIONE
Dipartimento di Elettrotecnica ed Elettronica, Politecnico di Bari, Via E. Orabona 4

Bari, 70125, Italy

A general reinforcement-learning approach for controlling discrete event systems is presented. A machine-repair example is formulated: (1) to describe and explain the DEVS formulation, and (2) to illustrate the general control method. Modified gradient learning methods and evolutionary programming methods are compared for the purpose of optimizing the controller. An on-line adaptation method is presented; and, the use of fuzzy logic and artificial neural networks for such adaptation is compared. Evolutionary programming methods for controller optimization prove to be the most robust type of optimization. Moreover, the fuzzy and neural adaptation approaches are successful in improving the performance of the static controller for dynamic operating conditions.

Keywords: Intelligent control; DEDS; DEVS; Reinforcement-learning; Evolutionary programming; Fuzzy logic; Artificial neural networks.

1. Introduction

Classical dynamic system models, i.e. differential equation based models, have developed in large part to describe natural physical systems. In contrast, the widespread deployment of man-made systems, such as manufacturing systems of various types, has given rise to more general models which are called discrete event dynamic system (DEDS) models. DEDS can be used to model the "event-driven" systems common to man-made systems as well as subsystems which are "time-driven" natural processes, such as temperature decay and the dynamics and kinematics of mass motion, making DEDS a more powerful paradigm.1 In response to industrial demand for DEDS models, several simulation packages have been developed; and their use has greatly helped the understanding of these systems and the development of control policies for them. A discrete event system specification, called DEVS formulation, was instrumental in the development of such software. DEVS is a system theoretic method for modeling DEDS which provides a descriptive framework for such systems.2,3,4,5 Both the complexity and the stochastic nature of discrete event systems, along with the non-homogeneity of relations, contribute greatly to their analytical intractability. It is for this reason that we assume the availability of DEDS simulation software as a means of representing the plant, i.e. the system to be controlled. Suitable control policies for DEDS are generally difficult to derive using classical control techniques. The rise of "intelligent" methods, such as fuzzy logic system (FLS) and artificial neural network (ANN) control, has helped to some extent; however, a truly optimizing controller for DEDS continues to be a challenging area of research.

The mathematical foundation of optimal control for general systems, including DEDS, has been available for several decades in the form of Bellman's dynamic programming (DP) algorithm.6,7 In essence, DP seeks to find a control policy that minimizes a total average cost functional which, in turn, can be partitioned into the cost of the current control action and the remaining "cost-to-goal". The basic drawback of DP is that for significant problems the search space expands exponentially. As an alternative to the exact calculation of the optimal control policy, one might calculate a limited look-ahead policy with cost-to-goal generated by an approximating function. If, for example, an artificial neural network is used as the approximating function, it can be included as part of a performance evaluation for selecting the proper control action. In addition, its parameters can be updated on-line as information about the system is collected. Figure 1 illustrates the generic control system structure. The difficulty in using the reinforcement-learning concept is in finding the approximating function for the cost-to-goal. To cope with this problem, it will be assumed, even though no analytical model of the plant is available, that a plant simulator is accessible. (This assumption is reasonable since there is wide availability of specialized simulation programs for different types of systems.)

The approximation problem is the key to using the DP approach; however, here the problem can take several different paths. First of all, there are many approximating paradigms, including ANNs and FLSs. Secondly, there are many approaches to approximating the complex system. For example, if the state-input (configuration) space can be partitioned appropriately, then some type of modular approach may be taken, where "simple" modules, used to approximate the function in different regions, can be combined to form the overall approximation.8 Up to this point, this approach could be called heuristic dynamic programming (HDP).7

The option of improving the cost-to-goal approximation on-line involves the "reinforcement-learning" (RL) concept. Referring to Fig. 1, this function is implicitly performed in the block entitled "Performance Criterion" (PC).

A direct approach can be used to determine improvements in the control action. With this method the configuration space is "explored" by modifying the control actions and observing performance changes. This type of trial-and-error learning has been called reinforcement-learning because of its similarity to animal behavior modification used by psychologists. Assuming that the controller is a parameterized structure, such as a neural network, the exploration of the parameter space can be performed either on-line with the actual system, or off-line with the use of a simulator.

The on-line mode of adaptation amounts to a judicious perturbation of the controller parameters and noting the change in the PC. Given enough time to "gently" explore the space, the PC can be modified to include some "hard limit" terms that will prevent instability. One might think of this more general PC as a heuristic for the cost-to-goal in the DP interpretation of determining the optimal control. The off-line mode can be used effectively if the plant model is accurate. In this case many "trials" can be run as a practice exercise for applying the next control input. This off-line search is one way in which the judicious choice of perturbations can be made.

In this paper a simple two-parameter control policy is formulated for the example of a shop running 50 machines which fail probabilistically. It is the authors' opinion that a specific example is beneficial for understanding of both the DEVS modeling process and the details of the adaptive control methods. The control algorithm is based on a performance index that is a measure of the shop profit. The DEVS formulation for the machine-repair problem is presented in Section 2. A preliminary discussion about the characteristics of the problem is given in Section 3. The reinforcement learning paradigm that uses gradient ascent methods and evolutionary programming for the design phase, and fuzzy and neural adaptation for the on-line operation, is presented in Section 4 and then applied in Section 5, where simulation results are given. Concluding comments are given in Section 6.

2. DEVS Formulation of the Machine Repair Plant

2.1. Informal description of the plant

It will be assumed that there are N machines working in a plant which are subject to individual failure according to random failure times. When a failure occurs, a machine can change state from correct-functioning (state 1) to "minor-repair-needed" (state 2), up to a complete disfunctioning ("non-operable" state F). This change of state occurs at failure times according to some state transition probabilities as shown in Fig. 2.

[Figure 1. Reinforcement Learning Control - block diagram with Reference, Adaptive Controller, Plant, Disturbance, Response and Performance Criterion blocks.]

[Figure 2. Machine Repair Example - functioning states 1, 2, 3, ..., F with transition probabilities t_ij, a repair queue, efficiencies η(1), ..., η(F) and repair costs c(2), ..., c(F).]


At a failure time a machine in state i will either stay in the same functioning state with probability t_ii or it will transition to a worse functioning state j (j > i) with probability t_ij. When a machine is in state i, it produces a profit of η(i)·P $/time unit, where η(i) is the efficiency of the machine in state i (0 ≤ η(i) ≤ 1) and P is the profit due to a correct functioning machine (therefore, η(1) = 1). A machine's efficiency corresponds to the fact that when it is not working properly, it consumes more fuel, produces pieces with less quality and even takes more time; all of which impacts the system profit. When a machine is in a state i ≠ 1 (i.e., a failure has occurred), it is eligible to be sent to repair. If so, the machine returns to state 1 (Fig. 2), at a cost of c(i) (per unit time), having kept one of the M repair teams busy for a random repair time. If all M repair teams are busy, the machine will wait in a FIFO (first in first out) queue. Sending a machine to repair will also have another (indirect) cost, i.e., since the machine is down, the plant is not profiting from that machine. When a machine is repaired, it goes back to state 1 and starts to work, as explained before. In this model the control action is triggered by the failure of a machine; based on the actual value of the output of the system (the instantaneous profit that the plant is making), the controller decides whether to send some machines to repair, or not.

2.2. DEVS formulation of the plant

Our system is a discrete event system; indeed, state transitions are due to the occurrence of three types of events:
• e1: a possible failure of a machine;
• e2: the end of repair of a machine;
• e3: sending n machines to repair.
The first two events are internal events. The last one is an external event, and corresponds to the actual input to the system. Using a system theoretic (DEVS) formalism the system can be described as S = < X, Σ, Y, δ, λ, ta >. In the following there is a description of each of the six components of the system, S.

The input set (X): The input corresponds to the last event (e3: send n machines to repair), thus it is completely specified by the integer number n. It can be any integer between 0 (send no machine to repair, i.e., the non-event) and N (send all the machines to repair). Therefore:

X = {0, 1, 2, ..., N} (2.1)

The sequential state set (Σ): The sequential state is comprised of four "macro" logic components. The first of the four components is the SERVICE LIST, i.e., the list of machines that are actually working. This is a FIFO list but, as we will see, its order is not important. Every record in this list has three components: (1) the identification number (ID) of the machine, (2) the service time left (σ) for the machine and (3) the current functioning state (s) of the machine; therefore, the ith record can be written as (ID_i^s, σ_i^s, s_i^s). The list of working machines has I records, where I ∈ N and I ≤ N. The second of the four components is the REPAIR QUEUE, the queue of machines waiting for a free repair team. This is a FIFO queue. Every record in this queue has two components: (1) the ID of the machine and (2) its functioning state. Therefore, the jth record in the queue will be (ID_j^q, s_j^q). This queue has J records, where J ∈ N and J ≤ N. The third of the four components is the ON-REPAIR LIST, the list of machines that are currently under repair. This is a FIFO list, but in this case, the order is not important. Every record in this list has three components: (1) the ID of the machine, (2) the repair time left for the machine and (3) the functioning state the machine was in before starting the repair; therefore, the kth record will be denoted (ID_k^r, σ_k^r, s_k^r). This list has K records, where K ∈ N and K ≤ M (M is the number of repair teams). The fourth and last component of the state is the number of free repair teams (FRT). This can be any integer between 0 (all the repair teams are busy) and M (all the repair teams are idle). The machine IDs can be any integer between 1 and N, and the σs are positive (real) numbers. The functioning state of a machine can be any integer between 1 (correct functioning) and F (complete disfunctioning), where F is the number of functioning states of the machines. The state s will look like:

s = ( ((ID_1^s, σ_1^s, s_1^s), (ID_2^s, σ_2^s, s_2^s), ..., (ID_I^s, σ_I^s, s_I^s)),
      ((ID_1^q, s_1^q), (ID_2^q, s_2^q), ..., (ID_J^q, s_J^q)),
      ((ID_1^r, σ_1^r, s_1^r), (ID_2^r, σ_2^r, s_2^r), ..., (ID_K^r, σ_K^r, s_K^r)),
      FRT )

In the following: N_T = {1, 2, …, T} ∀T ∈ N, N_{T,0} = N_T ∪ {0} and A* is the set of all the possible lists made of elements of A. If we define L = N_N × R+ × N_F, then we have that the SEQUENTIAL STATE SET is:


Σ = L* × (N_N × N_F)* × L* × N_{M,0} (2.2)
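For concreteness, a minimal Python sketch of the sequential state just defined is given below; the class and field names are illustrative choices, not part of the DEVS formulation itself.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ServiceRecord:          # (ID_i^s, sigma_i^s, s_i^s)
    machine_id: int
    service_time_left: float
    functioning_state: int

@dataclass
class QueueRecord:            # (ID_j^q, s_j^q)
    machine_id: int
    functioning_state: int

@dataclass
class RepairRecord:           # (ID_k^r, sigma_k^r, s_k^r)
    machine_id: int
    repair_time_left: float
    functioning_state: int

@dataclass
class SequentialState:        # an element of Sigma in Eq. (2.2)
    service_list: List[ServiceRecord] = field(default_factory=list)
    repair_queue: List[QueueRecord] = field(default_factory=list)
    on_repair_list: List[RepairRecord] = field(default_factory=list)
    free_repair_teams: int = 0    # FRT, an integer between 0 and M
```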

The output set (Y): In this system the output will be a number indicating the instantaneous profit the plant is making; therefore, it can be any real number, with negative numbers indicating loss.

Y = R (2.3)

The output function (λ): As explained in the informal model, the output (i.e., the plant profit) depends on the functioning state of all the working machines less the cost incurred for machines being repaired. Therefore, we need to define two functions that will be a part of the output function. The first is an efficiency function, η : {1, 2, ..., F} → [0, 1], where η(i) is the efficiency of a machine in state i. The second function is a cost to repair function, c : {2, ..., F} → R+, where c(i) is the cost (per time unit) to repair a machine that is in state i (c is defined only for i > 1 since there is no sense in repairing a correctly functioning machine, state 1). Thus, we can define the output function as:

λ : Σ → Y ∋ λ(s) = P · Σ_{i=1}^{I} η(s_i^s) − Σ_{k=1}^{K} c(s_k^r) (2.4)
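A minimal sketch of the output function of Eq. (2.4), assuming the efficiency and repair-cost functions are supplied as dictionaries indexed by functioning state (names are illustrative):

```python
def plant_output(service_states, on_repair_states, eta, c, P):
    """Instantaneous profit, Eq. (2.4): profit from working machines weighted
    by their efficiency, minus the repair cost of machines under repair."""
    profit = P * sum(eta[s] for s in service_states)
    repair_cost = sum(c[s] for s in on_repair_states)
    return profit - repair_cost

# Example with the numbers of Section 3:
# eta = {1: 1.0, 2: 0.85, 3: 0.6, 4: 0.25, 5: 0.0}
# c = {2: 30, 3: 45, 4: 60, 5: 90}
# plant_output([1, 1, 2], [3], eta, c, P=10)  ->  10*(1 + 1 + 0.85) - 45 = -16.5
```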

The time advance function (ta): The time advance function gives us the next hatching time, that is, the next internal event occurrence time in the absence of external events. If we call σ_min^s the minimum of σ_i^s over all i, and σ_min^r the minimum of σ_k^r over all k, then the time advance function is given by:

ta : Σ → R+

ta(s) = min{σ_min^s, σ_min^r}   if σ_min^s ≠ σ_min^r
        σ_min^r                 otherwise                (2.5)

If ta(s) = σ_min^s, this means that the next event is a possible machine failure. If ta(s) = σ_min^r, this means that the next event is an end-of-repair event. The second line of the definition of the time advance function in Eq. (2.5) has the following implicit "tie-breaking rule": if a possible failure event and an end-of-repair event occur together, give precedence to the end-of-repair. Another possible situation is that σ_min^s corresponds to two elements in the service list. In this case the precedence will be given to the first record. The same will happen for the on-repair list. This is the only case where the FIFO nature of the lists affects our model.

The transition-specifying function (δ): Characterizing the transition specifying function is the most difficult job in a DEVS model. The main reason is that in most cases there is no closed form for this function, in which case it must be defined with tables, algorithms, etc. For present purposes we will define this function with a sort of "algorithm" for every possible state transition.

This function is divided into two parts, the internal transition function (δΦ) and the external transition function (δex). The former specifies the behavior of the system after an internal event occurs, while the latter specifies the behavior of the system after an external event. The internal transition function is:

δΦ : Σ → Σ, ∋ s+ = δΦ(s-) (2.6)

An internal transition can take place if event e1 or e2 happens. If the event is e1 (possible failure of a machine), this means that ∃ m ∈ N_I ∋ σ_m^s = σ_min^s = ta(s). In this case the actions to take are:

(1) Determine the new functioning state of machine ID_m (denoted s_m^{s+}) according to a given state transition matrix T, i.e., t_ij is the probability of a transition from state i to state j. More precisely, the new functioning state for machine ID_m, presently in state s_m^{s−} = i, will be sampled from the following discrete probability function: (1, t_i1), (2, t_i2), …, (F, t_iF). (In our case the matrix T, whose rows sum to one since each of them represents a discrete probability function, is upper triangular, since every machine can only transition to a worse functioning state.);

(2) Sample a new service time from the corresponding distribution;
(3) If s_m^{s+} ≠ s_m^{s−} (a failure occurred), send a signal to the controller.

If the event that takes place is e2 (end-of-repair of a machine), this means that ∃ m ∈ N_K ∋ σ_m^r = σ_min^r = ta(s). In this case the actions to take are:
(1) Remove the mth record from the on-repair list (i.e., the machine just finished being repaired) and add it at the end of the service list. (From now on, for the sake of simplicity, we will assume the basic operations of remove and add to a list are understood);

(2) Sample a service time from the appropriate service time distribution;


(3) Set the functioning state of the machine to correct functioning: s_I^{s+} = 1;

(4) If the repair queue is empty: FRT = FRT + 1; else remove the first record from the repair queue, add it at the end of the on-repair list and sample a repair time using the appropriate repair time distribution.

This completely specifies the internal transition function. The external transition function is:

δex : Q × Ω → Σ , ∋ s+ = δex(s-, e, ω) (2.7)

where Q = {(s, e) | s ∈ Σ ∧ 0 ≤ e ≤ ta(s)} is the state set, and Ω is the input segment set, i.e., a subset of the set of all the input segments. If at e time units since the last event there is an external event (e3) of value n (i.e., send n machines to be repaired) the actions to take are:
(1) Evaluate the number of machines in the service list having s_i^{s−} ≠ 1, i ∈ {1, 2, ..., I}, and, if there are b such machines, let ν = min{n, b};
(2) Take the ν "worst" machines in the service list and send them to repair. This means that we must remove the entries corresponding to these machines from the service list and add them at the end of the on-repair list or of the repair queue, depending on the availability of repair teams. More precisely, for each of the ν machines on the service list we have to:
(a) Remove the corresponding record (call it the pth record) from the service list;
(b) If FRT ≠ 0: add the pth record to the on-repair list, sample a repair time from the appropriate distribution and decrement FRT; else: add the pth record at the end of the repair queue.
The state transition specifying function is now completely defined, completing the DEVS formulation.
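The following Python sketch mirrors the internal and external transitions just described. It keeps the state as plain lists and omits the clock bookkeeping (decrementing the residual σ's by ta(s) at each event), so it illustrates the transition logic rather than being a complete simulator; all names are illustrative.

```python
import random

def internal_transition(state, T, sample_service_time, sample_repair_time, F):
    """Handle the next internal event: e1 (possible failure) or e2 (end of repair).
    `state` is (service, queue, on_repair, frt); records are [id, sigma, s] lists."""
    service, queue, on_repair, frt = state
    s_min = min((rec[1] for rec in service), default=float("inf"))
    r_min = min((rec[1] for rec in on_repair), default=float("inf"))
    failure_signal = False
    if r_min <= s_min:                                   # e2; end-of-repair wins ties
        k = next(i for i, rec in enumerate(on_repair) if rec[1] == r_min)
        machine_id = on_repair.pop(k)[0]
        service.append([machine_id, sample_service_time(), 1])   # back to state 1
        if not queue:
            frt += 1
        else:
            qid, qstate = queue.pop(0)
            on_repair.append([qid, sample_repair_time(), qstate])
    else:                                                # e1; possible failure
        m = next(i for i, rec in enumerate(service) if rec[1] == s_min)
        machine_id, _, old_state = service[m]
        new_state = random.choices(range(1, F + 1), weights=T[old_state - 1])[0]
        service[m] = [machine_id, sample_service_time(), new_state]
        failure_signal = new_state != old_state          # signal the controller
    return (service, queue, on_repair, frt), failure_signal

def external_transition(state, n, sample_repair_time):
    """Handle e3: send the n 'worst' failed machines to repair."""
    service, queue, on_repair, frt = state
    failed = sorted([rec for rec in service if rec[2] != 1],
                    key=lambda rec: rec[2], reverse=True)
    for rec in failed[:n]:                               # nu = min{n, b} machines
        service.remove(rec)
        if frt > 0:
            on_repair.append([rec[0], sample_repair_time(), rec[2]])
            frt -= 1
        else:
            queue.append([rec[0], rec[2]])
    return (service, queue, on_repair, frt)
```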

2.3. Controller

The controller is a relatively simple one. Namely, when there is a failure of a machine in the system, a control action is required. The controller takes, as input, the output of the system and determines whether to send R machines to repair or not (the control action), applying the following control rule: if the input is less than w·P·N, then send R machines to repair.

This controller is defined by the two parameters w and R. The w parameter can be any real number in [0, 1] and the R parameter can be any non-negative real number. If w = 0 this means that the "threshold" for the input is zero, i.e., the controller will never send a machine to repair. The opposite extreme corresponds to a value of w = 1 which means that the threshold is P·N, i.e., the maximum profit that the plant can realize having all the machines working at the same time in a correct functioning state. In this case, at every failure R machines will be sent to repair. A value of R = 0 means that we never send a machine to be repaired. A value of R that is not an integer has a special meaning. It means that we are sending to repair as many machines as the integer part of R (IR) plus one more machine with a probability given by the decimal part of R (DR). Therefore, we send IR machines to repair and decide whether to send another machine or not, according to the "probability" DR. Remembering what is written for the δex function, in our case the response of the controller is immediate, therefore e = 0. This means that immediately after a failure we have a send-to-repair action.
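As a quick illustration, the control rule and the probabilistic interpretation of a fractional R can be sketched as follows; the defaults P = 10 and N = 50 anticipate the values chosen in Section 3, and the function name is illustrative.

```python
import random

def control_action(instantaneous_profit, w, R, P=10.0, N=50):
    """Two-parameter control rule: on a failure signal, if the plant output is
    below the threshold w*P*N, send floor(R) machines to repair plus one more
    with probability equal to the fractional part of R; otherwise send none."""
    if instantaneous_profit >= w * P * N:
        return 0
    n = int(R)                         # integer part IR
    if random.random() < R - n:        # decimal part DR used as a probability
        return n + 1
    return n
```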

3. Problem Formulation

In our simulations the following parameters were chosen:
• N = 50 machines, to establish a reasonably sized plant;
• P = 10 $/time unit;
• M = 10 repair teams, in order not to have too much of a constraint on R;
• F = 5 functioning states for each machine;
• η(1) = 1, η(2) = 0.85, η(3) = 0.6, η(4) = 0.25, η(5) = 0 (efficiency function);
• c(2) = 30 $, c(3) = 45 $, c(4) = 60 $, c(5) = 90 $ (cost to repair function);
• Service times are exponentially distributed with a mean time between failures (MTBF) of 5 time units;
• Repair times are exponentially distributed with a mean time to repair (MTTR) of 1 time unit;
• Functioning state transition probability matrix (T):


T = | 0.05  0.80  0.10  0.045  0.005 |
    | 0     0.05  0.80  0.10   0.05  |
    | 0     0     0.05  0.80   0.15  |
    | 0     0     0     0.05   0.95  |
    | 0     0     0     0      1     |

The estimated expected profit per unit time is used as a performance index (PI) for the control actions. The estimate is computed by averaging (over time) on one simulation 1000 time units long. Transients are excluded from the averaging process by deleting the observations corresponding to the first 100 time units. Through an extensive number of simulations the behavior of the PI as a function of w and R has been determined and then plotted in Fig. 3. The following observations can be made:
(1) The response surface is "corrugated" due to the stochastic nature of the system; even increasing the length of the simulation or using several replications does not show a significant improvement.
(2) There is a strong discontinuity at R = 1 for every value of w ∈ [0, 1]. This means that if the repair is a probabilistic action (instead of deterministic as for R ≥ 1), the PI is strongly affected; that is, the system really needs the repair actions.
(3) There is a global maximum of approximately 200 for R = 1 and w between 0.6 and 0.8.
(4) There are an infinite number of local minima for w = 1 and every R. These local minima also have the same value (PI = 119.6). The presence of these local minima is easily explained; indeed, w = 1 means that the threshold for the repair action is the maximum profit we can ever make with the plant, that is, we will send a machine to repair every time there is a failure. This means that as soon as a machine breaks, we send it to repair. Therefore, there will always be no more than one machine broken at a time and this creates the insensitivity of the minima to the R parameter.
The preceding considerations show that, although a relatively simple problem, determination of the optimal (or even a "good" sub-optimal) controller for this system is not simple.
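As a concrete illustration of the performance index defined at the beginning of this section, a minimal sketch of its estimation from one simulation run is given below: a time-weighted average of the instantaneous profit with the first 100 time units discarded (function and argument names are illustrative).

```python
import numpy as np

def estimate_pi(event_times, profits, t_end=1000.0, warmup=100.0):
    """Time-averaged profit per unit time. `profits[i]` is the instantaneous
    profit held from event_times[i] until the next event; observations before
    `warmup` are excluded to remove the transient."""
    times = np.append(np.asarray(event_times, dtype=float), t_end)
    weighted, total = 0.0, 0.0
    for t0, t1, y in zip(times[:-1], times[1:], profits):
        lo = max(t0, warmup)
        if t1 > lo:
            weighted += y * (t1 - lo)
            total += t1 - lo
    return weighted / total
```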

4. The Reinforcement Learning Paradigm

The reinforcement learning paradigm consists of an off-line (design) phase, and an on-line (adaptation) phase. The design phase consists in finding, by means of simulation, some values for the parameters of a given parametric controller that optimize some performance index. Therefore, this phase can be regarded as a simulation response optimization problem where several approaches can be taken.9,10 Even though the field of simulation optimization is quite mature, there is no general "good" algorithm for simulation optimization, but there are several techniques which give some good results.9 Therefore, there is no fixed method for the controller design, but a set of methods to choose by trial-and-error. This makes the design phase, even though apparently simple, a task that is far from simple. In the following we will use path search methods, that is, we will try to find the optimal parameters for the controller by employing a path search in the direction given by an estimate of the gradient of the PI with respect to the controller parameters.

[Figure 3. Response Surface: PI plotted versus w ∈ [0, 1] and R ∈ [0, 6].]


An estimate of this gradient will be determined using finite differences or response surface methodologies and some modifications thereof. Finally, we will also employ a random search algorithm, namely, evolutionary algorithms, which is the most computationally expensive, but also the most robust and reliable. The adaptation phase (on-line reinforcement learning) is a very complicated task. In the following we will approach the problem with an "exploration" of the optimal solutions. This approach was developed and applied by the authors for the on-line adaptation of an inventory system policy and proves to be very promising.

4.1. Off-line (design) phase

The off-line reinforcement learning is a simulation optimization problem that we will first approach with path search methods. If x_k is the kth iteration point of a path-search algorithm in a p-dimensional search space, the next point will be given by:

x_{k+1} = x_k + η d_k (4.1)

where η is called the learning coefficient and d_k is the search direction at iteration k, given by:

d_k = ∇_x PI(x_k) + α d_{k−1} (4.2)

where α is called the momentum coefficient and ∇_x PI(x_k) is the (estimated) gradient of the PI with respect to x evaluated at x_k. The terminology used above is commonly used in the field of neural networks, while in mathematical programming terms (4.1) and (4.2) imply a conjugate gradient method with step length η and deflection parameter α.8,11 (Even though the deflection parameter will be fixed arbitrarily and not determined in the usual conjugate gradient ways.) The algorithm will generate new points in the search space and will finally stop when a stopping condition (such as a small relative change in the PI or a small gradient, etc.) is met.
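A minimal sketch of the path search of Eqs. (4.1)-(4.2) is given below, with the 0.1% relative-change stopping rule used later in Section 5.1. Here `pi` is assumed to run one simulation and return the estimated PI for a parameter vector, and `grad_estimate(pi, x)` is any gradient estimator such as the FFD or RSM estimates discussed next; all names are illustrative.

```python
import numpy as np

def path_search(pi, x0, grad_estimate, eta=0.2, alpha=0.0, max_iter=100, tol=1e-3):
    """Path search: x_{k+1} = x_k + eta*d_k, d_k = grad_PI(x_k) + alpha*d_{k-1};
    stops when the relative change in PI is below `tol` (0.1% by default)."""
    x = np.asarray(x0, dtype=float)
    d = np.zeros_like(x)
    pi_old = pi(x)
    for _ in range(max_iter):
        d = grad_estimate(pi, x) + alpha * d
        x = x + eta * d
        pi_new = pi(x)
        if abs(pi_new - pi_old) <= tol * abs(pi_old):
            break
        pi_old = pi_new
    return x, pi_new
```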

The critical term in (4.2) is the (estimated) gradient. The gradient can be estimated using several approaches.9,10 In the forward finite differences (FFD) approach, the gradient is estimated by perturbing separately each parameter in the search space and observing the consequent change in the PI. In the particular case we are considering there are two search parameters (w and R) and therefore the PI can be regarded as a function of these two parameters, PI = PI(w, R). Thus, the estimated gradient at a point (w_0, R_0) in the FFD approach will be given by:

∇_x PI(w_0, R_0) = [ ∂PI/∂w , ∂PI/∂R ]^T evaluated at (w_0, R_0)
                 ≈ [ (PI(w_0 + δw, R_0) − PI(w_0, R_0)) / δw , (PI(w_0, R_0 + δR) − PI(w_0, R_0)) / δR ]^T (4.3)

where δw and δR are perturbations small enough to give an accurate estimate of the gradient and big enough not to confuse the system stochastic component with the actual gradient. Note that the FFD approach requires three simulations to estimate the gradient at each point (for a p-dimensional search space it requires p+1 simulations), and it is better to use common random numbers. We also propose a modification of the FFD approach that can be regarded as a cyclic coordinate forward finite difference approach (CFFD). In this approach the derivatives of the PI with respect to the search parameters are still estimated in the same way, but the actual points at which they are estimated change. Indeed, we first estimate the first component of the gradient, move accordingly and then estimate the second component of the gradient at the new point, and continue like this in a cyclic way. This method is a kind of hybrid between the FFD and the cyclic coordinate method.11
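Under the same assumptions, a sketch of the FFD gradient estimate of Eq. (4.3); the default perturbations δw = 0.1 and δR = 0.5 are the ones used later in Section 5.1.

```python
import numpy as np

def ffd_gradient(pi, x, deltas=(0.1, 0.5)):
    """Forward finite differences: perturb each parameter separately
    (p+1 simulations per gradient estimate for p parameters)."""
    x = np.asarray(x, dtype=float)
    deltas = np.asarray(deltas, dtype=float)
    base = pi(x)
    return np.array([(pi(x + d * e) - base) / d
                     for d, e in zip(deltas, np.eye(len(x)))])
```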

Another paradigm to estimate the gradient lies in the field of response surface methodologies (RSM). This approach fits a meta model to the data coming from the simulation and extracts the gradient according to this fit. The easiest way to do this is by using a linear meta model where the PI is a first order polynomial in the search space parameters. In our case the meta model would be:

PI = β0 + β1w + β2 R + ε (4.4)

where ε is an error term that under certain assumptions is normally distributed with zero mean and finite variance.9,10

After running some simulations for different values of w and R, the corresponding PI is obtained, and the meta model of (4.4) is fitted to the data, thus obtaining some least square error estimates for the βs. It is easy to see that [β1, β2]^T is an estimate of the gradient. Of course values for w and R cannot be arbitrarily chosen, but some kind of experimental design has to be followed.


The most common type of experimental design is a full factorial design, which is quite "reliable" for gradient estimation, but unfortunately requires 2^p simulations in a p-dimensional search space, generating an explosion of complexity with increasing cardinality of the search space. In this case a gradient estimation using a full factorial design requires 4 simulation runs, thus being feasible to use.
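For comparison, a sketch of the RSM gradient estimate: fit the linear meta model of Eq. (4.4) over a 2^2 full-factorial design centred at the current point and read the gradient off the fitted coefficients (names are illustrative).

```python
import numpy as np

def rsm_gradient(pi, x, deltas=(0.1, 0.5)):
    """Least-squares fit of PI = b0 + b1*w + b2*R over a full factorial design;
    [b1, b2] estimates the gradient (4 simulations for p = 2)."""
    x = np.asarray(x, dtype=float)
    deltas = np.asarray(deltas, dtype=float)
    design = np.array([[sw, sr] for sw in (-1.0, 1.0) for sr in (-1.0, 1.0)])
    offsets = design * deltas
    A = np.column_stack([np.ones(len(offsets)), offsets])
    y = np.array([pi(x + row) for row in offsets])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta[1:]
```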

An alternate class of search methods is that of random search methods, e.g., genetic algorithms (GAs) and evolutionary programming (EP). In evolutionary programming the algorithm starts with an initial (random) population of k individuals, i.e., points in the search space.12 Each individual will have a fitness, i.e., the value of the PI corresponding to that point; note that the fitness estimation requires one simulation. According to the fitness of each individual, a new generation will be formed by: retaining the "healthier" individuals (elitism); generating "off-spring" of the old individuals (applying some perturbation to some "parents" chosen through some kind of probabilistic selection process); and creating some new random individuals. The algorithm will finally end after a certain number of generations or when there is some kind of uniformity in the population. It is important to notice that EP is quite different from GAs in that it does not require an additional binary encoding and mainly works on perturbations of parents to generate off-spring, rather than using cross-over operators.

4.2. On-line (adaptation) phase

The adaptation-through-exploration paradigm is strongly based on the reinforcement learning approach described above and depicted in Fig. 1. Given a DEDS, its operating conditions will generally have random variations, according to the influence of the environment on the system and of changes of the system itself. However, we can reasonably identify the possible causes for changes in operating conditions (perturbing elements), as well as their range of variation, which we call the perturbing element space (PES), through an understanding of the environment-system interactions. Through a sensitivity analysis of the performance of the system with respect to such perturbing elements, the most relevant ones can be determined. Eventually their number will be limited, and their range can be bounded through a "common sense", practical approach, thus leading to a subset of the critical perturbing element (CPE) space. Unfortunately, in most practical cases the CPE subset still has a large cardinality (since it represents the environment/system interaction); therefore, it is not feasible to completely "explore" it (with an off-line simulator based reinforcement learning approach) to determine the optimal controller parameters to use in each operating condition. However, a quick "exploration" can be done by just considering some "well chosen" points, that is, simulating some significant changes due to the environment (or the plant). If this step is done properly, we should then be able to interpolate what happens for other critical points that were not explored. This process can be summarized as follows. First, the CPE subspace is explored (off-line) by stimulating selected changes in operating conditions and recording the consequent variation in the optimal controller parameters. Second, these data are routed to an adaptation module that learns and generalizes from them. Finally, the controller should be able to work on-line, adapting itself to random environmental changes by means of the adaptation module. In the following the CPE subset is defined for the machine repair example; and, both a FLS and an ANN are used as adaptation modules, and their performances are compared.

5. Experimental Results

5.1. Off-line reinforcement learning

In the following the learning schemes described in 4.1 are implemented. As a stopping criterion for all the path search algorithms the relative change in the PI is used, i.e., the algorithm will stop when the PI has had a relative change of 0.1% or less. Such a percentage was used since it is useless to have a more stringent (smaller percentage) stopping criterion in the case of a stochastic plant. Values of δw = 0.1 and δR = 0.5 and random initial points were used.

In Fig. 4 some learning curves are shown for different values of the learning coefficient and for different starting points for the FFD, RSM and CFFD approaches. In Fig. 4-(a) we see the learning curves for FFD learning (FFD1) and for RSM learning (RSM1). The curve corresponding to FFD1 (*) is characterized by η=0.2, α=0 and final values w=1.0, R=4.14 and PI=119.6, i.e., FFD1 "fell" into one of the local minima. The same happens with RSM1 (o) (η=0.2, α=0.01) that ends at w=1.0, R=1.5 and PI=119.6. In Fig. 4-(b) we can see the learning curves for the cyclic finite differences (CFFD) for three cases. The first curve corresponds to CFFD1 (*) and has been obtained for η=0.5 and α=0.05 and converges to w=1.3, R=10.6, PI=119.6. A value of w larger than 1 and the value of PI show us that this learning has converged to one of the local minima. The second curve, CFFD2 (o), was obtained for η=0.8 and α=0.1 and converges to w=1.0, R=4.51 and PI=119.6. Even in this case we are trapped in a local minimum.

Page 9: A SIMULATOR-BASED REINFORCEMENT-LEARNING PARADIGM€¦ · Modified gradient learning methods and evolutionary programming methods are compared for the purpose of optimizing the controller.

In the third case, CFFD3 (+), lower values for η and α have been used, i.e., η=0.2, α=0. With these values the algorithm converges again to a local minimum given by: w=1.8, R=1.2 and PI=119.6. By observing these curves and the corresponding values of η and α, we can conclude that:
• η and α must not be too low at the beginning of the learning, as the learning procedure needs to start moving quickly towards the optimum;
• η and α cannot be too high because of irregularities in the PI function. It can occur (CFFD1 and CFFD3) that R becomes smaller than 1, thus causing a large decrement in the PI. A subsequent recovery is possible, but is not assured.
We can summarize these two considerations by saying that we need to learn quickly, avoiding dangerous instabilities. Finally, RSM seems to be really slow and oscillatory in convergence, even though requiring a high number of simulation runs at each iteration. Thus, RSM is not considered further in the following.

The preceding considerations suggest that we modify the FFD and CFFD with the standard assumption made when using stochastic gradient algorithms, i.e., using a scheduled decrease in the learning rate (the learning rate at iteration k is η/k). This will provide a high learning rate initially, and a decreasing smaller rate while approaching the maximum. In Fig. 5-(a) we see the learning curve for a FFD with decreasing learning rate, FFD2, characterized by η=0.8, α=0.01 and final point: w=0.76, R=0.9 and PI=76. It is obvious that FFD2 does not learn successfully, but it is interesting to note the oscillation in the learning curve. Indeed, the learning is successful in terms of the "envelope" of the learning curve. The oscillation is caused by repeated crossing of the value R=1; this causes quick performance degradation that the FFD is eventually capable of recovering from. Figure 5-(b) shows the learning curves for CFFD modified with decreasing learning rate. The first curve, CFFD4 (*), is obtained from η=0.3 and α=0.05 and leads to w=0.62, R=1.3 and PI=137.6. The second curve, CFFD5 (o), is obtained from η=0.7 and α=0.05 and leads to w=0.66, R=1.2 and PI=149.5.

[Figure 4. Learning Curves: (a) FFD1 and RSM1; (b) CFFD1, CFFD2 and CFFD3; PI versus iteration.]

[Figure 5. Learning Curves with decreasing learning rate: (a) FFD2; (b) CFFD4 and CFFD5; PI versus iteration.]


In both cases the situation is slightly better, but there still is a sort of instability that causes the algorithm to discard good values of PI. Moreover, the learning procedure reaches a point where it cannot go further since the learning coefficient has decreased so much that there is no significant learning.

To help us in the learning process, we can also use some knowledge of the system, that is, we know that we always want to send at least one machine to repair if the threshold condition is met. This means that R is not allowed to fall to values smaller than unity. By putting such a constraint on R as a barrier function that blocks R at 1 if there is any attempt (from the learning algorithm) to decrease it further, we obtain the last series of path search methods (still using a decreasing learning coefficient) whose learning curves are shown in Fig. 6. Figure 6-(a) shows the learning curves for three different FFDs, each characterized by: FFD3 (*), η=0.8, α=0.01, final w=0.65, R=1.0, PI=184; FFD4 (o), η=0.6, α=0.01, final w=0.7, R=1.0, PI=199; FFD5 (+), η=0.3, α=0.01, final w=0.68, R=1.0, PI=190. Figure 6-(b) shows two learning curves obtained using the CFFD method. The first curve, CFFD6 (*), was obtained for η=0.6 and α=0.1, and converges to w=0.68, R=1.0 and PI=189.2. The second curve, CFFD7 (o), was obtained for η=0.8 and α=0.2, and converges to w=0.72, R=1.0 and PI=187.6. This final modification definitely gives better results than the two previous ones. There is still the problem of reaching the maximum and stopping there, as we can see from the learning curves. These three methods give good results, but it is difficult to approach the optimum in a satisfactory way, since there are many irregularities that remain even when using longer simulations or more runs. Therefore, assuming that the irregularity is a part of the statistical nature of the problem, a final method has been used: evolutionary programming.
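Before turning to the evolutionary approach, note that the two modifications used above (the η/k schedule and the barrier keeping R at 1) amount to small changes in the earlier path-search sketch; the version below is a sketch under the same assumptions, with illustrative names and defaults.

```python
import numpy as np

def constrained_path_search(pi, x0, grad_estimate, eta=0.6, alpha=0.01,
                            max_iter=50, tol=1e-3):
    """Path search with a decreasing learning rate eta/k and a barrier that
    blocks the second parameter (R) at 1."""
    x = np.asarray(x0, dtype=float)
    d = np.zeros_like(x)
    pi_old = pi(x)
    for k in range(1, max_iter + 1):
        d = grad_estimate(pi, x) + alpha * d
        x = x + (eta / k) * d
        x[1] = max(x[1], 1.0)        # never plan to send less than one machine
        pi_new = pi(x)
        if abs(pi_new - pi_old) <= tol * abs(pi_old):
            break
        pi_old = pi_new
    return x, pi_new
```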

Using a standard EP approach, a population of 10 individuals was used. With a deterministic process the best four were chosen; and then four off-spring were generated from them. The remaining two individuals were randomly chosen. The perturbations used to determine the off-spring were assumed to be additive terms, uniformly distributed in [-0.1, 0.1] for w, and in [-0.5, 0.5] for R. New random individuals are sampled from uniform distributions in [0, 1] and [0, 5] for w and R, respectively. The initial population is chosen randomly. In Fig. 7 the best and worst case individuals among the four best in each generation are plotted versus the generation number for three differently "seeded" evolutionary algorithms (EAs). The final four best individuals for these three different EAs are given in Table 1.
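A sketch of the evolutionary programming scheme just described: a population of 10 (w, R) points, the best 4 retained, 4 offspring obtained by uniform perturbation of those parents, and 2 new random individuals per generation; `pi` is assumed to run one simulation per fitness evaluation and accept a (w, R) pair, and all names are illustrative.

```python
import random

def evolutionary_search(pi, generations=50, seed=None):
    """EP over (w, R): elitism on the best 4, 4 perturbed offspring, 2 random."""
    rng = random.Random(seed)
    new_individual = lambda: (rng.uniform(0.0, 1.0), rng.uniform(0.0, 5.0))
    population = [new_individual() for _ in range(10)]
    best_overall, best_fitness = None, float("-inf")
    for _ in range(generations):
        fitness = [pi(ind) for ind in population]          # one simulation each
        ranked = sorted(zip(fitness, population), reverse=True)
        if ranked[0][0] > best_fitness:
            best_fitness, best_overall = ranked[0]
        parents = [ind for _, ind in ranked[:4]]           # elitism
        offspring = [(w + rng.uniform(-0.1, 0.1), r + rng.uniform(-0.5, 0.5))
                     for w, r in parents]
        population = parents + offspring + [new_individual() for _ in range(2)]
    return best_overall, best_fitness
```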

[Figure 6. Learning Curves with constraint on R: (a) FFD3, FFD4 and FFD5; (b) CFFD6 and CFFD7; PI versus iteration.]

[Figure 7. Learning Curves - Evolutionary Algorithm: panels (a), (b) and (c) show PI versus generation for the three differently seeded EAs.]


Though this kind of learning algorithm is dependent on initialization, and offers no guarantee that a certain number of generations will give good results, the results are fairly consistent as well as almost optimal. The procedure is not fast (but not excessively time consuming); and it is an automatic off-line method.

5.2. On-line reinforcement learning

For the on-line reinforcement learning it has been assumed that the sources of variability in the operating conditions (critical perturbing elements, CPE) of the plant are the mean time between failures (MTBF) and the mean time to repair (MTTR) of the machines. The CPE subspace is considered to be [4, 10] × [0.5, 1.5], that is, it is assumed that (from some knowledge of the plant) the MTBF will eventually be varying in [4, 10] and the MTTR in [0.5, 1.5]. The CPE subspace is therefore "explored" by trying to find the optimal w and R for some points in it. An EA like the one described in the preceding section was used and the points that are explored lie on a uniformly spaced 4×3 grid on the CPE subspace. This means that the EA is run on all the points in {4, 6, 8, 10} × {0.5, 1, 1.5} and the "optimal" solutions are found for each point. In every solution the optimal value for R turns out to be always 1.0 (even though the EA is not assuming any kind of constraint on R), thus suggesting that it is better to always send only one machine to repair and to adjust the threshold implicitly defined by w. For this reason from now on R will be considered to be set to 1 without further discussion. Table 2 shows the optimal results in the explored points along with the optimal PI.

The next step is to efficiently interpolate (generalize) between these optimal points. Well known approximating paradigms have been used, namely, fuzzy logic systems (FLS) and artificial neural networks (ANN).8,13 A fuzzy logic system with two antecedents (MTBF and MTTR) and one consequent (w) was designed from the "exploration" data in Table 2. The corresponding membership functions are shown in Fig. 8 and the set of 12 rules is given in Table 3. The FLS uses singleton fuzzification, max-product inference, max rule composition and modified height defuzzification. Artificial neural networks constitute a second interpolating paradigm. In this case a 2-10-1 ANN was trained on the "exploration" data with an error goal of 10^-4, a learning coefficient of 0.8 and a momentum coefficient of 0.3. This network has two inputs (MTBF and MTTR) and one output (w) and it uses sigmoidal squashing functions. Both the FLS and the ANN have the MTBF and the MTTR as inputs. Obviously those values are not known a priori and some kind of estimates are needed. Therefore, the times between failures and the times to repair are collected during the simulation and their averages serve as estimates for the MTBF and the MTTR. Since the plant is dynamic and MTBF and MTTR will change over time, the above mentioned average is taken on the data collected in some sliding window; the dimension of this window becomes a design issue. The FLS and ANN are now ready to operate on-line as adaptation modules; therefore, they will monitor the evolution of the plant through the sliding window and impose changes to the controller parameter accordingly.
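A sketch of the sliding-window estimation of MTBF and MTTR that feeds the adaptation module is given below; the window length corresponds to the x in FLx/NNx, and `adaptation_module` stands for either the fuzzy or the neural mapping from (MTBF, MTTR) to w. Names are illustrative, not the authors' implementation.

```python
from collections import deque

class SlidingEstimator:
    """Keeps the last `window` observed times between failures and times to
    repair, and returns their averages as MTBF and MTTR estimates."""
    def __init__(self, window=10):
        self.tbf = deque(maxlen=window)
        self.ttr = deque(maxlen=window)

    def record_failure(self, time_since_previous_failure):
        self.tbf.append(time_since_previous_failure)

    def record_repair(self, time_to_repair):
        self.ttr.append(time_to_repair)

    def estimates(self):
        mtbf = sum(self.tbf) / len(self.tbf) if self.tbf else None
        mttr = sum(self.ttr) / len(self.ttr) if self.ttr else None
        return mtbf, mttr

# On-line use (adaptation_module is the trained FLS or ANN):
# mtbf, mttr = estimator.estimates()
# if mtbf is not None and mttr is not None:
#     w = adaptation_module(mtbf, mttr)
```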

            First EA (a)               Second EA (b)              Third EA (c)
w=0.7203 R=1.0000 PI=196    w=0.6808 R=1.0008 PI=191    w=0.6830 R=1.0050 PI=194
w=0.7217 R=1.0068 PI=194    w=0.6862 R=1.0278 PI=189    w=0.6858 R=1.0192 PI=193
w=0.7223 R=1.0099 PI=193    w=0.6818 R=1.0054 PI=187    w=0.6826 R=1.0031 PI=192
w=0.7218 R=1.0072 PI=192    w=0.6834 R=1.0138 PI=187    w=0.6847 R=1.0136 PI=188

Table 1. Evolutionary Algorithm - best individuals

Table 2. Performance on exploration points: optimal controller, fuzzy and neural adaptation modules

MTBF ↓  MTTR →                0.5                 1.0                 1.5
 4   Optimal              w=0.86, PI=310      w=0.70, PI=114      w=0.06, PI=-73
     FL10 / FL50 / FL200  311 / 311 / 309     117 / 68 / 91       -191 / -233 / -109
     NN10 / NN50 / NN200  315 / 312 / 308     108 / 108 / 94      -158 / -388 / -213
 6   Optimal              w=0.94, PI=370      w=0.86, PI=256      w=0.66, PI=127
     FL10 / FL50 / FL200  364 / 365 / 369     255 / 246 / 254     156 / 110 / 76
     NN10 / NN50 / NN200  367 / 367 / 367     244 / 250 / 247     106 / 111 / 105
 8   Optimal              w=0.94, PI=393      w=0.87, PI=313      w=0.79, PI=232
     FL10 / FL50 / FL200  392 / 393 / 394     315 / 313 / 312     239 / 213 / 213
     NN10 / NN50 / NN200  393 / 393 / 393     311 / 306 / 313     215 / 219 / 232
10   Optimal              w=0.95, PI=409      w=0.92, PI=347      w=0.85, PI=284
     FL10 / FL50 / FL200  408 / 409 / 409     346 / 345 / 349     289 / 281 / 282
     NN10 / NN50 / NN200  407 / 410 / 410     345 / 349 / 346     261 / 276 / 285


A first test of the adaptation module can be done by checking its performance on the exploration points, from which it was created. Table 2 summarizes the results for this first test, where FLx (NNx) stands for a fuzzy logic (neural network) adaptation module with a sliding window of dimension x. Both ANN and FLS give good overall results. More specifically, from Table 2 we can see that there is a major performance degradation in (4, 1.5) with respect to optimal performances, with both FLS and ANN. The only case where this degradation is contained is for a FLS using a sliding window of 200 observations. The problem in this case seems to be the large performance degradation corresponding to such values of MTBF and MTTR. Indeed, a high MTTR and a low MTBF correspond to frequently failing machines that take a long time to repair. This gives substantial additional costs that make the situation difficult to manage; indeed, small variations in w cause big variations (decreases) in the PI. There is some smaller performance degradation (for some values of the sliding window dimension) also in (4, 1.0), (6, 1.5) and (8, 1.5) in Table 2. A sliding window of 10 seems to give good results for both the FLS and the ANN. The latter is 10% sub-optimal in (6, 1.5), regardless of the sliding window size. All other data show performances very close to optimal ones. The general trend suggested by the data is that a small sliding window works well with the fuzzy adaptation module, while a big sliding window works well (except in a few cases) with the neural adaptation module. It is interesting to note that in some instances the performances of the system with adaptation modules seem to be better than optimal. Though this might seem strange, there are two reasons that explain those results. First of all, in the comparison we have to take into account the variability inherent to the plant, which makes small differences negligible. Moreover, what we are calling the "optimal performances" or the "optimal controller" are only optimal solutions for that particular type of controller, and not for the problem itself. Since the static controller works with fixed parameters for the whole simulation, while the dynamic controller (with the adaptation) may change the value of its parameters at any time in the simulation, the optimal performances of these two controllers will in general be different.

Another test of the adaptation modules consists in checking their extrapolation (even though we are assuming we will not need any, it is interesting to check) and interpolation capabilities. Table 4 gives the results of simulations for the optimal controller (determined with an EA) and for the neural and fuzzy adaptation in (2, 0.5), (10, 2) (i.e., extrapolation) and (5, 1) (i.e., interpolation). The neural adaptation seems to have fairly good extrapolation capabilities and good interpolation ones. On the other hand, the fuzzy adaptation works well in terms of interpolation (regardless of the window size) and extrapolation at (10, 2) (with small window size), but it gives unsatisfactory performances for (2, 0.5). This is probably caused by the type of membership functions that were used. The "saturation" of the membership functions could be the cause of bad extrapolating capabilities. These bad extrapolation capabilities are not a problem in our case, since one of the initial assumptions in the method is that the CPEs are going to vary in some subspace that we explore; therefore, we are practically only interested in interpolation properties. Nonetheless it is interesting to compare the fuzzy and the neural extrapolation capabilities.

The final test is the one that simulates a real operating situation. Therefore, it is assumed that every 1000 time units new values of MTBF and MTTR are uniformly sampled from the CPE subspace ([4, 10] × [0.5, 1.5]). The simulation was run for 50,000 time units and fixed controllers were compared to dynamic controllers using the adaptation module. In particular, fixed controllers with w=0.6, 0.7, 0.8 and 0.9 were used since those values of w are in the range of the optimal values found previously. The results for this last test are shown in Table 5.
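The schedule of operating conditions for this final test can be sketched as below: every 1000 time units new MTBF and MTTR values are drawn uniformly from the CPE subspace over a 50,000 time-unit horizon (function name and return format are illustrative).

```python
import random

def operating_condition_schedule(horizon=50_000, block=1_000, seed=None):
    """Returns a list of (start_time, MTBF, MTTR) blocks for the final test,
    with MTBF ~ U[4, 10] and MTTR ~ U[0.5, 1.5] redrawn every `block` units."""
    rng = random.Random(seed)
    return [(t, rng.uniform(4.0, 10.0), rng.uniform(0.5, 1.5))
            for t in range(0, horizon, block)]
```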

[Figure 8. Membership functions for MTBF (a), MTTR (b) and w (c): MTBF uses fuzzy sets S, M, L, VL over [4, 10]; MTTR uses S, M, L over [0.5, 1.5]; w uses VS, S, M, L, VL centered near 0.06, 0.68, 0.79, 0.86 and 0.94.]

Table 3. Rule base

MTBF ↓  MTTR →   S    M    L
S                L    S    VS
M                VL   L    S
L                VL   L    M
VL               VL   VL   L

Table 4. Extrapolation and interpolation of the fuzzy and neural modules

           MTBF=2, MTTR=0.5   MTBF=10, MTTR=2   MTBF=5, MTTR=1
Optimal    w=0.66, PI=110     w=0.79, PI=215    w=0.72, PI=196
FL10       -2                 225               199
FL50       -2                 150               193
FL200      4                  158               199
NN10       113                160               203
NN50       99                 195               187
NN200      96                 194               199


Here we see that the fuzzy or neural adaptation is effective in giving performances that are more than 20% better than the ones with the fixed controllers. Moreover, we note that a small sliding window seems to be the best choice for both the fuzzy and the neural adaptation modules.

6. Conclusions

A general reinforcement learning paradigm for both off-line and on-line adaptation of a parametric controller for a DEDS was presented. To demonstrate its use, a machine repair example was formulated and its DEVS model was developed. The off-line paradigm used standard simulation optimization techniques (finite differences, response surface methodologies) and advanced random search (evolutionary programming) to determine the optimal controller. The on-line reinforcement learning used the approximation paradigms of fuzzy logic systems and artificial neural networks to provide adaptation. Both adaptation approaches (FLS and ANN) showed good approximation and adaptation properties. The neural adaptation also showed better extrapolation capabilities, even though extrapolation is not an issue.

The proposed method is still in its early phase, and needs further study and experimentation to prove its general validity and applicability. Furthermore, some theoretical considerations need to be addressed, e.g., regarding the type of "exploration" of the critical-perturbing-elements subspace to perform.

ACKNOWLEDGMENT

The authors would like to thank DuPont for support on the development of this paper and the anonymous reviewers for their important suggestions.

REFERENCES

[1] C.G. Cassandras, Discrete Event Systems: Modeling and Performance Analysis, Richard D. Irwin and Aksen Associates Inc. Publishers, Boston, MA, 1993.
[2] B.P. Zeigler, Theory of Modelling and Simulation, John Wiley and Sons, 1976.
[3] B.P. Zeigler, Multifacetted Modelling and Discrete Event Simulation, Academic Press, 1984.
[4] B.P. Zeigler, "DEVS Representation of Dynamic Systems: Event-Based Intelligent Control," Proceedings of the IEEE, vol. 77, n. 1, pp. 72-80, January 1989.
[5] B.P. Zeigler, Object Oriented Simulation with Hierarchical Modular Models, Academic Press, 1990.
[6] D.E. Kirk, Optimal Control Theory: An Introduction, Prentice-Hall Inc., Englewood Cliffs, NJ, 1970.
[7] P.J. Werbos, "Approximate dynamic programming for real-time control and neural modeling," in Handbook of Intelligent Control: Neural, Fuzzy and Adaptive Approaches (D.A. White and D.A. Sofge, editors), Van Nostrand Reinhold, New York, 1992.
[8] S. Haykin, Neural Networks: A Comprehensive Foundation, Macmillan College Publishing and IEEE Press, 1994.
[9] S.H. Jacobson and L.W. Schruben, "Techniques for Simulation Response Optimization," Operations Research Letters, vol. 8, n. 1, pp. 1-9, February 1989.
[10] M.H. Safizadeh, "Optimization in Simulation: Current Issues and the Future Outlook," Naval Research Logistics, vol. 37, pp. 807-825, 1990.
[11] M.S. Bazaraa, H.D. Sherali and C.M. Shetty, Nonlinear Programming: Theory and Algorithms, John Wiley and Sons Inc., NY, 1993.
[12] D.B. Fogel, Evolutionary Computation: Toward a New Philosophy of Machine Intelligence, IEEE Press, NY, 1995.
[13] J.M. Mendel, "Fuzzy logic systems for engineering: a tutorial," Proceedings of the IEEE, vol. 83, no. 3, pp. 345-377, March 1995.

Table 5. Adaptation

Controller   PI
w=0.6        174
w=0.7        197
w=0.8        192
w=0.9        183
FL10         252
FL50         241
FL200        244
NN10         243
NN50         240
NN200        235


PAOLO DADONE received the "laurea" degree with honors in electronic engineering in 1995 from the Politecnico di Bari, Italy, and the M.S. degree in electrical engineering in 1997 from the Virginia Polytechnic Institute and State University, USA, where he currently is a PhD student. His research interests are in intelligent control algorithms development and implementation, discrete event dynamic systems and manufacturing systems. Mr. Dadone is a member of several honor societies as well as IEEE technical societies. He was the recipient of the 1996 IEEE VMS graduate student paper contest and of the 1996 Politecnico di Bari fellowship for studying abroad.

HUGH VANLANDINGHAM is a professor in the Bradley Department of Electrical and Computer Engineering, where he has served since September 1966. He received his B.E.E. degree from N.C. State University in 1957, M.E.E. from New York University in 1959 and a Ph.D. in Electrical Engineering from Cornell University in 1967. From 1957 to 1962 Dr. VanLandingham worked in the Bell Telephone Labs, Whippany, NJ. His research has been, or is being, supported by NASA, ONR, NSF, NSWC, Lockheed, DuPont, and Eastman Chemical. He is the author of three textbooks in the areas of signal processing and control and has published more than 70 technical papers in journals and international conferences. Dr. VanLandingham has taught courses in virtually all areas of the undergraduate curriculum. At the graduate level his teaching concentration has been in the areas of random processes, signal processing and automatic control systems. Whereas earlier research was focused on conventional digital methods, primarily digital control systems, his more recent areas of interest are in applications of "soft computing." This broad area is a subset of artificial intelligence that has to do with paradigms that mimic Nature. Included in this soft computing area are the studies of artificial neural networks, fuzzy logic systems and evolutionary computation.

BRUNO MAIONE received the laurea in electrical engineering with honors from the University of Naples. Currently, he is full professor of Automatic Control at the Department of Electrical and Electronic Engineering of the Polytechnic of Bari. In 1983 and 1985 he was a visiting professor with The University of Florida, Gainesville. He held the position of Dean of the Faculty of Engineering from 1986 to 1992. His primary areas of research are discrete event dynamical systems and intelligent control.