
A NOVEL OPTIMIZATION ALGORITHM BASED ON REINFORCEMENT LEARNING

Janusz A. Starzyk, Yinyin Liu, Sebastian Batog

Abstract In this chapter, an efficient optimization algorithm is presented for problems with hard-to-evaluate objective functions. It uses the reinforcement learning principle to determine the moves of a search particle in the search for the optimum. A model of successful actions is built, and future actions are based on past experience. The step increment combines exploitation of the known search path and exploration for an improved search direction. The algorithm does not require any prior knowledge of the objective function, nor does it require any characteristics of such a function. It is simple, intuitive, and easy to implement and tune. The optimization algorithm was tested on several multi-variable functions and compared with other widely used random search optimization algorithms. Furthermore, the training of a multi-layer perceptron, posed as finding a set of optimized weights, is treated as an optimization problem, and the optimized multi-layer perceptron was applied to classification of the Iris database. Finally, the algorithm is used in image recognition to find a familiar object with retina sampling and micro-saccades.

1 Introduction

Optimization is a process of finding the maximum or the minimum function value within given constraints by changing the values of its multiple variables.

Janusz A. Starzyk, Ohio University, School of Electrical Engineering and Computer Science, U.S.A. e-mail: [email protected]

Yinyin Liu, Ohio University, School of Electrical Engineering and Computer Science, U.S.A. e-mail: [email protected]

Sebastian Batog, Silesian University of Technology, Institute of Computer Science, Poland. e-mail: [email protected]


It can be essential for solving complex engineering problems in such areas as computer science, aerospace, machine intelligence applications, etc. When the analytical relation between the variables and the objective function value is explicitly known, analytical methods, such as Lagrange multiplier methods [1], interior point methods [18], Newton methods [30], gradient descent methods [25], etc., can be applied. However, in many practical applications, analytical methods do not apply. This happens when the objective functions are unknown, when relations between variables and function value are not given or are difficult to find, when the functions are known but their derivatives are not available, or when the optimum value of the function cannot be verified. In these cases, iterative search processes are required to find the function optimum.

Direct search algorithms [10] comprise a set of optimization methods that do not require derivatives and do not approximate either the objective functions or their derivatives. These algorithms find locations with better function values following a search strategy. They only need to compare the objective function values in successive iterative steps to make the move decision. Within the category of direct search, distinctions can be made among three classes: pattern search methods [28], simplex methods [6], and methods with adaptive sets of search directions [23]. In pattern search methods, the variables of the function are varied either by steps of predetermined magnitude or by step sizes reduced by the same factor [15]. Simplex methods construct a simplex in $\Re^N$ using N+1 points and use the simplex to drive the search for the optimum. The methods with adaptive sets of search directions, proposed by Rosenbrock [23] and Powell [21], construct conjugate directions using information about the curvature of the objective function during the search.

In order to avoid local minima, random search methods were developed that utilize randomness in setting the initial search points and other search parameters, such as the search direction or the step size. In Optimized Step-Size Random Search (OSSRS) [24], the step size is determined by fitting a quadratic function to the optimized function values in each of the random directions. The random direction is generated with a normal distribution of a given mean and standard deviation. Monte-Carlo optimizations adopt randomness in the search process to create opportunities to escape from local minima. Simulated Annealing (SA) [13] is one typical kind of Monte-Carlo algorithm. It exploits the analogy between the search for a minimum in the optimization problem and the annealing process, in which a metal cools and stabilizes into a minimum-energy crystalline structure. It accepts a move to a new position with a worse function value with a probability controlled by the "temperature" parameter, and this probability decreases along the "cooling process". SA can deal with highly nonlinear, chaotic problems provided that the cooling schedule and other parameters are carefully tuned.

Particle Swarm Optimization (PSO) [11] is a population-based evolutionary computational algorithm. It exploits cooperation within the solution population instead of competition among its members. At each iteration in PSO, a group of search particles make moves in a mutually coordinated fashion. The step size of a particle is a function of both the best solution found by that particle and the best solution found so far by all the particles in the group. The use of a population of search particles


and the cooperation among them enable the algorithm to evaluate function values over a wide range of variables in the input space and to find the optimum position. Each particle remembers only its own best solution and the global best solution of the group to determine its step sizes.

Generally, during the course of the search, these optimization methods make a sequence of decisions on the step sizes and obtain a number of function values. In order to implement an efficient search for the optimum point, it is desirable that such historical information be utilized in the optimization process.

Reinforcement Learning (RL) [27] is a type of learning process that maximizes certain numerical values by combining exploration and exploitation and using rewards as learning stimuli. In the reinforcement learning problem, the learning agent performs experiments to interact with the unknown environment and accumulates knowledge during this process. It is a trial-and-error exploratory process with the objective of finding the optimum action. During this process, an agent can learn to build a model of the environment to guide its search, so that the agent can predict the environment's response to its actions and choose the actions most useful for its objectives based on its past exploring experience.

Surrogate-based optimization refers to the idea of speeding up the optimization process by using surrogates for the objective and constraint functions. The surrogates also allow for the optimization of problems with non-smooth or noisy responses, and can provide insight into the nature of the design space. The max-min SAGA approach [20] searches for designs that have the best worst-case performance in the presence of parameter uncertainty. By leveraging a trust-region approach that uses computationally cheap surrogate models, it allows for the possibility of achieving robust design solutions on a limited computational budget.

Another example of surrogate-based optimization is the surrogate assisted Hooke-Jeeves algorithm (SAHJA) [8], which can be used as a local component of a global optimization algorithm. This local searcher uses the Hooke-Jeeves method, which performs its exploration of the input space intelligently, employing both the real fitness and an approximated function.

The idea of building knowledge about an unknown problem through exploration can be applied to optimization problems. To find the optimum of an unknown multivariable function, an efficient search procedure can be performed using only historical information from conducted experiments to expedite the search. In this chapter, a novel and efficient optimization algorithm based on reinforcement learning is presented. This algorithm uses simple search operators and will be called reinforcement learning optimization (RLO) in later sections. It does not require any prior knowledge of the objective function or the function's gradient information, nor does it require any characteristics of the objective function. In addition, it is conceptually very simple and easy to implement. This approach to optimization is compatible with neural networks and learning through interaction; thus it is useful for systems of embodied intelligence and motivated learning, as presented in [26]. The following sections present the RLO method and illustrate it on several machine learning applications.


2 Optimization Algorithm

2.1 Basic search procedure

An N-variable optimization objective function

\[ V = f(p_1, p_2, \ldots, p_N) \qquad (p_1, p_2, \ldots, p_N, V \in \Re) \]

could have several local minima and several global minima $V_{opt1}, \ldots, V_{optN}$. It is desired that the search process, initiated from a random point, finds a path to the global optimum point. Unlike particle swarm optimization [11], this process can be performed with a single search particle that learns how to find its way to the optimum point. It does not require cooperation among a group of particles, although implementing cooperation among several search particles may further enhance the search in this method.

At each point of the search, the search particle tries to find a new location with a better value within a searching range around it, and then determines the direction and the step size for the next move. It tries to reach the optimum through a weighted random search along each variable (coordinate). The step size of the search in each variable is randomly generated with its own probability density function. These functions are gradually learned during the search process. It is expected that at a later stage of the search, the probability density functions are well approximated for each variable; the stochastically randomized path from the start point to the minimum point of the function is then learned.

The step sizes of all the coordinates determine the center of the new searching area, and the standard deviations of the probability functions determine the size of the new searching area around the center. In the new searching area, several locations $P_S$ are randomly generated. If there is a location $p'$ with a better value than the current one, the search operator moves to it. From this new location, new step sizes and a new searching range are determined, so that the search for the optimum continues. If, in the current searching area, there is no point with a better value that the search particle can move to, another set of random points is generated, until no improvement is obtained after several, say M, trials. Then the searching area size and the step sizes are modified in order to find a better function value. If no better value is found after K trials of generating different searching areas, or the proposed stopping criterion is met, we can claim that the optimum point has been found. The algorithm for finding the minimum point is shown schematically in Fig. 1.


Fig. 1 The algorithm of RLO searching for the minimum point

2.2 Extracting historical information by weighted optimized approximation

After the search particle makes a sequence of n moves, the step sizes of these moves $dp^i_t$ ($t = 1, 2, \ldots, n$; $i = 1, 2, \ldots, N$) are available for learning. These historical steps have made the search particle move towards better values of the objective function and, hopefully, get closer to the optimum location. In this sense, these steps are the successful actions during the trial. It is proposed that the successful actions that result in a positive reinforcement (the step sizes on each coordinate) follow a function of the iterative step t, as in (1), where $dp^i$ represents the step sizes on the ith coordinate and $f^i(t)$ is the function for coordinate i:

\[ dp^i = f^i(t) \quad (i = 1, 2, \ldots, N). \tag{1} \]

These unknown functions $f^i(t)$ can be approximated, for example, by polynomials through a least-squares fit (LSF) process:

\[
\begin{bmatrix}
1 & t_1 & t_1^2 & \cdots & t_1^B \\
\vdots & \vdots & \vdots & & \vdots \\
1 & t_n & t_n^2 & \cdots & t_n^B
\end{bmatrix}
\begin{bmatrix}
a_0 \\ a_1 \\ a_2 \\ \vdots \\ a_B
\end{bmatrix}
=
\begin{bmatrix}
dp^i_1 \\ \vdots \\ dp^i_n
\end{bmatrix}
\tag{2}
\]

In (2), the step sizes $dp^i_1$ to $dp^i_n$ are the step sizes on a given coordinate during the n steps, fitted as unknown function values using polynomials of order B. The polynomial coefficients $a_0$ to $a_B$ can be obtained and represent the function $f^i(t)$ that estimates $dp^i$:


\[ dp^i = \sum_{j=0}^{B} a_j t^j. \tag{3} \]
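As a concrete illustration of (2)-(3), the fit reduces to a small linear-algebra routine. The following is a minimal numpy sketch; the function names and the use of numpy are our choices for illustration, not part of the chapter:

```python
import numpy as np

def fit_step_sizes(dp_history, order):
    """Least-squares fit (LSF) of historical step sizes dp_t, t = 1..n,
    with a polynomial of order B, as in equations (2)-(3)."""
    n = len(dp_history)
    t = np.arange(1, n + 1)
    # Vandermonde matrix of equation (2): rows [1, t, t^2, ..., t^B]
    A = np.vander(t, order + 1, increasing=True)
    # Solve for the coefficients a_0..a_B in the least-squares sense
    coeffs, *_ = np.linalg.lstsq(A, np.asarray(dp_history, float), rcond=None)
    return coeffs

def predict_step(coeffs, t_next):
    """Evaluate the fitted polynomial (3) at iteration t_next."""
    return sum(a * t_next**j for j, a in enumerate(coeffs))
```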

Using polynomials for function approximation is easy and efficient. However, considering the characteristics of optimization problems, we have two concerns. First, in order to generate a good approximation while avoiding overfitting, a proper order of the polynomial must be selected. In the optimized approximation algorithm (OAA) presented in [17], the goodness of fit is determined by the so-called signal-to-noise ratio figure (SNRF). Based on the SNRF, an approximation stopping criterion was developed. Using a certain set of basis functions for approximation, the error signal, computed as the difference between the approximated function and the sampled data, can be examined by the SNRF to determine how much useful information it contains. The SNRF for the error signal, denoted $SNRF_e$, is compared to the pre-calculated SNRF for white Gaussian noise (WGN), denoted $SNRF_{WGN}$. If $SNRF_e$ is higher than $SNRF_{WGN}$, more basis functions should be used to improve the learning. Otherwise, the error signal shows the characteristics of WGN and should not be reduced any further, to avoid fitting the noise; the obtained approximated function is then the optimum function. Such a process can be applied to determine the proper order of the polynomial.

The second concern is that, in the case of reinforcement learning, knowledge about the originally unknown environment is gradually accumulated throughout the learning process. The information that the learning system obtains at the beginning of the process is mostly based on initially random exploration. During the process of interaction, the learning system collects the historical information and builds a model of the environment. The model can be updated after each step of interaction. The decisions made at the later stages of the interaction are based more on the built model than on random exploration. This means that the recent results are more important and should be weighted more heavily than the old ones. For example, the weights applied can increase exponentially from the initial trials to the recent ones, as

\[ w_t = \frac{\alpha^t}{n} \quad (t = 1, 2, \ldots, n), \tag{4} \]

where we define $\alpha^n = n$. As a result, the weights are in the half-open interval (0, 1], and the weight is 1 for the most recent sample. Applying the weights in the LSF, we obtain the weighted least-squares fit (WLSF), expressed as follows:

\[
\begin{bmatrix}
1 \cdot w_1 & t_1 w_1 & t_1^2 w_1 & \cdots & t_1^B w_1 \\
\vdots & \vdots & \vdots & & \vdots \\
1 \cdot w_n & t_n w_n & t_n^2 w_n & \cdots & t_n^B w_n
\end{bmatrix}
\begin{bmatrix}
a_0 \\ a_1 \\ a_2 \\ \vdots \\ a_B
\end{bmatrix}
=
\begin{bmatrix}
dp_1 w_1 \\ \vdots \\ dp_n w_n
\end{bmatrix}
\tag{5}
\]

Due to the weights applied to the given samples, the approximated function fits the recent data better than the old data.
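A sketch of the WLSF of (4)-(5) in the same style, again with our own naming; note that $\alpha = n^{1/n}$ satisfies $\alpha^n = n$:

```python
import numpy as np

def fit_step_sizes_weighted(dp_history, order):
    """Weighted least-squares fit (WLSF) of equation (5), with the
    exponentially increasing weights of equation (4): w_t = alpha^t / n,
    where alpha is chosen so that alpha^n = n (hence w_n = 1)."""
    dp = np.asarray(dp_history, dtype=float)
    n = len(dp)
    t = np.arange(1, n + 1)
    alpha = n ** (1.0 / n)            # alpha^n = n
    w = alpha ** t / n                # weights in (0, 1], w_n = 1
    A = np.vander(t, order + 1, increasing=True)
    # Scale each row of the system by its weight, as in (5)
    coeffs, *_ = np.linalg.lstsq(A * w[:, None], dp * w, rcond=None)
    return coeffs, w
```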


Utilizing the concept of the OAA to obtain an optimized WLSF, the SNRF for the error signal or for WGN has to be estimated considering the sample weights. In the original OAA for the one-dimensional problem [17], the SNRF for the error signal was calculated as

\[ SNRF_e = \frac{C(e_j, e_{j-1})}{C(e_j, e_j) - C(e_j, e_{j-1})}, \tag{6} \]

where C represents the correlation calculation, $e_j$ represents the error signal ($j = 1, 2, \ldots, n$), and $e_{j-1}$ represents the circularly shifted version of $e_j$. The characteristics of the SNRF for WGN, expressed through the average value and the standard deviation, can be estimated from a Monte-Carlo simulation as (see the derivation in [17])

\[ \mu_{SNRF\_WGN}(n) = 0, \tag{7} \]

\[ \sigma_{SNRF\_WGN}(n) = \frac{1}{\sqrt{n}}. \tag{8} \]

Then the threshold, which determines whether $SNRF_e$ shows the characteristics of $SNRF_{WGN}$ so that the fitting error should not be further reduced, is

\[ th_{SNRF\_WGN}(n) = \mu_{SNRF\_WGN}(n) + 1.7\,\sigma_{SNRF\_WGN}(n). \tag{9} \]

For the weighted approximation, the SNRF for the error signal is calculated as,

\[ SNRF_e = \frac{C(e_j \cdot w_j,\; e_{j-1} \cdot w_{j-1})}{C(e_j \cdot w_j,\; e_j \cdot w_j) - C(e_j \cdot w_j,\; e_{j-1} \cdot w_{j-1})}. \tag{10} \]

In Fig. 2(a), $\sigma_{SNRF\_WGN}(n)$ from a 200-run Monte-Carlo simulation is shown on a logarithmic scale. It can be estimated as

\[ \sigma_{SNRF\_WGN}(n) = \frac{2}{\sqrt{n}}. \tag{11} \]

It is found that the 5% significance level can be approximated by the average value plus 1.5 standard deviations for an arbitrary n. Fig. 2(b) illustrates the histogram of $SNRF_{WGN}$ with $2^{16}$ samples, as an example. The threshold in this case of a dataset with $2^{16}$ samples can be calculated as $\mu + 1.5\sigma = 0 + 1.5 \times 0.0078 = 0.0117$.
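For illustration, (10) and (11) can be computed as follows, taking C(·,·) as the zero-lag correlation (inner product) of its arguments; this reading of C and the helper names are our assumptions:

```python
import numpy as np

def snrf_weighted(e, w):
    """Signal-to-noise ratio figure (SNRF) of a weighted error signal,
    equation (10). The shifted term e_{j-1} w_{j-1} is obtained by a
    circular shift of the weighted error signal."""
    ew = np.asarray(e, float) * np.asarray(w, float)
    ew_shift = np.roll(ew, 1)          # circularly shifted version
    c_cross = np.dot(ew, ew_shift)     # C(e_j w_j, e_{j-1} w_{j-1})
    c_auto = np.dot(ew, ew)            # C(e_j w_j, e_j w_j)
    return c_cross / (c_auto - c_cross)

def snrf_threshold(n):
    """WGN detection threshold for the weighted case:
    mu + 1.5 sigma with mu = 0 and sigma = 2 / sqrt(n), per (7) and (11)."""
    return 1.5 * 2.0 / np.sqrt(n)
```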

Therefore, to obtain an optimized weighted approximation in the one-dimensional case, the following algorithm is performed.

Optimized weighted approximation algorithm (OWAA):
Step (1). Assume that an unknown function F, with input space $t \subset \Re$, is described by n training samples $dp_t$ ($t = 1, 2, \ldots, n$).
Step (2). The signal detection threshold is pre-calculated for the given number of samples n based on $SNRF_{WGN}$. For a one-dimensional problem,


Fig. 2 Characteristic of SNRF for WGN in weighted approximation

\[ th_{SNRF\_WGN}(n) = \frac{1.5 \cdot 2}{\sqrt{n}}. \]

Step (3). Take a set of basis functions, for example, polynomials of order 0 up to order B.
Step (4). Use these B+1 basis functions to obtain the approximated function,

\[ \hat{dp}_t = \sum_{l=1}^{B+1} f_l(x_t) \quad (t = 1, 2, \ldots, n). \tag{12} \]

Step (5). Calculate the approximation error signal,

\[ e_t = dp_t - \hat{dp}_t \quad (t = 1, 2, \ldots, n). \tag{13} \]

Step (6). Determine the SNRF of the error signal using (10).
Step (7). Compare $SNRF_e$ with $th_{SNRF\_WGN}$. If $SNRF_e$ is equal to or less than $th_{SNRF\_WGN}$, or if B exceeds the number of samples, stop the procedure; in that case the obtained approximation is the optimized one. Otherwise, add one basis function, i.e., in this example increase the order of the approximating polynomial to B+1, and repeat Steps (4)-(7).

Using the above algorithm, the proper order of the polynomial is determined so as to extract the useful information (but not the noise) from the historical data. Moreover, the extracted information fits the recent results better than the old ones.
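Putting the pieces together, the OWAA loop might look as follows; this sketch reuses fit_step_sizes_weighted, snrf_weighted, and snrf_threshold from the sketches above, and the max_order safeguard is our own addition:

```python
import numpy as np

def owaa_fit(dp_history, max_order=10):
    """Steps (1)-(7) of OWAA: grow the polynomial order until SNRF_e
    drops to the WGN threshold or the order exceeds the sample count."""
    dp = np.asarray(dp_history, dtype=float)
    n = len(dp)
    t = np.arange(1, n + 1)
    alpha = n ** (1.0 / n)
    w = alpha ** t / n                          # weights of (4)
    th = snrf_threshold(n)                      # step (2)
    for order in range(min(max_order, n - 1) + 1):
        coeffs, _ = fit_step_sizes_weighted(dp, order)   # step (4)
        # step (5): error between samples and the fitted polynomial
        e = dp - np.vander(t, order + 1, increasing=True) @ coeffs
        if snrf_weighted(e, w) <= th:           # steps (6)-(7)
            return coeffs
    return coeffs                               # fall back to highest order
```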

We illustrate this process of learning historical information by considering a 2-variable function as an example.


Example.

The function $V(p_1, p_2) = p_2^2 \sin(1.5 p_2) + 2 p_1^2 \sin(2 p_1) + p_1 \sin(2 p_2)$ has several local minima, but only one global minimum, as shown in Fig. 3. In the process of interaction, the historical information after each iteration is collected. The historical step sizes of the 2 coordinates are approximated separately, as shown in Fig. 4 (a) and (b). The step sizes of the two coordinates are approximated by quadratic polynomials, whose order is determined by the OWAA and whose coefficients are obtained using the WLSF. In Fig. 4, the approximated functions are compared with quadratic polynomials whose coefficients are obtained from the LSF. It is observed that the function obtained using the WLSF fits the data in later iterations more closely than the function obtained using the LSF.

Fig. 3 A 2-variable function V(p1, p2)

Fig. 4 Function approximation for historical step sizes


The level of the approximation error signal $e_t$ for the step sizes of a given coordinate $dp^i$, which is the difference between the observed sampled data and the approximated function, can be measured by its standard deviation, as shown in (14).

\[ \sigma_{p_i} = \sqrt{\frac{1}{n} \sum_{t=1}^{n} (e_t - \bar{e})^2} \tag{14} \]

This standard deviation will be called the approximation deviation in the following discussion. It represents the maximum deviation of the search particle's location from the prediction made by the approximated function in the unknown-function optimization problem.

2.3 Predicting new step sizes

The approximated functions are used to determine the step sizes for the next iteration, as given in (15) and shown in Fig. 5 together with the approximated functions.

\[ dp^i_{t+1} = f^i(t+1) \tag{15} \]

Fig. 5 Prediction of the step sizes for the next iteration

The step size functions constitute the model of the environment that the learning system builds, based on historical information, during the process of interaction. A future step size determined by such a model can be employed as exploitation of the existing model. However, the model built during the learning process cannot be treated as exact. Besides exploitation, which makes the best use of the obtained model, a certain degree of exploration is desired in order to improve the model and discover better solutions. The exploration can be implemented using a Gaussian random generator (GRG). As a good trade-off between exploitation and exploration is needed, we propose to use


the step sizes for the next iteration determined by the step size functions as the mean values, and the approximation deviations as the standard deviations, of the random generator. The Gaussian random generator gives several random choices of the step sizes. Effectively, the determined step sizes of the multiple coordinates generate the center of the searching area, and the size of the searching range is determined by the standard deviations of the GRG for the coordinates. The multiple random values generated by the GRG for each coordinate effectively create multiple locations within the searching area. The objective function values at these locations are compared, and the location with the best value, called the current best location, is chosen as the place from which the search particle continues searching in the next iteration. Therefore, the actual step sizes are calculated as the distance from the "previous best location" to the "current best location". The actual step sizes are added to the historical step sizes and used to update the model of the unknown environment.
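The candidate-generation step described above reduces to a one-liner with a Gaussian random generator; a minimal sketch with our own naming:

```python
import numpy as np

def sample_search_area(p_best, dp, sigma, n_points, rng=None):
    """Generate candidate locations within the searching area: for each
    coordinate, a step is drawn from a Gaussian with mean dp_i (model
    prediction) and standard deviation sigma_i (approximation deviation),
    then added to the current best location p_best."""
    rng = np.random.default_rng() if rng is None else rng
    steps = rng.normal(loc=dp, scale=sigma, size=(n_points, len(p_best)))
    return p_best + steps
```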

The several locations of the search particle in this approach are illustrated in Fig. 6, using a 2-variable function as an example. The search particle was located at the previous best location $p_{prev}(p^1_{prev}, p^2_{prev})$, and the previous step size was found to be $dp_{prev}(dp^1_{prev}, dp^2_{prev})$ after the current best location $p(p^1, p^2)$ was found as the best location in the previous searching area (an area containing $p(p^1, p^2)$, not shown in the figure). At the current best position $p(p^1, p^2)$, using the environment model built from the historical step sizes, the current step size is determined to be $dp^1$ on coordinate 1 and $dp^2$ on coordinate 2, so that the center of the searching area is determined. The approximation deviations of the two coordinates, $\sigma_{p_1}$ and $\sigma_{p_2}$, give the size of the searching range. Within the searching range, several random points are generated in order to find a better position to which the search operator will move.

Fig. 6 Step sizes and searching area


2.4 Stopping criterion

The search particle moves from every "previous best location" to the "current best location", and the step sizes actually taken are used for model learning. As new step sizes are generated, the search particle is expected to move to locations with better objective function values. In the proposed algorithm, the search particle only makes a move when a location with a better function value is found.

However, if all the points generated in the current searching range have no better function values than the current best value, the search particle does not move, and the GRG repeats generating groups of particle locations for several trials. If no better location is found after M trials, we suspect that the current searching range is too small or the current step size is too large, which makes us miss the locations with better function values. In such a case, we should enlarge the size of the searching area and reduce the step size, as in (16),

\[ \sigma_{p_i} = \alpha\,\sigma_{p_i}, \qquad dp_i = \varepsilon\,dp_i \quad (i = 1, 2, \ldots, N), \tag{16} \]

where α > 1 and ε < 1. If this new search is still not successful, the searching range and the step size continue changing until some points with better function values are found. If, at a certain step of the search process, the current step size has been reduced so much that the search particle cannot move anywhere, this indicates that the optimum point has been reached. The stopping criterion can be defined by the current step size being β times smaller than the previous step size, as

\[ dp < \beta\,dp_{prev} \quad (0 < \beta < 1,\ \beta\ \text{is usually small}). \tag{17} \]

2.5 Optimization algorithm

Based on the previous discussion, the proposed optimization algorithm (RLO) can be described as follows.
(a). The procedure starts from a random point of the objective function with N variables, $V = f(p_1, p_2, \ldots, p_N)$. It will try to make a series of moves to get closer to the global optimum point.
(b). To move from the current location, the step size $dp_i$ and the standard deviation $\sigma_{p_i}$ ($i = 1, 2, \ldots, N$) for each coordinate are generated from a uniform probability distribution.
(c). The step sizes $dp_i$ determine the center of the searching area. The deviations of all the coordinates $\sigma_{p_i}$ determine the size of the searching area. Several points $P_S$ in this range are randomly chosen from a Gaussian distribution using $dp_i$ as mean values and $\sigma_{p_i}$ as standard deviations.


(d). The objective function values are evaluated at these new points and compared with the value at the current location.
(e). If the new points generated in step (c) have no better values than the current position, step (c) is repeated for up to M trials, until a point with a better function value is found.
(f). If the search fails after M trials, the size of the searching area is enlarged and the step size is reduced, as in (16).
(g). If the search with the searching area size and step sizes updated in step (f) is not successful, the range and the step size keep being adjusted until either some points with better values are found, the current step sizes become much smaller than the previous step sizes as in (17), or the function value changes by less than a pre-specified threshold. If either of the latter two conditions occurs, the algorithm terminates; this indicates that the optimum point has been reached.
(h). Move the search particle to the point $p(p^1, p^2)$ with the best function value $V_b$ (a local minimum or maximum, depending on the optimization objective). The distance between the previous best point $p_{prev}(p^1_{prev}, p^2_{prev})$ and the current best point $p(p^1, p^2)$ gives the actual step size $dp_i$ ($i = 1, 2, \ldots, N$). Collect the historical information of the step sizes taken during the search process.
(i). Approximate the step sizes as a function of the iterative steps using the weighted least-squares fit, as in (5). The proper maximum order of the basis functions is determined using the SNRF described in section 2.2 to avoid overfitting.
(j). Use the modeled function to determine the step sizes $dp_i$ for the next iteration step. The difference between the approximated step sizes and the actual step sizes gives the approximation deviation $\sigma_{p_i}$ ($i = 1, 2, \ldots, N$). Repeat steps (c) to (j).

In general, the optimization algorithm based on reinforcement learning builds a model of successful moves for a given objective function. The model is built from historical successful actions, and it is used to determine new actions. The algorithm combines exploitation and exploration of the search using random generators. The optimization algorithm does not require any prior knowledge of the objective function or its derivatives, nor does it place any special requirements on the objective function. The search operator is conceptually very simple and intuitive. In the following section, the algorithm is verified using several experiments.
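To make the flow of steps (a)-(j) concrete, here is a condensed sketch of one way the loop could be organized. It reuses owaa_fit from the sketch in section 2.2; the start-region bounds, the retry limit M, and all default parameter values are our assumptions rather than the authors' settings:

```python
import numpy as np

def rlo_minimize(f, n_dims, bounds=(-10.0, 10.0), n_points=10, M=5,
                 alpha=1.1, eps=0.9, beta=0.005, max_iters=500, rng=None):
    """Condensed sketch of RLO steps (a)-(j); not the authors' code."""
    rng = np.random.default_rng() if rng is None else rng
    p = rng.uniform(bounds[0], bounds[1], n_dims)  # (a) random start point
    v = f(p)
    dp = rng.uniform(-1.0, 1.0, n_dims)            # (b) initial step sizes
    sigma = rng.uniform(0.5, 1.5, n_dims)          # (b) initial deviations
    history = []                                   # actual steps taken
    dp_prev_norm = np.linalg.norm(dp)

    for _ in range(max_iters):
        moved = False
        for _ in range(M):                         # (c)-(e): up to M trials
            cands = p + rng.normal(dp, sigma, (n_points, n_dims))
            vals = np.array([f(c) for c in cands])
            if vals.min() < v:                     # (h) move to best point
                best = cands[vals.argmin()]
                history.append(best - p)
                dp_prev_norm = np.linalg.norm(best - p)
                p, v = best, vals.min()
                moved = True
                break
        if not moved:
            sigma = alpha * sigma                  # (f) enlarge search area
            dp = eps * dp                          # (f) reduce the step size
            if np.linalg.norm(dp) < beta * dp_prev_norm:
                break                              # (g) stop, as in (17)
            continue
        if len(history) >= 3:                      # (i)-(j) refit the model
            H = np.array(history)
            t_next = len(history) + 1
            ts = np.arange(1, len(history) + 1)
            for i in range(n_dims):
                c = owaa_fit(H[:, i])
                dp[i] = sum(a * t_next**j for j, a in enumerate(c))
                e = H[:, i] - np.vander(ts, len(c), increasing=True) @ c
                sigma[i] = max(e.std(), 1e-9)      # approximation dev., (14)
    return p, v
```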


3 Simulation and discussion

3.1 Finding global minimum of a multi-variable function

3.1.1 A synthetic bivariate function

A synthetic bivariate function

\[ V(p_1, p_2) = p_2^2 \sin(1.5 p_2) + 2 p_1^2 \sin(2 p_1) + p_1 \sin(2 p_2), \]

used previously in the example in section 2.2, is used as the objective function. This function has several local minima and one global minimum equal to -112.2586. The optimization algorithm starts at a random point and performs the search process looking for the optimum point (the minimum in this example). The number of random points $P_S$ generated in the searching area in each step is 10. The scaling factors α and ε in (16) are 1.1 and 0.9, respectively. The β in (17) is 0.005.
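As a usage illustration, a run with the parameters quoted above could look as follows, assuming the rlo_minimize sketch from section 2.5 is in scope (the sketch, not the authors' implementation):

```python
import numpy as np

def V(p):
    """The synthetic bivariate objective of section 3.1.1."""
    p1, p2 = p
    return (p2**2 * np.sin(1.5 * p2) + 2 * p1**2 * np.sin(2 * p1)
            + p1 * np.sin(2 * p2))

# Ps = 10 random points per searching area, alpha = 1.1, eps = 0.9, beta = 0.005
p_opt, v_opt = rlo_minimize(V, n_dims=2, n_points=10,
                            alpha=1.1, eps=0.9, beta=0.005)
print(p_opt, v_opt)   # ideally close to the global minimum of -112.2586
```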

One possible search path found by the RLO algorithm, from the start location to the final optimum location, is shown in Fig. 7. The global optimum is found in 13 iterative steps. The historical locations are shown in the figure as well. The historical step sizes taken during the search process are shown in Fig. 8, together with their WLSF approximation.

Fig. 7 Search path from start point to optimum

Another search process, starting from a different random point, is shown in Fig. 9. The global optimum is found in 10 iterative steps. Table 1 shows the changes in the numerical function values and the adjustment of the step sizes $dp_1$ and $dp_2$ for $p_1$ and $p_2$ in the successive search steps. Notice how the step size was initially reduced, only to be increased again once the algorithm started to follow a correct path towards the optimum.


Fig. 8 Step sizes taken during the search process

Fig. 9 Search path from start point to optimum

Table 1. Function values and step sizes in a searching process

Search step   Function value V(p1, p2)   Step size dp1   Step size dp2
 1              1.4430                     2.9455           0.8606
 2            -34.8100                     0.3570          -1.7924
 3            -61.4957                    -0.0508          -0.7299
 4            -69.8342                    -0.0477          -0.3114
 5            -70.5394                    -0.1232           0.2015
 6            -71.5813                     0.0000           4.4358
 7           -109.0453                    -0.0281           0.3408
 8           -110.8888                     0.0495          -0.0531
 9           -112.0104                     0.0438          -0.0772
10           -112.1666


Such a search process was performed for 300 random trials. The success rate of finding the global optimum is 93.78%. On average, it takes 5.9 steps and 4299 function evaluations to find the optimum in this problem.

The same problems were tested on several other direct-search-based optimization algorithms, including SA [29], PSO [14], and OSSRS [2]. The success rates of finding the global optimum and the average numbers of function evaluations are compared in Tables 2, 3, and 4. All the simulations were performed on an Intel Core Duo 2.2 GHz PC with 2 GB of RAM.

Table 2. Comparison of optimization performances on synthetic function

                                             RLO       SA       PSO      OSSRS
Success rate of finding the global optimum   93.78%    29.08%   94.89%   52.21%
Number of function evaluations               4299      13118    4087     313
CPU time consumption [s]                     28.4      254.35   20.29    1.95

3.1.2 Six-hump camel back function

The classic 2D six-hump camel back function [5] has 6 local minima and 2 global minima. The function is given as

\[ V(p_1, p_2) = \left(4 - 2.1 p_1^2 + \frac{p_1^4}{3}\right) p_1^2 + p_1 p_2 + \left(-4 + 4 p_2^2\right) p_2^2 \quad (p_1 \in [-3, 3],\ p_2 \in [-2, 2]). \]

Within the specified bounded region, the function has 2 global minima equal to -1.0316. The optimization performances of the algorithms over 300 random trials are compared in Table 3.

Table 3. Comparison of optimization performances on the six-hump camel back function

                                             RLO       SA       PSO      OSSRS
Success rate of finding the global optimum   80.33%    45.22%   86.44%   42.67%
Number of function evaluations               5016      8045.5   3971     256
CPU time consumption [s]                     33.60     151.86   20.35    1.63

3.1.3 Banana function

Rosenbrock's famous "banana function" [23],

\[ V(p_1, p_2) = 100\,(p_2 - p_1^2)^2 + (1 - p_1)^2, \]

has one global minimum, equal to 0, lying inside a narrow, curved valley. The optimization performances of the algorithms over 300 random trials are compared in Table 4.


Table 4. Comparison of optimization performances on banana function

                                             RLO       SA       PSO      OSSRS
Success rate of finding the global optimum   74.55%    3.33%    41%      88.89%
Number of function evaluations               48883.7   28412    4168     882.4
CPU time consumption [s]                     320.74    539.38   20.27    5.15

In these optimization problems, RLO demonstrates consistently satisfactory performance without particular tuning of its parameters, whereas the other methods show different levels of efficiency and capability in handling the various problems.
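For reference, the two remaining benchmarks transcribe directly into code (the synthetic function appears in the usage snippet of section 3.1.1); a sketch:

```python
import numpy as np

def camel_back(p):
    """Six-hump camel back function of section 3.1.2 (p1 in [-3,3], p2 in [-2,2])."""
    p1, p2 = p
    return ((4 - 2.1 * p1**2 + p1**4 / 3) * p1**2
            + p1 * p2 + (-4 + 4 * p2**2) * p2**2)

def banana(p):
    """Rosenbrock's banana function of section 3.1.3; global minimum 0 at (1, 1)."""
    p1, p2 = p
    return 100 * (p2 - p1**2)**2 + (1 - p1)**2
```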

3.2 Optimization of weights in multi-layer perceptron training

The output of a multi-layer perceptron (MLP) can be viewed as the value of a function with the weights as its variables. Training the MLP, in the sense of finding optimal values of the weights to accomplish the learning task, can therefore be treated as an optimization problem. We take the Iris plant database [22] as a test case. The Iris database contains 3 classes, 5 numerical features, and 150 samples. In order to classify the iris samples, a 3-layered MLP with an input layer, a hidden layer, and an output layer can be used. The size of the input layer should be equal to the number of features. The size of the hidden layer is chosen to be 6, and since the class IDs are numerical values equal to 1, 2, and 3, the size of the output layer is 1. The weight matrix between the input layer and the hidden layer contains 30 elements, and the one between the hidden layer and the output layer contains 6 elements. Overall, there are 36 weight elements (parameters) to be optimized. In a typical trial, the optimization algorithm finds the optimal set of weights after only 3 iterations. In the testing stage, the outputs of the MLP are rounded to the nearest integers to indicate the predicted class IDs. Comparing the given class IDs and the class IDs predicted by the MLP in Fig. 10, 146 out of 150 iris samples are correctly classified by this set of weights, and the percentage of correct classification is 97.3%. A single support vector machine (SVM) achieved a 96.73% classification rate [12]. In addition, an MLP with the same structure, trained by back-propagation (BP), achieved 96% on the Iris test case. The MLP and BP were implemented using the MATLAB neural network toolbox.
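A sketch of how the 36 weights can be packed into a single optimization variable and scored; the tanh hidden activation and the helper names are our assumptions, since the chapter does not specify the activation function:

```python
import numpy as np

def mlp_objective(weights, X, y, n_in=5, n_hidden=6):
    """Classification error of a 5-6-1 MLP whose 36 weights come as one
    flat vector (30 input-to-hidden + 6 hidden-to-output); to be
    minimized, e.g. by rlo_minimize. tanh is an assumed activation."""
    W1 = weights[:n_in * n_hidden].reshape(n_in, n_hidden)
    W2 = weights[n_in * n_hidden:].reshape(n_hidden, 1)
    out = np.tanh(X @ W1) @ W2                    # single numeric output
    pred = np.clip(np.rint(out.ravel()), 1, 3)    # round to class IDs 1..3
    return np.mean(pred != y)                     # fraction misclassified

# e.g.: w_best, err = rlo_minimize(lambda w: mlp_objective(w, X, y), n_dims=36)
```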

3.3 Micro-saccade optimization in active vision for machine intelligence

In the area of machine intelligence, active vision has become an interesting topic. Instead of taking in the whole scene captured by the camera and making sense of all the information, as in the conventional computer vision approach, an active vision agent


Fig. 10 RLO performance on neural network training on Iris problem

focuses on small parts of the scene and moves its fixation frequently. Humans and other animals use such quick movements of both eyes, called saccades [3], to focus on the interesting parts of the scene and use their resources efficiently. The interesting parts are usually important features of the input, and once the important features are extracted, the high-resolution scene is analyzed and recognized with a relatively small number of samples.

In the saccade movement network (SMN) presented in [16], the original images are transformed into a set of low-resolution images after saccade movements and retina sampling. This set of images, as the sampled features, is fed to a self-organizing winner-take-all classifier (SOWTAC) network for recognition. To find interesting features of the input image and to direct the saccade movements, image segmentation, edge detection, and basic morphology tools [4] are utilized.

Fig. 11 (a) shows a face image from [7] with 320×240 pixels. The interesting features found are shown in Fig. 11 (b). The stars represent the centers of the four interesting features found on the face image, and the rectangles represent the feature boundaries. The retina sampling model [16] then places its fovea at the center of each interesting feature, so that these features are extracted.

In practice, the centers of the interesting features found by the image processing tools [4] are not guaranteed to be the accurate centers, which affects the accuracy of the feature extraction and pattern recognition process. In order to help find the optimum sampling position, the RLO algorithm can be used to direct the movement of the fovea of the retina and find the closest match between the obtained sample features and pre-stored reference sample features. These slight moves during fixation to find the optimum sampling positions can be called microsaccades in the active vision process, although the actual role of microsaccades has been an unresolved topic of debate for several decades [19].


Fig. 11 Face image and its interesting features in active vision [16]

Fig. 12 Image sampling by micro-saccade

Fig. 12 (a) shows a group of ideal samples of important features in face recognition. Fig. 12 (b) shows the group of sampled features at the initial sampling positions. In the optimization process, the x-y coordinates need to be optimized so that the sampled images have the optimum similarity to the ideal images. The level of similarity can be measured by the sum of squared intensity differences [9]; in this metric, increased similarity gives a decreased intensity difference. Such a problem can also be perceived as an image registration problem. The two-variable objective function V(x, y), the sum of squared intensity differences, needs to be minimized by the RLO algorithm. Note that the only information available is that V is a


function of the x and y coordinates. How the function would be expressed, and what its characteristics are, is totally unknown; the minimum value of the objective function is not known either. RLO is therefore a suitable algorithm for such an optimization problem. Fig. 12 (c) shows the optimized sampled images using RLO-directed microsaccades. The optimized feature samples are closer to the ideal feature samples, which helps the processing of the face image.
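A sketch of this objective: the sum of squared intensity differences between the patch sampled at fovea position (x, y) and a stored reference. Plain rectangular cropping stands in here for the retina sampling model of [16], and all names are ours:

```python
import numpy as np

def ssd_objective(xy, image, reference):
    """V(x, y): sum of squared intensity differences [9] between the
    patch extracted at (x, y) and the reference feature sample."""
    h, w = reference.shape
    # Clamp the fovea position so the patch stays inside the image
    x = int(np.clip(round(xy[0]), 0, image.shape[1] - w))
    y = int(np.clip(round(xy[1]), 0, image.shape[0] - h))
    patch = image[y:y + h, x:x + w].astype(float)
    return float(np.sum((patch - reference.astype(float))**2))

# e.g.: xy_opt, v_opt = rlo_minimize(lambda q: ssd_objective(q, img, ref), n_dims=2)
```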

After the featured images are obtained through the RLO-directed microsaccades, these low-resolution images, instead of the entire high-resolution face image, are sent to the SOWTAC network for further processing or recognition.

4 Conclusions

In this chapter, a novel and efficient optimization algorithm was presented for problems in which the objective functions are unknown. The search particle is able to build a model of successful actions and choose its future actions based on its past exploring experience. The decisions on the step sizes (and directions) are made based on a trade-off between exploitation of the known search path and exploration for an improved search direction. In this sense, this algorithm falls into the category of reinforcement learning based optimization (RLO) methods. The algorithm does not require any prior knowledge of the objective function, nor does it require any characteristics of such a function. It is conceptually very simple and intuitive, as well as very easy to implement and tune.

The optimization algorithm was tested and verified using several multi-variable functions and compared with several other widely used random search optimization algorithms. Furthermore, the training of a multi-layer perceptron (MLP), based on finding a set of optimized weights to accomplish the learning, was treated as an optimization problem, and the proposed RLO was used to find the weights of the MLP in the training problem on the Iris database. Finally, the algorithm was used in an image recognition process to find a familiar object with retina sampling and micro-saccades.

The performance of RLO depends to a certain degree on the values of several parameters that the algorithm uses. With certain preset parameters, the performance of RLO meets our requirements in several machine learning problems involved in our current research. In future research, a theoretical and systematic analysis of the effect of these parameters will be conducted. In addition, using a group of search particles and their cooperation and competition, a population-based RLO can be developed. With the help of the model approximation techniques and the trade-off between exploration and exploitation proposed in this work, the population-based RLO is expected to have even better performance.


References

1. G. Arfken, "Lagrange Multipliers," §17.6 in Mathematical Methods for Physicists, 3rd ed. Orlando, FL: Academic Press, pp. 945-950, 1985.

2. S. Belur, A random search method for the optimization of a function of n variables, MATLAB Central File Exchange. [Online] Available: http://www.mathworks.com/matlabcentral/fileexchange/loadFile.do?objectId=100.

3. B. Cassin, S. Solomon, Dictionary of Eye Terminology. Gainesville, FL: Triad Publishing Company, 1990.

4. Detecting a Cell Using Image Segmentation. Image Processing Toolbox, The MathWorks. [Online] Available: http://www.mathworks.com/products/image/demos.html.

5. L. C. W. Dixon, G. P. Szego, "The optimization problem: An introduction," Towards Global Optimization II, New York: North Holland, 1978.

6. J. A. Nelder, R. Mead, "A simplex method for function minimization," The Computer Journal, vol. 7, pp. 308-313, 1965.

7. FaceGen Modeller. Singular Inversions. [Online] Available: http://www.facegen.com/products.htm.

8. X. del Toro Garcia, F. Neri, G. L. Cascella, N. Salvatore, "A surrogate assisted Hooke-Jeeves algorithm to optimize the control system of a PMSM drive," IEEE ISIE, July 2006, pp. 347-352.

9. D. L. G. Hill, P. Batchelor, "Registration methodology: concepts and algorithms," Medical Image Registration, J. V. Hajnal, D. L. G. Hill, and D. J. Hawkes, Eds. Boca Raton, FL: CRC, 2001.

10. R. Hooke, T. A. Jeeves, "Direct search solution of numerical and statistical problems," Journal of the Association for Computing Machinery, vol. 8, pp. 212-229, 1961.

11. J. Kennedy, R. C. Eberhart, "Particle swarm optimization," in Proc. IEEE Int. Conf. Neural Networks, vol. 4, pp. 1942-1948, Perth, Australia, Dec. 1995.

12. H. Kim, S. Pang, H. Je, "Support vector machine ensemble with bagging," Proc. of 1st Int. Workshop on Pattern Recognition with Support Vector Machines, SVM'2002, Niagara Falls, Canada, August 2002.

13. S. Kirkpatrick, C. D. Gelatt, Jr., M. P. Vecchi, "Optimization by simulated annealing," Science, vol. 220, no. 4598, pp. 671-680, 1983.

14. A. Leontitsis, Hybrid Particle Swarm Optimization, MATLAB Central File Exchange. [Online] Available: http://www.mathworks.com/matlabcentral/fileexchange/loadFile.do?objectId=6497.

15. R. M. Lewis, V. Torczon, M. W. Trosset, "Direct search methods: Then and now," Journal of Computational and Applied Mathematics, vol. 124, no. 1, pp. 191-207, 2000.

16. Y. Li, Active Vision through Invariant Representations and Saccade Movements, Master's thesis, School of Electrical Engineering and Computer Science, Ohio University, 2006.

17. Y. Liu, J. A. Starzyk, Z. Zhu, "Optimized approximation algorithm in neural networks without overfitting," IEEE Trans. on Neural Networks, vol. 19, no. 4, June 2008, pp. 983-995.

18. I. J. Lustig, R. E. Marsten, D. F. Shanno, "Computational experience with a primal-dual interior point method for linear programming," Linear Algebra and its Applications, vol. 152, pp. 191-222, 1991.

19. S. Martinez-Conde, S. L. Macknik, D. H. Hubel, "The role of fixational eye movements in visual perception," Nature Reviews Neuroscience, vol. 5, no. 3, pp. 229-240, 2004.

20. Yew-Soon Ong, "Max-min surrogate-assisted evolutionary algorithm for robust design," IEEE Trans. on Evolutionary Computation, vol. 10, no. 4, August 2006, pp. 392-404.

21. M. J. D. Powell, "An efficient method for finding the minimum of a function of several variables without calculating derivatives," The Computer Journal, vol. 7, pp. 155-162, 1964.

22. R. A. Fisher, Iris Plants Database, July 1988. [Online] Available: http://faculty.cs.byu.edu/~cgc/Teaching/CS 478/iris.arff.

23. H. H. Rosenbrock, "An automatic method for finding the greatest or least value of a function," The Computer Journal, vol. 3, pp. 175-184, 1960.

24. B. V. Sheela, "An optimized step-size random search," Computer Methods in Applied Mechanics and Engineering, vol. 19, no. 1, pp. 99-106, 1979.

25. J. A. Snyman, Practical Mathematical Optimization: An Introduction to Basic Optimization Theory and Classical and New Gradient-Based Algorithms. Springer Publishing, 2005.

26. J. A. Starzyk, "Motivation in Embodied Intelligence," in Frontiers in Robotics, Automation and Control, I-Tech Education and Publishing, Oct. 2008, pp. 83-110. [Online] Available: http://www.intechweb.org/book.php?%20id=78&content=subject&sid=11.

27. R. S. Sutton, A. G. Barto, Reinforcement Learning: An Introduction, MIT Press, Cambridge, MA, 1998.

28. V. Torczon, "On the convergence of pattern search algorithms," SIAM Journal on Optimization, vol. 7, no. 1, pp. 1-25, 1997.

29. J. Vandekerckhove, General simulated annealing algorithm, MATLAB Central File Exchange. [Online] Available: http://www.mathworks.com/matlabcentral/fileexchange/loadFile.do?objectId=10548.

30. T. J. Ypma, "Historical development of the Newton-Raphson method," SIAM Review, vol. 37, no. 4, pp. 531-551, 1995.