
CIENTIFICA Investigación

Volume 7, number 1, January–July 2013, ISSN 1870–8196

Solving Artificial Neural Networks (Perceptrons)

Using Genetic Algorithms

Luis Copertari

Computer Engineering Program

Autonomous University of Zacatecas

Santiago Esparza, Aldonso Becerra

Gustavo Zepeda

Software Engineering Program

Autonomous University of Zacatecas

[email protected]


Introduction

Neural networks

Artificial Neural Networks are fundamentally a development of the twentieth century. Originally, there was some interdisciplinary work on physics, psychology and neurophysiology carried out by Hermann von Helmholtz, Ernst Mach and Ivan Pavlov. Such work did not involve mathematical modeling of any kind and emphasized general theories of learning, vision and conditioning, among others. McCulloch and Pitts (1943) were responsible during the 1940s for the first work on artificial neural networks. They showed that artificial neurons can, in principle, compute any logical or arithmetic function. Then, Hebb (1949) proposed a mechanism for learning derived from biological neurons based on Pavlov's classical conditioning.

Rosenblatt (1958) invented during the 1950s the first practical applications of artificial neural networks, the perceptron and its learning rule, showing that it can perform character pattern recognition. Approximately during the same time, Widrow and Hoff (1960) introduced the linear neural network and the corresponding learning algorithm, both being similar to the perceptron in structure and capacity. Unfortunately, both algorithms had the same limitations, shown by Minsky and Papert (1969): they can solve only a very limited number of problems. Precisely due to the discovery of such limitations, the field of artificial neural networks suffered and was abandoned by researchers for approximately a decade. Also, the lack of new ideas and computational power made research during those years difficult.

Some work, however, continued. During the 1970s, Kohonen (1972) and Anderson (1972) independently developed new neural networks that could act as memories. Grossberg (1976) was also active during those years developing self-organizing networks. During the 1980s, the power of personal computers and workstations began to grow rapidly. Additionally, there were two conceptual developments responsible for the re–emergence of artificial neural networks. The first idea, presented by Hopfield (1982), was the use of statistical mechanics to explain the operation of a certain kind of recurrent networks that could be used as associative memories. The second idea, independently discovered by several researchers, was the backpropagation algorithm, which allowed for the first time the training of multilayer perceptron networks, overcoming in this way the computational limitations discovered by Minsky and Papert in 1969. Rumelhart and McClelland (1986) were responsible for the publication of the most influential version of the backpropagation algorithm. These two developments, combined with the accessibility of computational power, reinvigorated the field of neural networks during the last two or three decades. There have been innumerable papers on new theoretical developments and applications.

Genetic algorithms

Nature uses powerful means to propel the satisfactory evolution of organisms. Those organisms that are not particularly adapted to a given environment die, while those that are well adapted live and reproduce. The children are like their parents, so that each new generation has organisms like the well-adapted members of the previous generation. If the modifications of the environment are minor, species will evolve gradually with it; however, it is likely that a sudden change in the environment will cause the disappearance of entire species. Sometimes, random mutations occur, and even though they often imply the sudden death of the mutated individual, some of these mutations result in new and satisfactory species. The publication of Darwin's masterpiece, «On the Origin of Species by Means of Natural Selection», represented a breakthrough in the history of science.

Genetic algorithms were the result of the work of Friedberg (1958), who tried to produce learning from the mutation of small Fortran programs. Since most mutations done to the programs produced inoperative code, little advance could be achieved. John Holland (1975) renewed this field using representations of agents as strings of bits, in such a way that any possible string represented an operational agent. John Koza (1992) has achieved impressive results using complex representations of agents along with mutation and reproduction techniques where special attention is given to the syntax of the representation language.

Methodology

The first step is to do research on the topic, especially considering the different solution approaches. It is important to highlight the differences and theorize about the advantages and disadvantages of each approach. Secondly, the working hypothesis is stated, which will guide the research efforts. In light of the working hypothesis, a series of experiments is proposed, some of them numerical, others involving hardware and software. Then, highlighting the results, the observed issues are considered to explain what is happening. Finally, conclusions are detailed. Such an approach is the fundamental approach of the scientific method, discussed by Gauch Jr. (2003) and Wilson (1990), which is detailed as follows:

Observation. The current knowledge on perceptrons (a kind of artificial neural network) and on genetic algorithms is discussed.

Hypothesis. A hypothesis considering what has been observed is stated, in this case that it is possible to solve neural networks (perceptrons) using genetic algorithms.

Experiments. Numerical experiments were designed using the computer to see what results are obtained. Experiments with two approaches are carried out: first, reproducing the weight matrix horizontally; and second, reproducing the weight matrix vertically.

Theory. The results are discussed and explanations detailing such results are given within a coherent contextual framework.

Conclusion. Specific conclusions are reached considering which approach works best and if a mixed approach is the ideal situation.

The problem

The input vector

The problem is how to train a perceptron using genetic algorithms. A series of digitalized numbers from zero to nine (0 to 9) is used to build and test the program. Such digitalized numbers are shown in figure 1. Notice that the numbers are drawn in a grid of 6 rows by 5 columns. These numbers are transformed into a vector of 6 × 5 = 30 rows called p. The procedure is simple. First, take the first row. If a square is black (full), a 1 corresponds to such position. If a square is white (empty), a -1 corresponds to such position. Then, take the second row, and so on until the final (sixth) row is reached. For example, for the number zero (0), p = [-1 1 1 1 -1 1 -1 -1 -1 1 1 -1 -1 -1 1 1 -1 -1 -1 1 1 -1 -1 -1 1 -1 1 1 1 -1]T, where the superscript T indicates the transpose.

Figure 1. Digitalized numbers from zero to nine (0 to 9).

This vector (p) constitutes the input vector.
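As an illustration, the grid-to-vector encoding just described can be sketched in a few lines of Python (the sketch and its names are ours, for illustration only; the programs described later are written in Delphi):

```python
import numpy as np

def grid_to_input_vector(grid_rows):
    """Convert a 6x5 digit grid (strings of '1' for black and '0' for white)
    into the 30-element input vector p, read row by row, with +1 for black
    squares and -1 for white squares."""
    bits = [int(ch) for row in grid_rows for ch in row]
    return np.array([1 if bit == 1 else -1 for bit in bits])

# The digit zero from figure 1, written row by row.
zero_grid = ["01110",
             "10001",
             "10001",
             "10001",
             "10001",
             "01110"]
p = grid_to_input_vector(zero_grid)
print(p)  # [-1  1  1  1 -1  1 -1 -1 -1  1 ...], matching the example above
```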

The output vector

The output vector is the number represented in grid form, except that such number is transformed into a representation of four binary digits. For example, number zero (0) would be 0000, number one would be 0001, number two would be 0010, number three would be 0011, number four would be 0100, number five would be 0101, number six would be 0110, number seven would be 0111, number eight would be 1000 and number nine would be 1001. Generally speaking, the target number is 1·a + 2·b + 4·c + 8·d, as indicated in figure 2, where a, b, c and d are binary digits (either zero or one).
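The binary decomposition of the target number can likewise be sketched as follows (a hypothetical helper, not taken from the original program):

```python
def number_to_target(n):
    """Decompose a digit 0-9 into the four binary outputs (a, b, c, d),
    so that n = 1*a + 2*b + 4*c + 8*d."""
    a = (n >> 0) & 1
    b = (n >> 1) & 1
    c = (n >> 2) & 1
    d = (n >> 3) & 1
    return [a, b, c, d]

print(number_to_target(7))  # [1, 1, 1, 0], i.e. 0111 written as d c b a
```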


Figure 2. Target number in binary form.

This target number constitutes the output layer.

The weight matrix

The weight matrix connects the input vector with the output vector. Figure 3 shows the network structure in which all the weights (wij) are shown.

Figure 3. Network structure.

We can see we have a weight matrix (W) of 4 rows and 30 columns. The input vector is a column vector of 30 inputs (thus, the multiplication Wp yields a column vector of four elements). The bias vector corresponds to noise in the network and is equivalent to having an additional input with a value of p equal to one. The transfer function (f) transforms the operation Wp + b into binary form: in this case, if the result is greater than or equal to zero, the transfer function returns a one; otherwise it returns a zero. Notice from figure 3 that a1 corresponds to a, a2 corresponds to b, a3 corresponds to c and a4 corresponds to d in figure 2. The network equation is shown in equation (1).

a = f (Wp+b) (1)
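A minimal Python sketch of equation (1), assuming the hard-limit transfer function described above (returning one when the argument is greater than or equal to zero and zero otherwise):

```python
import numpy as np

def hardlim(x):
    """Transfer function f: 1 where the argument is >= 0, 0 otherwise."""
    return (x >= 0).astype(int)

def forward(W, p, b):
    """Network equation (1): a = f(Wp + b).
    W is 4x30, p is the 30-element input vector, b is the 4-element bias."""
    return hardlim(W @ p + b)

# Illustration with random weights only; a trained W and b come later.
rng = np.random.default_rng(0)
W = rng.uniform(-1, 1, size=(4, 30))
b = rng.uniform(-1, 1, size=4)
p = rng.choice([-1, 1], size=30)
print(forward(W, p, b))  # e.g. [1 0 1 0] -> the output bits a1..a4
```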

The input file

To train the network it is required to have a data set. In this research, a data set composed of n input vectors p is generated by using a random noise factor. For each data point (a 1 for a black square and a -1 for a white square), a random number is drawn, and if such random number (r, where 0 ≤ r < 1) is less than or equal to a given value R, then the data point is flipped. In the case of the input file, for convenience, empty squares are represented with a zero rather than a -1. Figure 4 shows a sample text file (*.snu) in which the target number is zero, the noise percentage is five and the number of different data entries is five. Notice that the first row in the sample file is the number zero itself in grid form, the second row is the target number corresponding to such grid number, the third row is the number of different entries (5 in this case) and the fourth row is the noise percentage (5 per cent in this case).

011101000110001100011000101110

0

5

5

011101000110000100011000100110

110101000110001101011000101010

111111000110001100011000101110

011101000100001100011000101110

011101000100001100101000101110

Figure 4. Sample file.

The fifth to ninth rows are the data set. The numbers in grid form corresponding to such data set are shown in figure 5. Let n be the number of different data entries.

Figure 5. Data set corresponding to the sample file.
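The generation of the n noisy samples can be sketched as follows; flip_probability plays the role of the noise value R described above, and all names are illustrative rather than taken from the original program:

```python
import numpy as np

def make_noisy_samples(clean_bits, n, flip_probability, seed=None):
    """Generate n noisy copies of a clean 30-bit grid (list of 0/1 values).
    For each position a random number r in [0, 1) is drawn and the bit is
    flipped whenever r <= flip_probability."""
    rng = np.random.default_rng(seed)
    clean = np.array(clean_bits)
    samples = []
    for _ in range(n):
        flips = rng.random(clean.size) <= flip_probability
        samples.append(np.where(flips, 1 - clean, clean))
    return samples

# The clean digit zero from the sample file, with a 5 per cent noise level.
zero_bits = [int(ch) for ch in "011101000110001100011000101110"]
for sample in make_noisy_samples(zero_bits, n=5, flip_probability=0.05, seed=1):
    print("".join(str(bit) for bit in sample))
```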

The output file

A population of 1024 individuals is generated (although the population size is one of the variables).


Each individual contains one occurrence of the weight matrix (W) and the bias vector (b). Such matrix and vector are initialized using small random values (between -1 and 1). Once all the calculations are completed, an output text file is generated containing, first, the number of the best individual; second, the average, for such individual, of the squared difference between the value given by the individual and the target number over all data samples; third, the weight matrix obtained; and finally, the bias vector.

The solution

The procedure used in genetic algorithms is as follows: 1) coding, 2) mutation, 3) evaluation, 4) reproduction, 5) decoding. Steps two to four are iterated until at least a fraction g of the population coincides in having the same result. Also, let w be the mutation rate used in the algorithm. Notice that equation (2) must hold at all times during the calculations, because even if 100 per cent of the individuals have the same solution, a mutation rate of w×100% leaves only (1 - w)×100% of individuals with the same solution, and if such value is not greater than g×100%, the algorithm would never stop nor converge to a solution, as illustrated in equation (3). The symbol << means sufficiently smaller than.

w + g << 1 (2)
g << 1 - w (3)

To solve the problem, the five steps must be followed. Coding is easy, since it only requires creating a population of m different W and b. This is done by assigning random values between -1 and 1 to each wij (for i = 1,…,4 and j = 1,…,30) of matrix W and each bi (for i = 1,…,4) of vector b.
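Step one (coding) can be sketched as creating m candidate solutions, each holding one W and one b initialized with random values between -1 and 1 (Python sketch, names illustrative):

```python
import numpy as np

def initial_population(m, rows=4, cols=30, seed=None):
    """Create m individuals, each a (W, b) pair with entries drawn
    uniformly from [-1, 1]."""
    rng = np.random.default_rng(seed)
    return [(rng.uniform(-1, 1, size=(rows, cols)),
             rng.uniform(-1, 1, size=rows))
            for _ in range(m)]

population = initial_population(m=1024, seed=0)
```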

Step two simply requires deciding whether or not any given value wij must change. This is done by drawing a random number r, and if such number is less than a given threshold value R, wij is changed by choosing another random number between -1 and 1.
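Step two (mutation) can then be sketched as visiting each weight and, when the drawn random number falls below the threshold, replacing the weight with a fresh random value in [-1, 1]:

```python
import numpy as np

def mutate(W, b, threshold, rng):
    """For each weight, draw r in [0, 1); if r < threshold, replace the
    entry with a new random value between -1 and 1. Mutating the bias
    vector as well is an assumption; the text only mentions wij."""
    W = W.copy()
    b = b.copy()
    W_mask = rng.random(W.shape) < threshold
    b_mask = rng.random(b.shape) < threshold
    W[W_mask] = rng.uniform(-1, 1, size=int(W_mask.sum()))
    b[b_mask] = rng.uniform(-1, 1, size=int(b_mask.sum()))
    return W, b
```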

Step three is carried out by doing the operation from equation (1), taking such value, subtracting the target value T and squaring such result, as shown in equation (4), which gives the error vector (e).

e = (a – T)² (4)
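Step three (evaluation) applies equation (1) to every sample and compares the result with the target. Since the averaged errors reported later can exceed four, the sketch below reads equation (4) as comparing the decoded output number with the target number; this interpretation is ours, not stated explicitly in the text:

```python
import numpy as np

def hardlim(x):
    return (x >= 0).astype(int)

def averaged_error(W, b, samples, target_number):
    """Average squared error over all data samples. The output bits
    a = [a1, a2, a3, a4] are decoded back into 1*a1 + 2*a2 + 4*a3 + 8*a4
    before comparison with the target number (one reading of equation (4))."""
    errors = []
    for p in samples:
        a = hardlim(W @ p + b)                       # equation (1)
        value = int(a @ np.array([1, 2, 4, 8]))      # decode the binary output
        errors.append((value - target_number) ** 2)  # equation (4)
    return float(np.mean(errors))
```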

For step four, there are two ways to proceed: the horizontal approach and the vertical approach. In the horizontal approach, a random breaking point between zero and four is obtained and two random individuals from the top x×100% percentile of the population are chosen for reproduction. All rows from one up to the breaking point are taken from the first individual and all rows from the breaking point plus one up to four are taken from the second individual. Clearly, if the number obtained is four, the first individual is reproduced as it is, whereas if the number obtained is zero, the second individual is reproduced as it is. In the vertical approach, the same is done, except that now the columns between 1 and 30 are used and the breaking point is a number between zero and thirty.
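Step four (reproduction) can be sketched with one function covering both policies: the horizontal split draws a breaking point over the four rows, the vertical split over the thirty columns. How the bias vector is recombined is not stated in the text, so taking it from the first parent is an assumption:

```python
import numpy as np

def crossover(parent1, parent2, mode, rng):
    """Combine two (W, b) individuals chosen from the top percentile.
    mode='horizontal': breaking point k in 0..4; rows 1..k come from
    parent1 and the remaining rows from parent2.
    mode='vertical':   breaking point k in 0..30; columns 1..k come from
    parent1 and the remaining columns from parent2."""
    W1, b1 = parent1
    W2, b2 = parent2
    W = W2.copy()
    if mode == "horizontal":
        k = rng.integers(0, 5)      # 0, 1, 2, 3 or 4
        W[:k, :] = W1[:k, :]        # k = 4 reproduces parent1, k = 0 parent2
    else:
        k = rng.integers(0, 31)     # 0..30
        W[:, :k] = W1[:, :k]
    return W, b1.copy()             # bias taken from parent1 (assumption)
```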

The final step is simply taking the best values of W and b, calculating and averaging the error over the data sample, and generating an output file with the results.
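Putting the five steps together, and reusing the helper sketches above (initial_population, mutate, averaged_error and crossover), a compact training loop might look as follows. The stopping rule, finishing once at least g×100% of the individuals share the best averaged error, is a simplified reading of the convergence criterion; the actual programs are written in Delphi, so this Python loop is illustrative only:

```python
import numpy as np

def train(samples, target_number, m=1024, w=0.05, g=0.70, top=0.25,
          mode="horizontal", seed=0, max_generations=1000):
    """Illustrative GA loop: coding, mutation, evaluation, reproduction
    and decoding, iterated until the population converges."""
    rng = np.random.default_rng(seed)
    population = initial_population(m, seed=seed)                    # step 1
    for _ in range(max_generations):
        population = [mutate(W, b, w, rng) for W, b in population]   # step 2
        errors = np.array([averaged_error(W, b, samples, target_number)
                           for W, b in population])                  # step 3
        if np.mean(errors == errors.min()) >= g:   # enough individuals agree
            break
        order = np.argsort(errors)
        elite = [population[i] for i in order[:max(2, int(top * m))]]
        population = [crossover(elite[rng.integers(len(elite))],     # step 4
                                elite[rng.integers(len(elite))],
                                mode, rng)
                      for _ in range(m)]
    else:
        errors = np.array([averaged_error(W, b, samples, target_number)
                           for W, b in population])
    best = population[int(np.argmin(errors))]                        # step 5
    return best, float(errors.min())
```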

The computer interface

The problem is solved using Delphi. The computer interface for the sample file is shown in figure 6.

Notice the space to draw the number in grid form and the space for choosing squares as a means of specifying the target number in binary form, where an X means a 1 and no X means a 0. The input file is shown below the target number in binary form. To the right, there is the number of entries (n) in the sample file, that is, the number of different input vectors p. Also, there is the noise percentage used to alter the original number in grid form to generate the n samples of different input vectors. The first button is used to create the entry or sample file and the second button is used to load a sample file previously created.

Below that is the mutation rate used during the genetic algorithm process. Also, there is the percentile used to choose the best individuals in the population. For example, 25 per cent means that only the best quarter of the individuals in the population will be reproduced. Finally, there is the generational percentage (g), which is the percentage of individuals with the same solution that must be reached to finish the algorithm. The next button (Solve for Entire Population and Samples) triggers the algorithm and trains the network for all samples until g×100% of the individuals are the same. The last button (Evaluate and Find Best Weight Matrix) simply applies equation (1) to the solution, generates the output file (*.out) and tells the user the averaged error and the number of the individual in the population with such error.

Results

Modifying the noise percentage for the input file does not make much sense, since a high noise percentage results in images that are completely altered from the number being represented, thus becoming useless as a training data set. Having a noise percentage of 5 per cent is reasonable, as can be seen in figure 5. A noise percentage lower than that would result in numbers in grid form that are all pretty much the same, whereas a noise percentage higher than that would result in patterns that have no resemblance to the original number.

Also, playing with the mutation rate (w) and the generational rate (g) is pointless, since the values used here are reasonable according to a study on genetic algorithms for the traveling salesman problem carried out by Copertari (2006). A value of g = 0.70 (70 per cent) and a value of w = 0.05 (5 per cent) are usually good and lead to very good solutions, if not the optimal one. Also, taking the top twenty-fifth percentile for reproduction is a reasonable reproduction policy. The only parameters left to play with are the sample size and the population size.

Figure 6. Computer interface for the sample file.

Sample sizes of 16, 256 and 65,535 were used. Also, population sizes of 10, 50, 100, 500 and 1,000 were used. The number seven was used in grid and binary form.

Horizontal scanning

First, consider the results for the reproduction policy of horizontal split. Table 1 shows the resulting averaged errors for different sample sizes and different population sizes.

The sample size is represented by the different lines. The population size is on the X axis. The resulting averaged error is on the Y axis. Notice that the lower the sample size, the lower the averaged error. This is reasonable, since the fewer the data values whose errors need to be accommodated and averaged, the lower the resulting average will be. Also notice that as the population size increases, the averaged error decreases, which means that larger population sizes are better when it comes to obtaining a better solution.

The data from Table 1 are plotted in Figure 7.

Figure 7. Sample size versus population size chart for horizontal split.

Vertical scanning

The reproduction policy of vertical split is different, but it seems to lead to similar results. Table 2 shows the sample size versus population size table for the vertical split strategy.

Table 1. Sample size versus population size table for horizontal split.

Sample size    Population size
               10      50      100     500     1,000
16             6.69    5.75    3.13    2.69    1.00
256            5.84    5.82    5.59    3.95    3.24
65,535         6.97    6.63    4.50

Table 2. Sample size versus population size table for vertical split.

Sample size    Population size
               10      50      100     500     1,000
16             5.50    3.44    3.69    3.25    2.63
256            6.73    3.47    3.55    3.23    3.55
65,535         5.00    6.81    5.95


Figure 8. Sample size versus population size chart for vertical split.

These results can be plotted. Figure 8 shows the results. Notice that this time the trend is not as clear as before. However, generally speaking, larger population sizes still result in lower averaged errors. Also notice that, with some minor exceptions, smaller sample data files also result in lower averaged errors.

Discussion and conclusion

We can see from the experimental results obtained using our two Delphi programs (SANNUGA_H and SANNUGA_V for horizontal and vertical split, respectively) that larger population sizes tend to lead to better solutions, that is, the weight matrix and bias vector provide a number that is closer to the target number in binary form. Also, fitting larger data sets naturally leads to higher averaged error rates, due to the fact that more data has to be considered and such data contains more noise.

However, the algorithm is limited by the fact that the weight matrix is only a matrix of four by thirty elements (plus the bias vector), which makes it difficult to learn a lot given the limited number of «synapses» available. An architecture with a hidden layer of several neurons would work much better, but training such a network using genetic algorithms would not be the proper solution to the problem, since it would take much longer to solve. The backpropagation algorithm would have to be used instead.

References

Anderson, J.A. 1972. «A simple neural network generating an interactive memory», Mathematical Biosciences, Vol. 14, 197-220.

Copertari, Luis. 2006. «Resolviendo el Problema del Vendedor Ambulante con Algoritmos Genéticos», Revista de Investigación Científica, Vol. 2, No. 3.

Friedberg, R.M. 1958. «A learning machine: Part I», IBM Journal, Vol. 2, 2-13.

Gauch Jr., Hugo G. 2003. Scientific Method in Practice, Cambridge, England: Cambridge University Press.

Grossberg, S. 1976. «Adaptive pattern classification and universal recoding: I. Parallel development and coding of neural feature detectors», Biological Cybernetics, Vol. 23, 121-134.

Hebb, D.O. 1949. The Organization of Behavior, New York, USA: Wiley.

Holland, J.H. 1975. Adaptation in Natural and Artificial Systems, University of Michigan Press.

Hopfield, J.J. 1982. «Neural networks and physical systems with emergent collective computational abilities», Proceedings of the National Academy of Sciences, Vol. 79, 2554-2558.

Kohonen, T. 1972. «Correlation matrix memories», IEEE Transactions on Computers, Vol. 21, 353-359.

Koza, J.R. 1992. Genetic Programming: On the Programming of Computers by Means of Natural Selection, Cambridge, MA: MIT Press.

McCulloch, Warren & Walter Pitts. 1943. «A logical calculus of the ideas immanent in nervous activity», Bulletin of Mathematical Biophysics, Vol. 5, 115-133.

Minsky, M. & S. Papert. 1969. Perceptrons, Cambridge, MA: MIT Press.

Rosenblatt, F. 1958. «The perceptron: a probabilistic model for information storage and organization in the brain», Psychological Review, Vol. 65, 386-408.

Rumelhart, D.E. & J.L. McClelland (editors). 1986. Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1, Cambridge, MA: MIT Press.

Widrow, B. & M.E. Hoff. 1960. «Adaptive switching circuits», 1960 IRE WESCON Convention Record, New York: IRE, Part 4, 96-104.

Wilson, Edgar Bright. 1990. An Introduction to Scientific Research, New York, USA: Dover Publications, Inc.