Search for Maximal Snake-in-the-Box Using New Genetic Algorithm

Kim-Hang Ruiz International MIS

1065 Waltons Pass Evans, GA 30809

[email protected]

ABSTRACT The Snake-In-The-Box (SIB) problem is a challenging combinatorial search problem to find the longest constrained open path (k-spread snake) in n-dimensional hypercube (Qn). In addition to constructive techniques, many search algorithms such as Depth First Search (DFS), Genetic Algorithm (GA), hybrid Evolutionary Computation algorithm (EC), and Nested Monte-Carlo Search (NMCS) have been used to tackle this problem. To get better results and to speed up the process, these techniques often used a long snake as the starting point for the search (priming/seeding).

This paper reviews the hypercube fundamentals, then presents a new search technique, the Mitosis Genetic Algorithm (MGA), which was applied to search for snakes of four different spreads (spread 2 to 5) in seven different dimensional hypercubes (Q6 to Q13). The MGA found three new record-breaking 3-spread snakes in Q10, Q11 and Q13, all the previously known optimal snakes from spread 2 to spread 5, and the best previously known maximal 3-S9 snake of length 63. Notably, it found these within minutes to hours without priming, significantly less than the days to weeks needed by the other techniques.

Categories and Subject Descriptors I.2.8 [Artificial Intelligence]: Problem Solving, Control Methods, and Search: backtracking, graph and tree search strategies, heuristic methods, plan execution, formation, and generation.

General Terms Algorithms, Theory

Keywords Genetic Algorithm, Mitosis Genetic Algorithm, Heuristics, Snake, Hypercube

1. INTRODUCTION The n-dimensional hypercube Qn is a graph whose vertices consist of all binary strings bn-1 ... b1 b0, where each vertex connects to n adjacent vertices. Two vertices are adjacent if and only if their binary strings differ in exactly one bit. By definition, a k-spread snake is an induced path in an undirected graph G such that every pair of vertices at distance at least k in the path are also at distance at least k in G. The Snake-in-the-Box (SIB) problem, which asks for the longest k-spread snake in Qn (k-Sn), has challenged researchers for the past fifty years.
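The definitions above can be illustrated with a minimal sketch. It assumes vertices are stored as integers whose bits are the hypercube coordinates; the function names are hypothetical and are not taken from the MGA implementation.

```python
def hamming(u: int, v: int) -> int:
    """Number of bit positions in which the labels of u and v differ."""
    return bin(u ^ v).count("1")

def neighbors(v: int, n: int) -> list:
    """The n vertices of Qn adjacent to v (flip each bit in turn)."""
    return [v ^ (1 << i) for i in range(n)]

def is_k_spread_snake(nodes, k: int) -> bool:
    """Literal check of the definition: an induced path in which every pair
    of vertices at path distance >= k is also at hypercube distance >= k."""
    for i in range(len(nodes)):
        for j in range(i + 1, len(nodes)):
            d = hamming(nodes[i], nodes[j])
            if j - i == 1 and d != 1:
                return False          # consecutive nodes must be adjacent
            if j - i > 1 and d < 2:
                return False          # chord or repeated vertex
            if j - i >= k and d < k:
                return False          # spread constraint violated
    return True
```

For instance, the path 0, 1, 3, 7, 15 (transition sequence 0123) passes the check for k = 3, whereas 0, 1, 3, 2 fails even for k = 2 because its end nodes are adjacent.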

The properties of the SIB have been applied in error-detecting codes, in analog-to-digital encoders and converters [5, 7], modulation schemes in multi-level flash memories [16], efficient resource distributions in high speed computer clusters [11], and in identifying gene regulation networks in embryonic development [17]. The accuracy of error detection in certain analog-to-digital conversion systems increases as the length of the snake increases; the greater the spread, the greater is the error-detection capability. Accordingly, there has been much interest in finding the longest maximal k-Sn.

When a snake cannot be extended at either end, it is maximal. The number of maximal snakes varies with the hypercube dimension and the spread. The longest maximal k-Sn is the optimal. Mathematicians have suggested several constructive techniques [1, 2, 12, 15] to build the longest maximal k-Sn. Other scholars have applied either heuristic search such as GA [2, 13], EC hybrids [3], and NMCS [8], or exhaustive search such as DFS [6, 10] to find the optimal or the longest maximal k-Sn. These search techniques can find the optimal in the small dimensional hypercubes Qn, but not in the large ones. Most record-breaking k-Sn (for large Qn) reported in the literature have not been proven to be optimal. The GA holds the record for the maximal 2-S8 of length 98 [2] and the NMCS for 2-S10, 2-S11 and 2-S12 of lengths 370, 695 and 1274 respectively [8]. Only constructive techniques and DFS have been used to build or to search for snakes with spreads higher than 2. To get better results and to speed up the search process, a longer snake is often used as the starting point for the search (priming) in these techniques.

This paper reviews the hypercube fundamentals in section 3 and then introduces a new search technique, the Mitosis Genetic Algorithm (MGA), in section 4. The MGA implementation and the experimental setup used to search for the longest maximal k-Sn snakes are described in section 5. The search results for the four different spread snakes (spread 2 to 5) in seven different dimensional hypercubes (Q6 to Q13) are discussed in section 6 and section 7. The MGA found three new record-breaking 3-S10, 3-S11 and 3-S13 of lengths 125, 158, and 509, longer than the best previously known values of 103, 157 and 493 respectively [6]. It also found all the previously known optimal snakes from spread 2 to spread 5, and the known longest 3-S9 snake of length 63, without priming. Notably, the searches for these optimal and longest maximal k-Sn snakes took minutes or hours, significantly less than the days to weeks required by the other techniques. A list of transition sequences of the three record-breaking snakes is provided in an appendix (section 9).

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]. GECCO '14, July 12-16, 2014, Vancouver, BC, Canada. Copyright is held by the owner/author(s). Publication rights licensed to ACM. ACM 978-1-4503-2662-9/14/07 $15.00. http://dx.doi.org/10.1145/2576768.2598296

2. TERMINOLOGY AND DEFINITIONS The following standard abbreviations and terms are used in this paper.

2.1 Terminology
k: spread.
k-Sn: n-dimensional k-spread snake, e.g., 3-S8 denotes an 8-dimensional 3-spread snake. In the literature Sn is used for 2-Sn.
n: number of dimensions.
Qn: n-dimensional hypercube, e.g., Q9 denotes the 9-dimensional hypercube.
O(n): search space.
OGA(t): search time for each Genetic Algorithm in the MGA.
O(t): search time for the Mitosis Genetic Algorithm.
Ok(t): search time for k-spread snakes.
Ok-1(t): search time for (k-1)-spread snakes.

2.2 Definitions Available nodes are nodes that have distances at least equal to spread k from any snake node except (k-1) nodes from the tail (or the head).

Available transition is the transition that leads to an available node.

Canonical form of a snake is a type that requires each transition number to first appear in a transition sequence only after all smaller transition numbers have appeared. For example, the transition sequence 01203123 is in canonical form but the transition sequence 10321032 is not.
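The canonical-form condition can be checked in a single pass over the transition sequence. The following is a minimal sketch with a hypothetical function name; transitions are assumed to be given as integers.

```python
def is_canonical(transitions) -> bool:
    """True if each transition number first appears only after all
    smaller transition numbers have appeared."""
    next_new = 0                      # smallest number not yet seen
    for t in transitions:
        if t > next_new:
            return False              # t skipped over an unseen smaller number
        if t == next_new:
            next_new += 1
    return True

print(is_canonical([0, 1, 2, 0, 3, 1, 2, 3]))  # sequence 01203123
print(is_canonical([1, 0, 3, 2, 1, 0, 3, 2]))  # sequence 10321032
```

The two calls correspond to the two sequences in the definition and print True and False respectively.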

Chromosomes represent possible solutions to the genetic algorithm problem at hand.

Free neighbors are neighbors that are available nodes.

Fitness is the evaluation of a chromosome as a solution to the given genetic algorithm problem.

An induced path in an undirected graph G is a sequence of nodes P in which every pair of non-consecutive vertices in P is non-adjacent in G.

A maximal snake is a snake that is not a subsequence of any other snake, i.e., one that cannot be extended at either end without violating the non-chord constraint.

The n-dimensional k-spread snake k-Sn is an open induced (chord free) path in an undirected graph G such that every pair of vertices at distance at least k in the path are also at distance at least k in G.

A node sequence of a k-Sn snake is the sequence of n-code-words, each a binary string bn-1 ... b1 b0, where two words are adjacent if and only if their binary strings differ in exactly one bit.

The optimal snake is the longest maximal snake proven.

Population is a collection of chromosomes in a genetic algorithm problem.

Spread k is the distance between two nodes that are at least k units apart.

A transition sequence of a k-Sn snake is the sequence of integers ranging from 0 to (n-1) that correspond to the bit positions (indices) in which successive n-code-words differ.
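The relation between the two representations can be sketched as follows, assuming nodes are stored as integers and the snake starts at node 0 unless stated otherwise (both are assumptions made for illustration; the function names are hypothetical).

```python
def nodes_from_transitions(transitions, start=0):
    """Map a transition sequence to its node sequence by flipping one bit per step."""
    nodes = [start]
    for t in transitions:
        nodes.append(nodes[-1] ^ (1 << t))
    return nodes

def transitions_from_nodes(nodes):
    """Inverse mapping: the bit position in which successive code words differ."""
    return [(a ^ b).bit_length() - 1 for a, b in zip(nodes, nodes[1:])]

print(nodes_from_transitions([0, 1, 2, 3]))  # [0, 1, 3, 7, 15]
```

Because every step flips exactly one bit, any sequence of integers in the range 0 to (n-1) yields a walk whose successive nodes are adjacent, which is why the transition sequence is convenient as a chromosome.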

The Vicinity Vk consists of all nodes in Qn that have a distance k from the snake path.

Unavailable nodes are nodes that have distances less than k units from any snake node (in the vicinity V1 or Vk-1).

Unavailable transition is the transition that leads to an unavailable node.

2.3 Font Style Italic font represents a variable name in programming or in calculating the search space and the search time of the algorithm.

Bold underlined italic font represents definitions.

3. HYPERCUBE FUNDAMENTALS One of the most important properties of the hypercube is that Qn always has twice as many vertices as Qn-1. Figure 1 shows how the sole node in Q0 duplicates itself by translating in coordinate 0 (transition 0) to form a line in Q1, which continues to duplicate in transitions 1, 2, and 3 to form Q2, Q3, and Q4 respectively.

Each vertex in Qn connects to n other vertices. If there is no induced path in the graph, every vertex has an equal chance to be selected as a snake node, owing to the isomorphic symmetries of the hypercube. If, however, there is already an induced path (snake) in the graph, only some nodes can be selected to extend the path. To be selected, a node must be available and adjacent to an end node (tail or head). For example, an induced path 3-S6 in Q6 with transition sequence 0123 is illustrated with the thick black line in Figure 2. Red circle vertices represent unavailable nodes in vicinity V1 that are one distance away from snake nodes. Green triangle vertices are unavailable nodes in vicinity V2. Black square vertices are available nodes that can be added to the path. There are 26 available nodes in Figure 2, but only 3 of them (marked e0, e4, and e5) are adjacent to the tail and can extend the path. Thus the next transition for the induced path (snake) can only be 0, 4 or 5.
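The Figure 2 example can be reproduced with a short sketch. It assumes that the snake starts at node 0, that nodes are stored as integers, and that the (k-1) nodes exempted from the distance test are the last (k-1) nodes of the path, tail included; the function names are illustrative only.

```python
def hamming(u, v):
    return bin(u ^ v).count("1")

def available_nodes(path, n, k):
    """Nodes at distance >= k from every snake node except the last (k-1)."""
    body = path[:max(0, len(path) - (k - 1))]   # snake nodes that are tested
    on_path = set(path)
    return [v for v in range(1 << n)
            if v not in on_path and all(hamming(v, s) >= k for s in body)]

def next_transitions(path, n, k):
    """Transitions from the tail that lead to an available node."""
    free = set(available_nodes(path, n, k))
    return [t for t in range(n) if path[-1] ^ (1 << t) in free]

path = [0, 1, 3, 7, 15]                 # transition sequence 0123 from node 0
print(len(available_nodes(path, 6, 3)), next_transitions(path, 6, 3))
```

Under these assumptions the sketch reports 26 available nodes and the transitions 0, 4 and 5, in agreement with the counts quoted for Figure 2.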

The left half of the graph can be considered as Q5 (Qn-1) where the most significant bit of all nodes is equal to 0. The right half of the graph is also considered Q5 (Qn-1) with the most significant bit equal to 1. Some of the unavailable nodes can be connected to form paths that replicate part of the snake path. Notice that:

1) The path of unavailable nodes in vicinity V1 (red lines connecting red circles) on the right half of the figure replicates the snake path on the left half minus 1 node.

2) The paths of unavailable nodes in vicinity V2 (green lines connecting green triangles) on the right half of the figure replicate the path of unavailable nodes in vicinity V1 on the left half of the figure minus 1 node.

3) There are no unavailable nodes on the right half figure corresponding to the unavailable nodes in vicinity V2 on the left half. This is because the nodes on the right half are in vicinity V3 which are available nodes according to the definition of 3-Sn.

4) There is one available node in the right half figure corresponding to each available node in the left half figure.

The first two observations indicate that a tight k-Sn-1 snake with fewer unavailable nodes in vicinity V1 is more likely to produce a longer k-Sn snake. The third observation indicates that the number of unavailable nodes in vicinity Vk-1 can be used to gauge the ability of a k-Sn-1 snake to extend into a longer k-Sn snake. If two k-Sn-1 snakes have the same length but different numbers of unavailable nodes in vicinity Vk-1, the snake with the larger number has the higher ability to extend into a longer snake (since nodes in Vk-1 will become available nodes in Qn). The fourth observation indicates that there are 2 available nodes in Qn for every available node in Qn-1. These properties are incorporated into the fitness function of the MGA.

The replication of the snake path and unavailable paths in the left half of the hypercube (Qn-1) into the right half, in preparation for extending k-Sn snakes in Qn, is analogous to the natural mitosis process in biology where DNA replicates to prepare for cell duplication. A cell with good DNA grows into many more cells with good DNA. Imitating this natural process, the Mitosis Genetic Algorithm (MGA) was developed with the assumption that a long k-Sn-1 can extend into a long k-Sn.

4. MITOSIS GENETIC ALGORITHM Mitosis Genetic Algorithm is a heuristic search technique applying a series of Genetic Algorithms to many parts of the problem at hand to find the optimal or near optimal solutions for each part which will then be combined or extended to ultimately find the optimal solution to the problem. Genetic Algorithms are heuristic search methods modeled from the theory of natural evolution. The process in GA involves generating a population of chromosomes (possible solutions to the problem) which will then be modified by the genetic operators, namely, selection, crossover, and mutation, to create a new generation. This evolutionary process is repeatedly carried out a predetermined number of times for each GA.

4.1 Encoding Schemes Depending on the nature of the given problem, many encoding techniques can be used. Normally, the chromosomes in the population can be represented as strings of binary digits, integers, real numbers, or ordered list. The bit string representations are more commonly used because they are simple to create and easy to manipulate. It is recommended that the representations in all GAs are the same to ease the later process of combining or extending partial solutions into the solution of the problem.

4.2 Objective Function The Objective Function is a formula used to evaluate the fitness of a chromosome as a solution to the problem at hand. The formula can be mathematical or non-mathematical depending on the problem. Choosing an appropriate and effective objective function in each GA is crucial for the MGA to find a solution to any given problem efficiently.

4.3 Selection Based on the fitness evaluated by the objective function, individual chromosomes with a higher value have a better chance of being selected as a parent to produce offspring for the next generation. There are many selection methods; the two most common are weighted Roulette-Wheel and Tournament. In Roulette-Wheel selection, the probability of a member being selected is proportional to its fitness. In Tournament selection, four or six members of the current population are randomly selected, and the fittest one becomes a parent (Tournament-4 or Tournament-6). The two selected parent chromosomes may or may not be returned to the current population. If they are, the selection is with replacement; otherwise, it is selection without replacement. In selection with replacement, the fitness of the parent is unchanged after being selected. In selection without replacement, the fitness of the parent is set to 0 to eliminate the chance of being selected again. It is also recommended that the selection types in all GAs be the same.
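Tournament selection as described above might be sketched as follows. The function name and signature are hypothetical, and zeroing the winner's fitness is a direct reading of the "without replacement" rule in the text rather than a confirmed detail of the MGA code.

```python
import random

def tournament_select(fitness, size=4, with_replacement=True, rng=random):
    """Return the index of the fittest of `size` randomly drawn members.

    fitness: list of fitness values, one per population member.
    size:    4 for Tournament-4, 6 for Tournament-6.
    """
    contenders = rng.sample(range(len(fitness)), size)
    winner = max(contenders, key=lambda i: fitness[i])
    if not with_replacement:
        fitness[winner] = 0   # suppress the chance of being selected again
    return winner
```

A pair of parents is obtained by calling the function twice on the same fitness list.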

4.4 Crossover Each time two parents are selected, their chromosomes are interchanged to generate new chromosomes that are different but retain some of their characteristics (crossover operator). A number from 0 to 1 is randomly generated, and the crossover operator is carried out only if the generated number is smaller than the predetermined crossover probability. There are two basic crossover techniques that can be utilized by the MGA: one-point crossover and two-point crossover. Depending on the type of crossover, one or two numbers between 1 and the chromosome length are randomly selected as the crossover point or points. In one-point crossover, the parts of the two parent chromosomes beyond the crossover point are swapped, rendering two children. In two-point crossover, the parts of the chromosomes between the two crossover points are swapped to create two children. The two children chromosomes are then copied to the new population. If the crossover operation is not carried out, the two parent chromosomes are copied instead. The type of crossover can be different for each GA. Using two-point crossover in the MGA tends to find an efficient solution to the problem more often than using one-point crossover.

The predetermined crossover probability can be different in each GA, but it is often set to be the same.
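The two crossover variants can be sketched on list-valued chromosomes as follows. The function names are hypothetical, and drawing cut points strictly inside the chromosome is an assumption made so that both children differ from their parents.

```python
import random

def one_point(p1, p2, cut):
    """Swap the parts of the parents beyond the crossover point."""
    return p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]

def two_point(p1, p2, a, b):
    """Swap the parts of the parents between the two crossover points."""
    return p1[:a] + p2[a:b] + p1[b:], p2[:a] + p1[a:b] + p2[b:]

def crossover(p1, p2, p_cross, two_points=True, rng=random):
    """Apply crossover with probability p_cross, otherwise copy the parents."""
    if rng.random() >= p_cross:
        return p1[:], p2[:]
    if two_points:
        a, b = sorted(rng.sample(range(1, len(p1)), 2))
        return two_point(p1, p2, a, b)
    return one_point(p1, p2, rng.randrange(1, len(p1)))
```

For example, one_point([0, 1, 2, 3], [4, 5, 6, 7], 2) yields the children [0, 1, 6, 7] and [4, 5, 2, 3].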

4.5 Mutation A common single point mutation is used in MGA. It involves a probability that an arbitrary gene in the chromosomes of the new population will be changed from its original state. Each gene in the chromosomes of the new population is visited by the algorithm, and a number from 0 to 1 is randomly generated. If the generated number is smaller than the predetermined mutation probability, the gene will be replaced with a different gene. The predetermined mutation probability can be different in each GA; but it also is often set to be the same.
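A minimal sketch of this mutation step on a transition-sequence chromosome is given below. It assumes genes are integers in the range 0 to (n-1) with n at least 2; the function name is hypothetical.

```python
import random

def mutate(chromosome, n, p_mut, rng=random):
    """Visit every gene; with probability p_mut replace it by a different gene."""
    out = list(chromosome)
    for i in range(len(out)):
        if rng.random() < p_mut:
            out[i] = rng.choice([t for t in range(n) if t != out[i]])
    return out
```

With the small probabilities used in this study (0.0001 to 0.02), most chromosomes pass through unchanged.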

4.6 Repeating Time Since the random function is called many times in GA procedures, each run is likely to produce a different solution. To increase the reliability of producing the same result and the chance of obtaining the optimal solution, the MGA procedure was set to repeat a predetermined number of runs. The best solution found in each run is stored in a variable local-best. The best of all the local-best values is the solution. For example, if the repeating time is set to 5 and the local-best values are 46, 47, 48, 47, and 49, then the solution is 49. Thus the higher the repeating time, the higher the reliability, but the longer the search time.
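The repeating-time logic amounts to keeping the maximum over independent runs. In the sketch below, run_once is a hypothetical stand-in for one complete MGA run that returns its best snake length.

```python
def best_of_runs(run_once, repeats):
    """Repeat the search `repeats` times and keep the best local-best value."""
    local_best = [run_once() for _ in range(repeats)]
    return max(local_best), local_best

# Replaying the example from the text with canned results instead of real runs.
results = iter([46, 47, 48, 47, 49])
solution, history = best_of_runs(lambda: next(results), 5)
print(solution)  # 49
```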

A hypercube Qn consists of many lower dimensional hypercubes, thus finding the optimal k-spread snake in Qn can be viewed as finding many optimal k-spread snakes in lower dimensional hypercubes Qn-2 or Qn-1, which are then combined or extended to obtain the optimal k-spread snake in Qn. Thus the MGA is well suited to tackle the SIB problem.

5. MGA IMPLEMENTATION Figure 3 illustrates the MGA procedure in searching for the longest maximal k-Sn. First, an initial population of transition sequences of k-Sn-1 is generated. The GA is applied to it to find a population of near optimal k-Sn-1, which is then replicated and extended to form an initial population of transition sequences of k-Sn. The GA is applied again to this new initial population to find the longest maximal k-Sn, the solution to the problem. The first initial population can also come from a lower dimensional hypercube other than Qn-1. For example, a series of GAs can be applied to an initial population of transition sequences of k-Sn-2 instead. In this case, the replications and extensions must be repeated twice and the GA must be applied three times to evolve the populations of transition sequences of k-Sn-2, k-Sn-1 and k-Sn.

5.1 Representation To avoid adjacency violations during the crossover and mutation operators, the transition sequence is used as the representation for each individual chromosome in the population. Each transition sequence is stored in a one-dimensional array whose length is at least 5 units longer than the best previously known value. For example, the best previously known values for 3-S8 and 4-S8 snakes are 35 and 19, thus the chromosome lengths for these snakes are set at 40 and 24 respectively. When a new record-breaking k-Sn is found, the number of available nodes is determined. If there are m available nodes left, the array length for this k-Sn is reset to the length of the longest maximal snake found plus m. For example, the array length for 3-S10 is initially set at 108. When a maximal snake with length 108 is found, there are 17 available nodes left. Thus the array length is reset to 125 and the MGA is rerun with the same settings.

5.2 Initial Population By definition a k-spread snake is an induced path in an undirected graph G such that every pair of vertices at distance at least k in the path are also at distance at least k in G. This indicates that if C is the transition sequence for a k-Sn snake with length L > 2k, then all the numbers in every block of (k + 1) consecutive numbers of C are distinct (Douglas Remark [4]). For example, the four transitions of the 3-S6 snake in Figure 2 are 0123, so the integer assigned to the fifth gene can only be 0, 4, or 5. If the integer 0 is chosen for the fifth gene, then the integer assigned to the sixth must be either 4 or 5. Let us assume that the transition sequence 012304 is formed. It can be seen that every four consecutive numbers in the transition sequence are distinct. Thus the following procedure is used to generate each chromosome in an initial population of snakes in canonical form.

1) Assign integers from 0 to k to the first (k + 1) genes.

2) For the next gene, randomly assign an integer that is different from the k previous genes.

3) Repeat step 2 until all genes are assigned.
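The three-step procedure above can be sketched as follows. The function name is hypothetical, genes are assumed to be integers in the range 0 to (n-1), and n is assumed to be larger than k so that step 2 always has at least one candidate.

```python
import random

def random_chromosome(length, n, k, rng=random):
    """Generate one transition sequence in which every block of (k + 1)
    consecutive genes is distinct."""
    genes = list(range(k + 1))[:length]          # step 1: 0, 1, ..., k
    while len(genes) < length:                   # step 3: repeat step 2
        recent = set(genes[-k:])                 # the k previous genes
        genes.append(rng.choice([t for t in range(n) if t not in recent]))
    return genes

population = [random_chromosome(40, 8, 3) for _ in range(500)]
```

The last line builds a hypothetical initial population for 3-S8 with chromosome length 40 and population size 500, values taken from the settings quoted elsewhere in the paper.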

5.3 GA Operators In this study, both Tournament-4 and Tournament-6 selections, with and without replacement, are used. Both one-point crossover and two-point crossover are utilized. Since the chromosome is represented as a transition sequence, no special work is needed during or after the crossover operators to keep the node adjacency intact. Single point mutation is used.


There are two different procedures to carry out these GA operators. The first procedure uses selection with replacement, and all GA operators are carried out repeatedly until the entire new population is created. The second procedure uses selection without replacement to select members from the current population and copies them directly to the first sixty percent of the new population. The single point mutation is then applied to this part of the new population. After that, the mutated members are taken in pairs, in order from the top down, to be crossed over and copied to the remaining 40 percent of the new population. For example, crossover is carried out on the first and the second members in the new population to create the 61st and 62nd members. Consequently, in the second procedure the fitness function is evaluated both before and after crossover and mutation.

5.4 Fitness Function After a new population is created, a transition sequence of each member is mapped to node sequence before calculating snake length. The function to calculate the snake length is outlined below:

1. Set one-dimensional array Status of length 2^n to zero which represents available nodes.

2. Mark Status of the first node in the chromosome as n which represents snake nodes. Set variable count that represents snake length to 0.

3. Check whether Status of the next node in the chromosome is available. If it is not, return count; otherwise, increase count.

4. Mark Status of the next node as n (represents snake node) and of its neighbors within (k-1) distances to an integer that represents its distance from the snake node, e.g. mark 1 for 1 distance, (k-1) for (k-1) distances.

5. Return to step 3.
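The steps above can be sketched as follows. One detail is an interpretation rather than something stated in the steps: the vicinity of a snake node is marked only once that node lies (k-1) steps behind the tail, so that the last (k-1) nodes do not block the next extension, in line with the definition of available nodes in section 2.2. The sketch also assumes that the snake starts at node 0, that k is smaller than n, and that the first k transitions are distinct, as the initial-population procedure guarantees. All names are hypothetical.

```python
from itertools import combinations

def mark_vicinity(status, node, n, k):
    """Mark nodes within distance (k-1) of `node` with their distance."""
    for d in range(1, k):
        for bits in combinations(range(n), d):
            v = node
            for b in bits:
                v ^= 1 << b
            # keep snake marks (value n) and any smaller distance already stored
            if status[v] == 0 or d < status[v] < n:
                status[v] = d

def snake_length(transitions, n, k, start=0):
    status = [0] * (1 << n)              # step 1: 0 means available
    path = [start]
    status[start] = n                    # step 2: n marks a snake node
    count = 0
    for t in transitions:
        nxt = path[-1] ^ (1 << t)
        if status[nxt] != 0:             # step 3: next node unavailable
            return count
        count += 1
        status[nxt] = n                  # step 4: new snake node
        path.append(nxt)
        if len(path) >= k:               # delayed marking (see lead-in)
            mark_vicinity(status, path[-k], n, k)
    return count

print(snake_length([0, 1, 2, 3, 0, 4], 6, 3))  # 6
print(snake_length([0, 1, 2, 3, 1], 6, 3))     # 4
```

The first call follows the sequence 012304 discussed in section 5.2; the second stops after four transitions because transition 1 leads to an unavailable node.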

The snake length calculation stops when the next transition in the chromosome leads to an unavailable node. Since the aim of this study is to find the longest maximal k-Sn snake, an extension procedure is applied to the snake to determine whether it can be extended and, if so, what the extended length would be. The extension procedure begins by counting the available transitions from the snake tail. If the number of available transitions is zero, the snake is maximal (it cannot be extended), and the length of the maximal k-Sn snake is reported in the fitness function. If the number of available transitions is greater than zero, the unavailable transition is replaced with an available transition. Any one of the four extension procedures listed below can be used to select an available transition:

Extension F: selects the available transition that has the lowest number of free neighbors.

Extension H: selects the available transition that has the highest coordinate.

Extension L: selects the available transition that has the lowest coordinate.

Extension R: randomly selects one available transition from the set of available transitions.

For example, the snake with the transition sequence 0123 in Figure 2 can only be extended with the available transition 0, 4 or 5. The available transition 0 has the lowest index (coordinate) and the transition 5 has the highest index. If the snake is extended with the extension L, the transition 0 will be selected. If the snake is extended with the extension H, the transition 5 will be selected. If the snake is extended with the extension R, either 0, 4 or 5 will be selected (based on a generated random number). The transition 0 leads to the node e0 that has 2 free neighbors n0; transitions 4 and 5 lead to nodes e4 and e5 that have 3 free neighbors n4 and n5 respectively. Thus if the snake is extended with the extension F, the transition 0 will be selected since it has the fewest free neighbors.
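The four extension rules can be sketched on the same example. The sketch assumes that nodes are integers, that the snake starts at node 0, that the last (k-1) nodes of the path are exempt from the distance test, and that free neighbors are counted against the snake before it is extended; ties under Extension F are broken toward the lowest transition, which is also an assumption. All names are hypothetical.

```python
import random

def hamming(u, v):
    return bin(u ^ v).count("1")

def is_available(v, path, k):
    body = path[:max(0, len(path) - (k - 1))]
    return v not in path and all(hamming(v, s) >= k for s in body)

def available_transitions(path, n, k):
    return [t for t in range(n) if is_available(path[-1] ^ (1 << t), path, k)]

def free_neighbors(v, path, n, k):
    return sum(is_available(v ^ (1 << t), path, k) for t in range(n))

def choose_extension(path, n, k, rule, rng=random):
    """Pick the next transition under Extension F, H, L or R (None if maximal)."""
    options = available_transitions(path, n, k)
    if not options:
        return None
    if rule == "L":
        return min(options)
    if rule == "H":
        return max(options)
    if rule == "R":
        return rng.choice(options)
    if rule == "F":
        tail = path[-1]
        return min(options, key=lambda t: free_neighbors(tail ^ (1 << t), path, n, k))
    raise ValueError("rule must be one of F, H, L, R")

path = [0, 1, 3, 7, 15]                          # transition sequence 0123
print([choose_extension(path, 6, 3, r) for r in "FHL"])  # [0, 5, 0]
```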

The variable count increases by an increment each time an available transition is selected. The extension stops when available transition is no longer found and the value of the variable count is reported as length L in the fitness function.

Since the MGA is a series of GA applied to different populations of k-Sn-1 and k-Sn transition sequences in search for the longest maximal k-Sn, the fitness function for each GA will be different. Depending on the population where the GA is applied to, either one of the two fitness functions below is used:

Fitness = length + E (1)

Fitness = length (2)

where E is the least potential number of extended snake nodes in the next-higher-dimensional hypercube. Function (1) is used in GAs to evolve k-Sn-1 populations in lower dimensional hypercubes, which are then extended into the k-Sn population. Thus this function must account for the ability of a k-Sn-1 snake to extend into a k-Sn snake. Each time one more node is added to the snake path in the extension procedure, (n-k)*(k-1) nodes become unavailable. Based on the observations in the hypercube fundamentals above, the total number of available nodes in Qn is equal to the number of unavailable nodes in vicinity Vk-1 plus two times the number of available nodes in Qn-1. The least potential number of extended snake nodes in the next higher dimensional hypercube, E, is roughly equal to the ratio of the total available nodes in Qn to (n-k)*(k-1). Therefore

E = (2*available nodes + unavailable nodes in Vk-1) / [(n-k)*(k-1)]

Function (1) will distinguish between two equal-length k-Sn-1 snakes that can extend into two different length k-Sn snakes by using E.
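Functions (1) and (2) and the estimate E translate directly into code. In this sketch the function names are hypothetical, n and k are used exactly as in the formula for E, and the node counts in the example are made-up values chosen only to show the arithmetic.

```python
def potential_extension(available, unavailable_vk1, n, k):
    """E = (2*available nodes + unavailable nodes in Vk-1) / [(n-k)*(k-1)]."""
    return (2 * available + unavailable_vk1) / ((n - k) * (k - 1))

def fitness_partial(length, available, unavailable_vk1, n, k):
    """Function (1): used while evolving the k-Sn-1 populations."""
    return length + potential_extension(available, unavailable_vk1, n, k)

def fitness_final(length):
    """Function (2): used in the last GA on the k-Sn population."""
    return length

# Two equal-length snakes with different (illustrative) node counts.
print(fitness_partial(20, available=26, unavailable_vk1=12, n=7, k=3))  # 28.0
print(fitness_partial(20, available=26, unavailable_vk1=4, n=7, k=3))   # 27.0
```

The snake with more unavailable nodes in Vk-1 receives the higher fitness, which is the tie-breaking behavior described above.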

Function (2) is used in the last GA to evolve the k-Sn population which contains the solution to SIB problem. Thus Function (2) only utilizes the length of the longest maximal k-Sn.

5.5 Experiment Settings In this study, the population sizes were initially set at 500, 1000, 1500, 2500, and 3000, and the number of generations at 50, 70, 80, 90, 110, 120, and 150. When the length of the k-Sn found was shorter than the best previously known value, the population size was increased to 5000, 10000, 20000, and 30000, and the number of generations to 170, 200, and 300, in an attempt to extend the snake by a few more nodes. This action was based on the assumption that the larger the population and the number of generations, the better the chance of finding the longest maximal snake. The crossover probability was set at 0.6 and 0.8. The mutation probabilities were set at 0.02, 0.01, 0.005, 0.009, and 0.0001. The number of generations was set to 50 in the GAs searching for better k-Sn-1 populations (partial solutions). The repeating time was initially set at 20, 50, 100 and 200. Later, only the repeating time of 20 was used in high Qn (n > 10).


The MGA was written in VB.net (using the Microsoft Visual Studio 2008 package) and run on five processors with 2.0 GHz Pentium Dual-Core CPUs.

6. RESULTS A summary of the results is given in Table 1, where the lengths of the best known k-Sn values for dimension n ≤ 13 and spread k ≤ 5 are listed. The values in parentheses are the best previously known values published in [2, 6, 8, 15]. Recent tests reported at http://ai.uga.edu/sib/sibwiki/doku.php/records found longer 2-S11, 2-S12 and 2-S13 snakes, but these results have not been published yet.

Table 1. Longest known k-Sn snakes. (Best previously known values in parentheses)

Dimension n    Spread 2    Spread 3     Spread 4    Spread 5
6              (26*)       (13*)        (8*)        (7*)
7              (50*)       (21*)        (11*)       (9*)
8              (98)        (35*)        (19*)       (11*)
9              (190)       (63)         (28*)       (19*)
10             (370)       125 (103)    (47*)       (25*)
11             (695)       158 (157)    (68)        (39*)
12             (1274)      (286)        (104)       (56)
13             (2466)      509 (493)    (181)       (79)

* optimal value

The Mitosis Genetic Algorithm found three new record-breaking 3-S10, 3-S11 and 3-S13 of lengths 125, 158 and 509 within 7, 17 and 87 minutes of search time respectively. The best previously known values are 103, 157 and 493, which were found by starting from a longest known coil and backtracking for seven days [2]. It is worth noting that the MGA was able to find these records relatively quickly without seeding. A list of transition sequences of these three record-breaking snakes is provided in the appendix.

The MGA also found all the previously known optimal k-Sn snakes and the best previously known maximal 3-S9 of length 63. Any optimal snake of length shorter than 26 was found within a few seconds and longer ones within a few minutes. DFS took only a few milliseconds to find the former and days to find the latter [2, 8]. The MGA found the optimal 2-S7 of length 50 within 5 minutes, much less than the days needed by the GA in [13]. These results indicate that DFS is best suited to searching for optimal snakes shorter than 27 and the MGA for optimal and longest snakes longer than 26. The MGA found values close to the best previously known maximal 2-S8, 4-S11, 4-S12, 5-S10, 5-S11, 5-S12 within minutes to hours. In an attempt to reach the best previously known values, the search for these snakes was repeated with much larger population sizes and larger numbers of generations. The results showed that the snake lengths did not improve and in some cases were actually shorter. Thus, the assumption that the larger the population size and the number of generations, the better the chance of finding the longest maximal snake is not always true, especially for excessively large population sizes. This might be due to convergence in the selection operator. When the population size is too large, duplicate members are more likely to occur. Duplicates of the fitter members quickly flood the next population, which causes premature convergence, and consequently a snake close to the longest is found instead of the longest. This phenomenon happens more often in selection with replacement than in selection without replacement, and especially in high dimensional Qn.

Even though the MGA found the optimal 2-S7 of length 50 and the maximal 2-S8 of length 97 within 5 minutes, it could not find maximal 2-Sn snakes for n > 8 near the previously known records. It found 2-S9, 2-S10, 2-S11, 2-S12 and 2-S13 of lengths 185, 350, 595, 1033, and 1887, while the previously established records are 190, 370, 695, 1274, and 2466 respectively. This indicates that, to be more effective in the search for these 2-spread snakes, the parameter settings in the GAs, the fitness functions and/or the MGA procedure should be modified.

7. DISCUSSION The extension type, the length of the chromosomes, the number of replications, the population size, the number of generations, and the probabilities of crossover and mutation all affect the MGA's ability to find the longest k-Sn snakes.

7.1 The Effect of the Extension Types The results show that different extension procedures have different effects in finding the longest snakes in different spreads. Extension R (randomly selects an available transition to extend) can find longer 2-S8 snakes than the other extensions can. However, for Qn with n > 8, the extension F (extending with the available transition that has the lowest number of free neighbors) can find longer 2-spread snakes than the others can.

Extension L or H (extending with the available transition that has the lowest or highest index respectively) can find much longer 3-Sn snakes than Extension F or R can. The record-breaking 3-S10 was found using either Extension L or H. The record-breaking 3-S11 and 3-S13 were found using Extension L only. A special repeating pattern (a _ b _ a _ b) occurs in the transition sequences of the 3-S10 and 3-S13 produced by Extension L and H, but does not occur with Extension F and R (where a ≠ b and the underscore _ stands for a transition different from both a and b).

The effect of the extension types on spread-4 is similar to that on spread-2: Extension R finds the longest 4-Sn snakes in low Qn and Extension F in high Qn. For 5-Sn snakes, Extension R tends to find longer snakes than the other extensions in all cases.

In summary, Extension F works best for spread-2 and spread-4, H and L extensions for spread-3, and Extension R best for spread-5 in the search for the longest maximal k-spread snakes in high Qn.
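The four extension rules can be sketched as follows for the spread-2 case. This is an illustrative reading of the rule descriptions above, not the MGA source: the function names are hypothetical, vertices are assumed to be integers whose bits are the hypercube coordinates, and a "free" neighbor is assumed to be a vertex that is neither on the snake nor adjacent to any snake vertex.

```python
import random

def neighbors(v, n):
    """All n vertices of Qn adjacent to vertex v."""
    return [v ^ (1 << b) for b in range(n)]

def available_transitions(path, n):
    """Bit positions that extend a spread-2 snake: the new vertex must not be
    on the snake or adjacent to any snake vertex other than the current head."""
    head, body = path[-1], set(path[:-1])
    avail = []
    for b in range(n):
        cand = head ^ (1 << b)
        if cand in body or any(w in body for w in neighbors(cand, n)):
            continue
        avail.append(b)
    return avail

def free_neighbor_count(path, n, b):
    """Number of free neighbors of the vertex reached by transition b
    (assumed definition: not on the snake and not adjacent to it)."""
    used = set(path)
    cand = path[-1] ^ (1 << b)
    return sum(
        1 for w in neighbors(cand, n)
        if w not in used and not any(x in used for x in neighbors(w, n))
    )

def choose_transition(path, n, extension, rng=random):
    """Pick the next transition with Extension 'R', 'F', 'L' or 'H'."""
    avail = available_transitions(path, n)
    if not avail:
        return None  # the snake cannot be extended further
    if extension == "R":
        return rng.choice(avail)
    if extension == "L":
        return min(avail)
    if extension == "H":
        return max(avail)
    if extension == "F":
        # fewest free neighbors first; ties broken by lowest index (assumption)
        return min(avail, key=lambda b: (free_neighbor_count(path, n, b), b))
    raise ValueError("unknown extension type: " + extension)
```

For the snake with transition sequence 0123 in Q6, this sketch selects transition 0 under Extension L and F and transition 5 under Extension H.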

In most cases, Extension H tends to find shorter k-Sn snakes than the others, except in searching for 3-Sn. To understand why, let's examine the snake with the transition sequence 0123 in Figure 2. If Extension H is applied, transition 5 will be selected and the next extending node will be e5, which has 3 free n5 neighbors. One of these n5 nodes (marked n5 and n4) will subsequently become a snake node in the next extension and the other two nodes will become unavailable. Similarly, if the snake is extended with Extension L or F, transition 0 will be selected and the next extending node will be e0, which has 2 free n0 neighbors. One of these n0 nodes (marked n0 and n4) will subsequently become a snake node and the other will become an unavailable node. At this state, the two 3-Sn snakes have the same length but different numbers of available nodes. The snake extended with H has one fewer available node than the others. If the extension repeats many times, as often happens in high-dimensional Qn, H will end up with far fewer available nodes. Thus, the snake repeatedly extended by H is more likely to be shorter, since the snake's ability to extend depends on the number of available nodes.

7.2 The Effect of the Number of Replications In this experiment, the optimal 2-S6 of length 26 and the optimal 2-S7 of length 50 can only be found when the number of replications is set to 0 and 1 respectively. The maximal 2-S8 of length 97 and 2-S9 of length 185 can only be found when the number of replications is set to 2. This correlation between the number of replications and the length of the snake being searched for is also observed in other spreads. In general, in order to find the optimal or the longest maximal k-Sn snakes, the number of replications should be set based on the length of the snake being searched for. Zero should be set for lengths less than 26, one for lengths between 27 and 65, and greater than one for lengths greater than 65.
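The rule of thumb above can be written as a small helper. The function name is hypothetical, and two details are assumptions: a target length of exactly 26 is mapped to zero replications (because the optimal 2-S6 of length 26 was found with zero), and "greater than one" is represented by its smallest value, 2.

```python
def suggested_replications(target_length):
    """Starting value for the number of replications, following the
    length thresholds reported in Section 7.2 (boundary handling assumed)."""
    if target_length <= 26:
        return 0
    if target_length <= 65:
        return 1
    return 2  # "greater than one"; larger values may be needed

if __name__ == "__main__":
    for length in (26, 50, 97, 185):
        print(length, suggested_replications(length))
```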

7.3 Search Time In general, the search time for k-Sn snakes is longer than the search time for k-Sn-1 snakes because k-Sn snakes are longer and the number of vertices in Qn is twice that in Qn-1. However, the results show that the search time for k-Sn-1 snakes with a large population size is generally longer than the search time for k-Sn snakes with a smaller population size. The results also show that, for the same settings, the search time for k-Sn snakes is more than double the search time for k-Sn-1 snakes, even though the number of vertices in Qn is only twice as large. For example, with a population size of 5000 and 180 generations, the search time [O(t)] for finding 2-S13 is about 15 hours, while the search time for finding 2-S12 is 4 hours, almost four times less.

Let's look at the search space and the search time of the MGA to understand why. The search space is clearly proportional to the population size p, the number of generations g, and the repeating time r. Thus

$O(n) = pgr$

Each time a new generation is built, the selection, crossover and mutation operators in the GA are carried out. Therefore the search time in the GA can be formulated as follows:

$O_{GA}(t) = prg(s + c + m)$

where p is the population size, g is the number of generations, r is the repeating time, s is the search time during the selection operator, c is the search time during the crossover operator, and m is the search time during the mutation operator.

If the selection is Tournament-6, the operator will locate the best fit chromosome out of six randomly selected chromosomes. Thus the search time for the selection is

s = 6p

Accordingly, s = 4p for Tournament-4 selection. During the crossover operator, each gene in the two parent chromosomes is copied to the next generation regardless of the type of crossover operator. Thus the search time for c is proportional to the length of the chromosome l and to the population size p.

c = lp

During mutation, each gene is examined to decide whether or not it is mutated. Therefore the search time for m is also proportional to the length of the chromosome l and to the population size p.

m = lp

Replacing s, c and m in $O_{GA}(t)$ gives

$O_{GA}(t) = prg(6p + lp + lp)$

$O_{GA}(t) = prg(6p + 2lp)$

$O_{GA}(t) = 2p^2 rg(3 + l)$

When l is large, $3 + l \approx l$, thus

$O_{GA}(t) = 2rgp^2 l$

The MGA repeats the GA operations R times (R represents the number of replications). Since the number of generations is usually set to 50 for any dimension less than n, the search time for the MGA is

$O(t) = (50R + g) \cdot 2rp^2 l$ (3)

For any Qn, the maximum length of k-Sn snakes must be less than the depth of the search tree, $2^n/k$. Thus

$O(t) < (50R + g) \cdot 2rp^2 (2^n/k)$

$O(t) < (50R + g) \cdot rp^2 (2^{n+1}/k)$ (4)

Functions (3) and (4) indicate that the MGA search time is linearly proportional to the number of generations, the repeating time and the length of the chromosomes, is proportional to the square of the population size, and grows exponentially with n. Thus, when n is increased by one, the number of vertices doubles, while the search time grows exponentially with n (it does not merely double).

The results show that the real search time for 4-S10 is 35 minutes, much longer than the 5 minutes for 2-S7, even though the chromosome representing 4-S10 (length 52) is slightly shorter than the chromosome representing 2-S7 (length 55) and the population size in the search for 4-S10 (3000) is smaller than that in the search for 2-S7 (5000). The search time functions above do not include the run time needed to mark unavailable nodes each time a snake node is assigned or extended in the chromosome. In general, the higher the spread, the more unavailable nodes need to be marked. The number of unavailable nodes that need to be marked in the MGA is:

(n - 1) for spread-2, (n - 1)*(n - 2) / 2 for spread-3, (n - 1)*(n - 2)*(n - 3) / 4 for spread-4, and (n - 1)*(n - 2)*(n - 3)*(n - 4) / 8 for spread-5.

Thus, for approximately the same chromosome length,

$O_k(t) = \frac{n - k + 1}{2} \, O_{k-1}(t)$

where $O_k(t)$ and $O_{k-1}(t)$ are the search times for k-spread snakes and (k-1)-spread snakes respectively. Thus, the search time for higher-spread snakes is much longer than the search time for lower-spread snakes, even though the higher-spread snakes are slightly shorter than the lower-spread snakes.
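The marking counts listed above follow one pattern: the product (n - 1)(n - 2)...(n - k + 1) divided by 2^(k-2). Dividing the count for spread k by the count for spread k - 1 gives (n - k + 1)/2. The sketch below checks this numerically; the function name is hypothetical and exact fractions are used because the quotient is not always an integer.

```python
from fractions import Fraction

def nodes_to_mark(n, k):
    """Number of unavailable nodes marked per assigned snake node in Qn for
    spread k, generalizing the four counts listed for spread 2 to spread 5."""
    if k < 2 or n < k:
        raise ValueError("expected 2 <= k <= n")
    product = 1
    for i in range(1, k):
        product *= n - i
    return Fraction(product, 2 ** (k - 2))

if __name__ == "__main__":
    n = 10
    for k in range(2, 6):
        print(k, nodes_to_mark(n, k))
    for k in range(3, 6):
        # ratio between consecutive spreads
        print(k, nodes_to_mark(n, k) / nodes_to_mark(n, k - 1))
```

For Q10 the counts are 9, 36, 126 and 378 for spread 2 to spread 5, so each step up in spread multiplies the marking work by 4, 3.5 and 3 respectively.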

The facts discussed in this section have been used to modify the MGA procedure and/or alter the experiment settings to improve both the speed and the results of the search for k-Sn snakes, which is the subject of another paper. For example, by applying Function (3), the search time for the 2-S8 snake of length 97 can be improved from 5 minutes to 1 minute by changing the population size from 5000 to 500 and the number of generations from 150 to 400. Preliminary results are promising but more tests need to be done.
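As a rough illustration, Function (3) can be evaluated for the two 2-S8 settings mentioned above. The sketch below is only a proportionality model. The function names and argument names are hypothetical, and the values r = 1, R = 2 (the replication count reported for 2-S8 in Section 7.2) and l = 100 are assumptions; r and l cancel in the ratio.

```python
def mga_search_time(p, g, r, l, R, gens_per_replication=50):
    """Function (3): (50R + g) * 2 * r * p^2 * l, in arbitrary units."""
    return (gens_per_replication * R + g) * 2 * r * p ** 2 * l

def mga_search_time_bound(p, g, r, n, k, R, gens_per_replication=50):
    """Function (4): upper bound obtained by replacing l with 2^n / k."""
    return (gens_per_replication * R + g) * r * p ** 2 * 2 ** (n + 1) / k

if __name__ == "__main__":
    before = mga_search_time(p=5000, g=150, r=1, l=100, R=2)
    after = mga_search_time(p=500, g=400, r=1, l=100, R=2)
    print(before / after)
```

Under these assumptions the ratio of the two model values is 50, larger than the measured improvement from 5 minutes to 1 minute. Function (3) leaves out costs such as marking unavailable nodes, so it is better read as a guide to the relative effect of each parameter than as a wall-clock estimate.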

7.4 Advantages of Using MGA MGA is a very versatile genetic algorithm with several advantages over other GAs. It functions the same as a GA when the number of replications is set to 0, whereas a GA cannot function as an MGA. In solving an optimization problem such as SIB, MGA can use different fitness functions in different parts of the problem to get better results in a relatively short run time, whereas a GA can use only one fitness function for the whole search. In general, a GA requires longer processing times to achieve, at best, the same results as MGA.

8. CONCLUSIONS The MGA found three new record-breaking 3-spread snakes in Q10, Q11 and Q13, all the previously known optimal snakes from spread 2 to spread 5, and the longest known maximal 3-S9 snake of length 63. It is remarkable that it found those within minutes to hours without using any previously known longest snake as a seed. This shows that MGA is a very effective technique for tackling the SIB problem. Modifications to the MGA procedure or settings have been researched to improve its search for 2-spread snakes in dimensions higher than 8. Preliminary results are promising but more tests need to be done before those results can be reported.

9. APPENDIX The transition sequence for each new record discovered by this study is listed below. The symbols in a transition sequence are 0 to 9 for the bit positions 0 to 9, and A to C for the bit positions 10 to 12.

A 3-spread snake of length 125 in Q10 is presented below:

81748675837486728175847683758470827684738576847182738674857386798174867583748672817584768375847082768473857684718273867485738

A 3-spread snake of length 158 in Q11 is presented below:

0123401520613028107240132051602410329015270132041625013208140271032501426013291A024103280174201325142061230917201530241056201328041720132504236029150271032410

A 3-spread snake of length 509 in Q13 is presented below:

0712051304120516071203150412031906150214031502170A15041203150418071205130412051607120315041203190615021403150217061504120315041B0712051304120516071203150412031806150214031502170A15041203150419071205130412051607120315041203180615021403150217061504120315041C0712051304120516071203150412031906150214031502170A15041203150418071205130412051607120315041203190615021403150217061504120315041B0712051304120516071203150412031806150214031502170A150412031504190712051304120516071203150412031806150214031502170615041203150
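A transition sequence in this notation can be checked mechanically. The sketch below is a hypothetical verifier, not part of the MGA: it assumes the snake starts at the all-zero vertex, that each symbol flips one bit, and that a k-spread snake requires every pair of vertices at path distance d to be at Hamming distance at least min(d, k).

```python
from itertools import combinations

def vertices_from_transitions(seq, start=0):
    """Expand a transition sequence into the list of visited vertices.
    Symbols '0'-'9' flip bits 0-9; 'A', 'B', 'C' flip bits 10, 11, 12."""
    path = [start]
    for symbol in seq:
        bit = int(symbol, 36)  # base-36 digit value: 'A' -> 10, 'B' -> 11, ...
        path.append(path[-1] ^ (1 << bit))
    return path

def is_k_spread_snake(seq, k):
    """True if the path described by seq satisfies the k-spread condition."""
    path = vertices_from_transitions(seq)
    for i, j in combinations(range(len(path)), 2):
        hamming = bin(path[i] ^ path[j]).count("1")
        if hamming < min(j - i, k):
            return False
    return True

if __name__ == "__main__":
    print(is_k_spread_snake("0123", 2))
    print(is_k_spread_snake("0120", 3))
```

The check compares all pairs of vertices, so its cost grows with the square of the snake length. That is acceptable for sequences of a few hundred transitions such as those listed above, and running it on them is a quick way to catch transcription errors.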

10. REFERENCES [1] Abbott, H. L. and Katchalski, M., On the construction of snake in the box codes, Utilitas Mathematica, 40, (1991), 97-116.

[2] Carlson, B. P. and Hougen, D. F., Phenotype feedback genetic algorithm operators for heuristic encoding of snakes within hypercubes, in Proceedings of GECCO'10, Portland, Oregon, ACM, (2010), 791-798.

[3] Casella, D. and Potter, W., New lower bounds for the snake-in-the-box problem: Using evolutionary techniques to hunt for snakes and coils, in Proceedings of the Florida Artificial Intelligence Research Society Conference, (2005).

[4] Douglas, R. D., Some Results on the Maximum Length of Circuits of Spread K in the d-Cube, Journal of Combinatorial Theory 6, (1969), 323-339.

[5] Hiltgen, A. P. and Paterson, K. G., Single-track circuit codes, IEEE Transactions on Information Theory, 47, (2000), 2587-2595.

[6] Hood, S., Recoskie, D., Sawada, J., and Wong, C., Snakes, coils, and single-track circuit codes with spread k, Journal of Combinatorial Optimization (May 2013).

[7] Kautz, W. H., Unit-distance error-checking codes, IRE Trans Electronic Computers, (1958), 179-180.

[8] Kinny, D., A New Approach to the Snake-In-The-Box Problem, European Conference on Artificial Intelligence (ECAI), (2012).

[9] Klee, V., The use of circuit codes in analog-to-digital conversion. In Graph Theory and its Applications. B. Harris, Ed. New York: Academic, (1970).

[10] Kochut, J., Snake-in-the-box codes for dimension 7, Combinatorial Mathematics and Combinatorial Computations, 20, (1996), 175-185.

[11] Livingston, M. and Stout, Q. Distributed resources in hypercube computers. In Proceedings 3rd Conference on Hypercube Concurrent Computers and Applications, ACM, (1988), 222-231.

[12] Paterson, K. G. and Tuliani, J. Some new circuit codes, IEEE Transactions on Information Theory, 44(3), (May 1998), 1305-1309.

[13] Potter, W., Robinson, R., Miller, J., Kochut, K., and Redys, D. Using the genetic algorithm to find snake-in-the-box codes. In Proceedings of the 7th International Conference on Industrial & Engineering Applications of Artificial Intelligence and Expert Systems, (1994) 307-314.

[14] Rajan, D. and Shende, A. Maximal and reversible snakes in hypercubes. 24th Annual Australasian Conference on Combinatorial Mathematics and Combinatorial Computation, 1999.

[15] Wynn, E., Constructing circuit codes by permuting initial sequences, arXiv:1201.1647v, (Jan. 2012).

[16] Yehezkeally, Y. and Schwartz, M., Snake-in-the-box codes for rank modulation, (2011). Available at http://arxiv.org/abs/1107.3372

[17] Zinovik, I., Chebiryak, Y. and Kroening, D., Periodic orbits and equilibria in glass models for gene regulatory networks, IEEE Transactions on Information Theory, 56(2), (Feb 2010), 805-820.
