Genetic Algorithms

Polish Your Chromosomes

by Matt Neuburg

Matt Neuburg is the author of REALbasic: The Definitive Guide, and a member of the editorial board of REALbasic Developer. This article was originally published in REALbasic Developer Issue 1.6 (June/July 2003).

Genetic algorithms are the subject of a great deal of research and scientific literature (see reference 1). I don’t know enough about them to have an opinion as to whether they should truly be considered algorithms at all, nor do I know whether they can really be useful problem-solving devices or whether they are more a kind of game or simulation. But they are certainly an intriguing topic, and lots of fun to play with; and there isn’t the slightest reason why REALbasic shouldn’t be your playpen.

My own mild foray into the world of genetic algorithms was sparked by an online article by Teodor Zlatanov implementing some genetic algorithms in Perl (see reference 2). At first I merely wanted to translate Zlatanov’s implementation into REALbasic. But I soon discovered some incoherencies in his algorithm, and then, in the course of correcting these, found myself exploring genetic algorithms a little more deeply.

A genetic algorithm is a simulation of Mendelian/Darwinian evolution. Start with a population of individuals, each endowed with some DNA. We regard each individual’s DNA as a possible answer to a problem, and we rank it according to how good an answer it is. We take the individuals with the best DNA and mate them, generating a new population, repeating the process again and again. The idea is that over many generations the population’s DNA should be greatly improved, with the best DNA representing a much better answer to our problem. This answer, you understand, will not be the product of our own devising, but of the rules of selection and breeding, combined with the randomness inherent in those rules. Evolution itself will “find” an answer.

My example, which comes directly from Zlatanov, uses a bytestream as the individual’s DNA. We translate every byte into a character, either a digit (0 through 9) or a capital letter (A through Z) or a space, thus ending up with a sequence of “words”, which we enter into an array; we then evaluate the DNA by treating this array, word by word, as follows.

The DNA’s initial value is zero. A word like “9BJ6” is bad, and reduces the cumulative value of the DNA by multiplying it by 0.8. A word comprising entirely digits, like “46”, adds to the cumulative value of the DNA a number of points equal to its length. The word “M” doubles the cumulative value of the DNA. And a word like “AG12589” — the letter “A” followed by only non-digits followed by only digits — multiplies the cumulative value of the DNA by the numeric value of its digits part.

Now, to be sure, we’re not really attempting to solve any earth-shattering problem here; we’re just playing around, to see how our population will evolve over time in response to the pressure of these rules. And the interesting thing is that it evolves very well indeed. In a typical run of 100 generations, starting with completely random DNA, I ended up with a generation where the best individual’s DNA (ignoring extra spaces) was this:

458 7 O 3 0 E1 2 9 8K A7 M 1 9 A826514143 A86655049164448

Like Zlatanov, I find I can’t help thinking of my population as made up of live entities, a Petrie dish of evolving bacteria. Well, these are some clever bacteria! We find only three “bad” words out of fifteen. The bacteria have taken advantage of the “M” rule, but not much, because it isn’t really worth wasting words on. Mostly, they’ve seen, the way to go is to take advantage of the rule about “A” followed by a number, and in particular to use big numbers and to put them at the end of the bytestream where they do the most good. I find the consistency and insightfulness of these results quite astounding. There is a best strategy for implementing the given rules to achieve maximal results, and our bacteria have found it.

My implementation starts with a class called Individual; the Population is an array of 1024 Individuals. An Individual has a DNA property, which is a memoryblock of size 64. It also has a Fitness (double) property, to hold the results of evaluating its DNA, and some other properties that aren’t important right now.

The application’s core loop lists the steps typical of a genetic algorithm.

Core loop:

  do
    evaluateFitness
    letTotallyUnfitDie
    selectParents
    breed
    mutate
    generation = generation + 1
  loop until generation >= generation_max

EvaluateFitness works out the value of each Individual’s DNA and assigns this to its Fitness property.

LetTotallyUnfitDie removes from the Population all Individuals whose Fitness is below some minimal threshold (0.1 in my implementation). I don’t know whether this step is entirely necessary; it doesn’t seem to be present in most genetic algorithm implementations, and in practice I find it largely superfluous, since after a few generations all Individuals score above the threshold anyway. But it was present in Zlatanov’s implementation, which I was imitating, and it seemed to do no harm, so I left it in.

With SelectParents, we come to the really interesting part. We wish to obtain a subpopulation from which we will draw all the mating pairs that will pass their combined DNA along to the next generation. One obvious way would be simply to throw away the entire lower-ranked portion of the population; but this seems more like Brave New World than survival of the fittest. It seems kinder, and more in keeping with the evolutionary principles of natural selection, to choose our potential parents at random, but with a randomization weighted in terms of each individual’s fitness; thus, the fittest individual will very likely be allowed to breed, and the least fit individual very likely not, but without complete certainty in the matter.

At first, therefore, my approach, inspired by Zlatanov, was to imagine that I was handing out tickets to the Individuals; each ticket would entitle its possessor to mate once. Since I was planning to replace the entire population after each generation, and since it takes two parents to make a child, I clearly needed a number of tickets twice the size of the population. I proposed to hand out these tickets at random, but weighted in proportion to the Individuals’ Fitness value.

I implemented this approach, and it worked. But I noticed, after several runs of the program, that there was a strong tendency for the population essentially to stop evolving after just a dozen or so generations. What was happening was that an individual would emerge whose DNA was so much fitter than everyone else’s that he would get most of the tickets, most of the children would be his, and most of the next generation would look like him. It looked like a solution, but it wasn’t; on the contrary, this individual was stifling the further evolution of the population as a whole.

At this point I went back to the web, and found a superb set of pages by Harmut Polheim (see reference 3). It turns out, as Polheim explains, that this is a well-known consequence of my first approach — “proportional fitness assignment” tends towards “premature convergence”. The solution, Polheim says, is to use “rank-based fitness assignment”. Here the population is sorted into fitness order and then each individual is assigned a new fitness based entirely on his position in the sort, according to a formula that Polheim provides (“linear ranking”). Thus it doesn’t matter whether the best individual is a little fitter or a lot fitter than the second best; we take account only of the fact that he is fitter. I applied Polheim’s formula and stored the results in a new property of each Individual, which I called FitnessToBreed.

The question still remains of how to actually hand out the tickets — that is, how to obtain a weighted sample. I used the same system in both implementations; it’s just that, on the second go, the weight was the Individuals’ FitnessToBreed instead of their Fitness. First I create an array of doubles, called Bins. Into each Bin I put the cumulative total of all the Individuals’ FitnessToBreed up to that point. Thus, the first Bin contains the FitnessToBreed of the first Individual; the second Bin contains the FitnessToBreed of the first Individual plus that of the second Individual; and so forth, up to the last Bin which contains the total of all the FitnessToBreed values. To hand out a ticket, I generate a random number smaller than that total, and run up the Bins looking for the first one whose value isn’t smaller than this random number. The Individual corresponding to that Bin gets the ticket.

Of course I could have done this without the Bins: I could have picked my random number and then run through the Individuals, adding their FitnessToBreed values as I went along. But this would have involved performing the same additions to obtain the same sums, over and over, for each of 2048 tickets. The Bins are simply a way of precalculating those sums, once per generation. Anyway, it turned out on reading Polheim later, that my approach here was quite standard — it’s called “roulette wheel selection” or “stochastic sampling with replacement”. It also turns out that it is only one of many weighted random selection techniques, but I’m not going to go into the details here.

Here, then, is the actual code I ended up using for handing out the tickets. In next issue’s column I’ll describe my implementation of mating, breeding, and mutation.

SelectParents:

  dim i,u as integer, sp as double
  u = ubound(population)
  dim bins(-1) as double
  redim bins(u)
  heapsort population // see RBD 1.3 
  sp = selective_pressure // a constant, 1.5 here 
  for i = 0 to u // formula from Polheim 
    population(i).fitnessToBreed = 2.0 - sp + 2.0 * (sp - 1.0) * i / u
  next
  dim total as double
  for i = 0 to u
    total = total + population(i).fitnessToBreed
    bins(i) = total
  next
  u = popsize*2
  for i = 1 to u
    population(sample(bins)).getParentingTicket
  next

Sample:

  dim i,u as integer, target as double
  u = ubound(bins)
  target = app.r.lessThan (ceil(bins(u)))
  for i = 0 to u
    if target <= bins(i) then return i
  next
  return u // just in case
References
1. http://www.scs.carleton.ca/~csgs/resources/gaal.html
2. http://www-106.ibm.com/developerworks/linux/library/l-genperl2/?ca=dgr-lnxw03GenAlg2
3. http://www.geatbx.com/docu/algindex-01.html#TopOfPage