Understanding Evolution, Part II: Sampling and Drift

In Part I, we built up a simple model of a randomly breeding population which allowed for different rates of reproduction for organisms with different genotypes. However, our tidy little model represents only a frozen moment of the dynamic process of evolutionary change over many generations. In order to understand how evolution proceeds, we need to understand the nature of the change in allele frequencies from one generation to the next. To this end, let’s start by zooming in on the genetic skeleton of what’s going on in each round of mating: Individuals are paired at random from the population to breed, and for each breeding event one of the two alleles at each locus is selected at random from each parent to be copied and combined into a new individual. All of the resultant new individuals are pooled into the new population for the next round of mating, the last population is killed off, and the cycle repeats again.

This process allows for a healthy amount of randomness: Not only are the pairings of parents and their alleles random, but the number of offspring had from each pairing can fluctuate randomly, too, due to the caprices of fate if nothing else. So the mating process is stochastic, which is just another bit of neat-sounding jargon that means “random”. But randomness in a process doesn’t mean we can’t make any predictions at all about it; in most well-behaved stochastic processes there’s going to be an expected value for its outcome and some slop around that expectation. When you’re flipping a fair coin, you expect it to come up heads about half the time, but sometimes you’ll get runs of several heads in a row, even though that’s not “supposed” to happen. But what do we mean by “expectation” and “slop”, and how much of the latter is there?

The early probability theorists of the 18th century answered these questions for several different kinds of stochastic process, and the properties of the particular process we’re interested in are pretty intuitive. In this case, each breeding event can be seen as taking two allele samples from the gene pool for one generation and throwing it into the gene pool of the next generation. Each individual sample is equivalent to the flipping of a coin that can be loaded to varying degrees such that the value of p, the probability of “heads” — or in this case, the probability of drawing some allele A — can be anything between 0 and 1. If you were playing a game such that you won a dollar every time a coin came up heads and lost a dollar every time it came up tails, over the long run you’d expect to end up no richer or poorer. Alleles are playing exactly the same game, only they’re playing for loci in the next generation and the payoff for “winning” is either one slot or nothing (i.e. 1 or 0). So the expected value of p after an arbitrary number of rounds of mating is just p. Symbolically:

E(p′) = p(1) + (1 - p)(0) = p.

So that’s your expectation, but as anyone who’s had a streak of good or bad luck at a game of chance knows, life often surprises you. This is where the slop comes in, although the more proper term for it is “spread” — the greater the spread, the more all-over-the-map the outcomes of a stochastic process will be; the smaller the spread, the more well-behaved and conformist the outcomes will be. One way to precisely characterize the spread of a stochastic process is by a very well-defined quantity called the variance. If we recall the formula for variance from a previous post, we can substitute some variables to suit our purpose and do a little algebra to get:

Var(X) = p(1 - p)2 + (1 - p)(0 - p)2 = p3 - 2p2 + p + p2 - p3 = p - p2 = p(1 - p)

And since (1 - p) = q, then the variance of a single draw is just pq. Assuming that the population size stays constant from one generation to the next, there’ll be a total of 2N draws, where N is the number of individuals in the population. Averaging the variance over the total number of draws gives us:

Var(p′) = pq / 2N

This metric can be thought of as the sampling error from one generation to the next. The take-home message of this equation is that the amount of fluctuation in allele frequencies due to sampling error is inversely proportional to population size. As N tends to infinity, variance tends to zero, and vice-versa. Which is what we ought to expect: intuitively we know that the larger a sample we draw from, the more the effects of random fluctuations will tend to average out.

The point of this brief but painful digression was an indirect approach to introducing the second most famous force in evolution, genetic drift, which is no more or less than the effect of sampling error as outlined above, repeated over many generations. In the game of drift, each allele at a locus has an even chance of becoming the ancestor of all the alleles at that locus at some unspecified future date, at which point it’s said to have swept to fixation. For a brand new mutation whose frequency is affected only by drift, the probability of sweeping to fixation is 1/2N, like rolling a many-sided die with a number of faces identical with the number of genetic slots available within the population (still assuming for simplicity a fixed population size). More generally, we can bring back the quantity cA from Part I to say that in drift conditions, the probability of an allele A sweeping to fixation is cA/2N. Again we see that probability of fixation by drift is inversely proportional to population size.

How important is such randomness in evolution, then? The answer “it depends on the population size” can now seen to be true, but is incomplete. What ultimately matters is how strong the stochastic forces of drift are relative to the systematic force of natural selection, which we’ll explore next.

The Mechanics of Statistics, Part I

One of the worst sins of math textbooks is that they rarely place the theorems and methods they present in a historical context that allows the student to see how they developed as a natural progression on earlier work in response to new problems, or how mathematicians themselves thought about the problems they grappled with. One could pass all of one’s high school math classes without ever once understanding why ellipses, parabolas and hyperbolas are collectively referred to as conics, or just what the hell e is and why it keeps popping up all over the place.

For instance, most statistics textbooks never bother to mention that the theoretical pioneers of the early 20th century such as Karl Pearson and R.A. Fisher who made statistics a full-blooded subfield of mathematics in its own right were trained first and foremost in physics, and consequently often thought of the other problems they worked on by analogy with mechanical concepts. Presumably this is because the authors of such texts don’t think this would add anything to the understanding of the concepts; as someone who finds physical and geometrical analogies greatly improve his own intuitive understanding of more abstract mathematics, this author thinks differently. The purpose of this post and its planned sequel is to draw attention to the physical interpretations of statistical concepts, in the hope that one might shed some light on the other.

Let’s say you’ve got a system of mass points rotating rigidly around a fixed center, and possess the coordinates for each mass point within some frame of reference at some moment of time, but don’t know the precise coordinates of the center of mass around which they’re turning. Not a problem: one can easily calculate the location of this center of mass by taking the average of all the individual positions of the particles, weighted by the contribution their individual masses make to the total mass of the system. If we represent masses of the individual points and total mass of the system as mi and M respectively, and the location of the points and the location of the center likewise as as ri and R, the center of mass can be defined as:

R = Σ (mi / M)ri

Where the large Greek letter sigma stands for “the sum of all”. If this looks familiar, it should — this is just a special case of calculating the mean of a distribution of data points, which in statistics is usually symbolized as:

μ = Σ (ni / N)xi

Here N is the total number of members of the distribution and ni is the number of members with the particular value xi, and dividing the former by the latter gives us the relative frequency of xi, which we can also represent by defining ni / N = P(xi), yielding μ = Σ P(xi)xi.

Incidentally, this helps make more concrete the most curious feature of a mean, which of course is that it’s entirely possible that no member of a population might actually be “average” in this precise sense of the word. But just as knowing a system’s center of mass has predictive power in plotting the trajectory of that system, the more generalized concept of a mean gives us predictive power in predicting the trajectory of all sorts of many-body systems. Theoretical abstracta that are not literally “there” in what we’d consider to be a concrete sense can be perfectly well-behaved and perfectly indispensable in even the most hard-nosed of sciences, which is a point to keep in mind when one hears complaints about the “reification” of theoretical constructs in the “softer” sciences.

Returning now to our system of particles, we can continue the analogy further. Each point about the center of mass is endowed with a moment of inertia, and so too is the total system of points itself. We know intuitively that inertia decreases as a mass point gets closer to the center of mass, which is why a figure skater in a spin starts spinning faster when she pulls her arms toward her body: there’s no net external force acting on her to increase her angular velocity, merely a reduction in her overall inertia that permits the same amount of kinetic energy to translate to faster rotation. We can define the system’s moment of inertia analogously with the way we defined its center of mass, as the sum of all the moments of inertia for all the individual points:

I = Σ mi(ri - R)2

Once again, we have in the mechanical case a special case of a more abstract statistical principle, less obvious than the first: The moment of inertia of the system bears precisely the same relationship to its center of mass that the variance of a statistical distribution bears to its mean. The variance is a measure of the dispersal of data points about the mean, and one way it’s commonly symbolized is:

Var(X) = Σ P(xi)(xi - μ)2

What makes this less analogy obvious is that when dealing with mechanical systems we generally keep our masses on an absolute scale, whereas in statistics the weights of various data points are relativized to their sum total, which is set equal to 1. But all we need to do is arbitrarily suppose that the total mass of the physical system of points is equal to one unit of whatever we wish, and the analogy becomes apparent at once.

In dealing with rotating bodies, there’s a quantity known as the radius of gyration, which describes the distribution of mass about the axis of rotation, and is defined as:

kg = (I / M)1/2 = √(I / M)

Since we’ve arbitrarily decided that M = 1, in this case kg = √I. The radius of gyration is formally identical with the standard deviation of a statistical distribution, generally denoted:

σ = √Var(X)

In both cases, this quantity is used more as a convenience when manipulating equations than out of necessity, because inertia and variance can both be thought of as being in squared units, which can be annoying to work with in some cases. Both describe the distribution equally well, but which you prefer to use will depend on what your purpose is.

Up until now we’ve been vague about the dimensions of this system of points we’re working with; inertia as we’ve used it here is a one-dimensional quantity in the sense that it’s concerned with rotation around a single given axis. But working in two or three dimensions means that you have two or three possible axes of rotation, and in general unless an object is symmetrical about all axes the moments of inertia for each axis won’t be identical. If you’re spinning a sphere, it takes the same amount of force to spin it in any direction; if you’re spinning a bottle, which axis you rotate it around makes a difference. So for irregular multidimensional objects it makes sense to break center of mass and inertia down into separate components for each axis. So too for variance, which measures the spread of data points along one dimension only.

But just as in mechanics we sometimes want to describe how a system behaves as a whole in two or three dimensions, often we want to compare data points along multiple dimensions at once. We’ll see how to do this in Part II.

Understanding Evolution, Part I: First Principles

When thinking about the Darwinian idea of evolution people sometimes start by grabbing the wrong end of the stick, treating it as if it were just a historical conjecture about the genealogical descent of organisms. Admittedly this isn’t helped by the fact that the title of Darwin’s masterwork is always shorthanded as The Origin of Species; if its longer title On the Origin of Species by Means of Natural Selection had been preserved, or shorthanded instead to merely Natural Selection, then much confusion may have been avoided. Darwin was seeking an unifying explanation of not merely how species got to be the way they are, but why.

As has been more clearly brought out by modern commentators such as Richard Dawkins, Richard Lewontin and Daniel Dennett, while Darwin spent a lot of time buttressing his argument with copious examples from the actual world around us, his core argument is in fact an abstract syllogistic one whose conclusions follow inexorably from the truth of its premises:

  • (P1) There is variation among individuals in a population of reproducing entities with finite lifespans.
  • (P2) At least some of this variation is inherited from an entity to its offspring.
  • (P3) The existing variation influences the number of offspring any given entity has, relative to the population average.
  • (C) Over many generations of reproduction and death, the entire population will come to possess the inherited traits of those individuals that had the most offspring in previous generations.

Natural selection will thus happen anywhere that there’s hereditary variation with differential reproduction, regardless of what the entities under selection are. Like conservation laws in physics, this is a simple and extremely powerful formulation that can be used to predict and explain many things. As with physical laws, we can do better than merely stating it in words and can in fact quantify it precisely — but first we have to replace the fuzzy words with crisp parameters based on commensurable properties of well-behaved objects.

In the case of the biological evolution Darwin had his sights set on, the objects we have to work with are organisms and the genetic material that defines their developmental recipe. E. coli carry around recipes inside them for making E. coli, and you carry around a recipe for making an organism like you. But within a specific species, or even a family, not all recipes are going to be the same, which is why I have blue eyes while my mother had green: there is some difference between us in the genetic locus of control over eye color. (”Locus” is just another word for “location” [on a chromosome], but geneticists prefer to use the former because it’s shorter and sounds cooler.) The bit of jargon geneticists invented to denote that difference is to say that we have a different set of alleles, which can be thought of as variations on a genetic theme. And here is where we find our first important measurable quantity: the number of copies of a given allele at a given locus over the entire population we’re interested in, which we’ll dub c.

Let’s say that there are only two alleles currently existing within a population — call them A and B. In that case, we have two quantities: cA and cB. Since these two quantities exhaust the possibilities at that locus, we can in turn derive another pair of variables from them with some elementary algebra:

cA / (cA + cB) = p
cB / (cA + cB) = q

The variables p and q are the relative frequencies of alleles A and B, respectively, within the population, and it’s these frequencies we’ll be working with from here on out. By definition, p + q = 1 because, again, they exhaustively describe the only two possibilities for any gene in our toy model. But of course, as you’ll recall from your high school biology, we actually have two genes at every locus, being diploid organisms and all, which gets important in genetics when we run up against dominance effects but only complicates the math ever so slightly. To model this, we simply square both sides of the equation like so:

(p + q)(p + q) = (1)(1)
p2 + 2pq + q2 = 1

This is typically the first equation you learn in any population genetics course, and embodies what’s called the Hardy-Weinberg principle. Now the terms refer not to genes but to genotypes (pairs of genes at a locus): p2 and q2 denote the frequencies of AA and BB genotypes, respectively, which are referred to as homozygous types. The remaining term, 2pq, denotes the frequency of heterozygotes — those with AB or BA combinations. Again this exhausts all possibilities, so the three terms sum to unity. (The algebraically sharp reader will note that this principle can be expanded to accommodate any number of alleles at a locus, but we’re keeping things simple.)

The immediate cash value that falls out from this equation is that all we need to know is the value of p (or q) in order to know the frequencies of all three possible genotypes; likewise, all we need to know is the frequency of one of the three possible genotypes and we immediately know the other two and both allele frequencies. Which, if you’re a geneticist, comes in pretty handy: let’s say you know that 1 in 2500 people suffer from a recessive genetic disorder. Since recessive diseases only appear in homozygotes, you know that:

p2 = 1/2500 = 0.0004
p = 0.02
q = 1 - 0.02 = 0.98
q2 = 0.9604

2pq = 2(0.02)(0.98) = 0.0392

So about 4% of the population are carriers for the disease. Additionally, the equation tells us something non-obvious about rare alleles: that if an allele is rare, you’re overwhelmingly more likely to find it present in a heterozygote than a homozygote — in the example just used, 98 times more. This is more than we knew a little while ago, and we haven’t even gone beyond mere allele-counting.

Despite its utility, the Hardy-Weinberg equation can only describe a terribly dull world where there’s no selection, no change in the (very large) population size, no gene flow, no nonrandom mating, no mutation, no drinking and no loud music. In the real world, different genotype bearers have different numbers of offspring, which gives rise to selection. This brings us to our next measurable quantity: the number of offspring of an organism in the population, which we’ll denote with w. If we were to measure w for every member of the population and then total them up, we could derive wM — the mean number of offspring for the entire population for one round of breeding — as well as wAA, wAB and wBB — the mean number offspring had by bearers of the three respective genotypes.

Having these quantities allows us to augment the Hardy-Weinberg equation to reflect differential reproduction, multiplying each term by its corresponding fitness value thusly:

p2wAA + 2pqwAB + q2wBB = wM

Note that we’re tacitly retaining the assumption of a fixed population size, which is to say that an increase in the average number of offspring of one type necessitates a decrease in the offspring had by one of the other types. This is the classic “struggle for existence” that occurs when a population is already at its ecology’s maximum carrying capacity. Further down the road we’ll twiddle this assumption and see what happens, but even now you can see the outlines of evolution starting to emerge. The Darwinian syllogism is now complete: we have a model of hereditary differences linked to differential reproduction, all in one equation. But there’s much more to evolution than this, as we’ll soon see.

Understanding Evolution: Prolegomena

“. . . hardly anyone ‘believes’ in evolution, in the sense that electrical engineers believe in Maxwell’s equations.”
Gregory M. Cochran (2003)

“A combination of philosophy and calculus. You can see how selection got its reputation as black magic.”
John D. Hawks (2007)

“If you can’t stand algebra, stay away from evolutionary biology.”
– John Maynard-Smith, Evolutionary Genetics (1989)

For a while I was planning on double-majoring in computer science and biology, and whenever I told people this they’d typically remark on what an odd combination it was. To someone working in genetics the combination would seem perfectly natural, but the puzzled reaction I got among laypeople — even among those that accept evolution as the grand unifying theory of biology — shows that very few people have an accurate mathematical understanding of how evolution operates.

This is the start of a series of posts geared toward those who’ve perhaps read a popular book or two on the subject of evolution, but have never taken a quantitative course on genetics and have only a vague, intuitive mental model of evolutionary dynamics. There’s a lot of exciting stuff going on in evolutionary genetics these days, but in order to make sense of it you need to be able to conceptualize the underlying process clearly and accurately. This requires math, but following Alfred Marshall’s dictum that math should be used “as a shorthand language, rather than as an engine of inquiry”, the formalisms will be simple, used sparingly and only to introduce precision, and always translated to English.

But first, let’s talk about searching. A very large class of problems can be thought of as a matter of selecting the right solution from a large space of possibilities, and so a large part of computer science research has historically been devoted to finding the most efficient ways of carrying out such searches. If your search space is big, then one of the least efficient ways of solving your problem would be to pick solutions at random, walking (or hopping) stochastically around the search space and hoping you hit on the right one. (Another inefficient method would be to crank through them in sequential order.) If it’s a truly Vast space, following such a dumb algorithm could easily leave you waiting quite literally until the end of time. But with a more intelligent algorithm, you can have the right answer by lunchtime, never mind doomsday.

Take a simple example that most children are familiar with: The hot/cold game, where one player conceals an object in a large environment and the other has to find it, with the help of feedback from the first player telling them whether they’re getting closer or farther from the hidden goal. If you were looking through a large house for some small object with no clue as to its whereabouts, you could spend all day (or several days) tearing the place apart in your search before finding it, if you even found it at all. But when you have someone there to tell you when you’re getting “warmer” or “colder”, you can often find it in just a few minutes. There’s nothing about this game that intrinsically requires human levels of smarts: It could be carried out by a couple of dumb robots, because all both players are doing is executing an algorithm.

This, in a nutshell, is what natural selection does. Given a high-dimensional design space of vastly many possible phenotypes, vanishingly few of which will even be viable (let alone thrive), you need an efficient search algorithm if you’re going to find any viable forms of life before the heat death of the universe. This is what R. A. Fisher meant when he described natural selection as “a mechanism for generating an exceedingly high degree of improbability” — improbable though it may be to find viable life forms in the space of possible forms, once the necessary material is available natural selection can propel a population toward prime real estate with surprising rapidity. In subsequent installments of this series we’ll see more precisely how it does this, starting with a few simple measurable parameters and basic transition rules, and deducing some surprising results.

(A note on presentation: One often has the choice of modeling a dynamic process such as evolution with either discrete or continuous mathematics. As alluded to by Hawks in the quote above, the continuous representation is arguably more elegant and gets more swiftly and smoothly to the heart of things, but since most people tune out as soon as they see derivatives and since most presentations of evolutionary genetics rely on discrete algebraic models, I’ll generally hew to the discrete tradition in what follows except where logistic functions are simply unavoidable.)

Eadem Mutata Resurgo

The above bit of Latin adorns the grave of Jakob Bernoulli, and translates to English as “though transformed, I arise again the same”. In other words, here we go again.