Understanding Evolution, Part II: Sampling and Drift
In Part I, we built up a simple model of a randomly breeding population which allowed for different rates of reproduction for organisms with different genotypes. However, our tidy little model represents only a frozen moment of the dynamic process of evolutionary change over many generations. In order to understand how evolution proceeds, we need to understand the nature of the change in allele frequencies from one generation to the next. To this end, let’s start by zooming in on the genetic skeleton of what’s going on in each round of mating: Individuals are paired at random from the population to breed, and for each breeding event one of the two alleles at each locus is selected at random from each parent to be copied and combined into a new individual. All of the resultant new individuals are pooled into the new population for the next round of mating, the last population is killed off, and the cycle repeats again.
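For readers who think better in code, the mating cycle just described can be sketched roughly as follows. This is a toy illustration under stated assumptions, not a model taken from Part I: the function names `next_generation` and `allele_frequency`, the single locus with alleles "A" and "a", and the choice to draw a fresh random pair of parents for every offspring are all made up for the example.

```python
import random

def next_generation(population, n_offspring, rng):
    """One round of random mating at a single locus.

    population: list of (allele, allele) tuples, one per diploid individual.
    Returns n_offspring new genotypes; the parental generation is discarded.
    Assumption: a new pair of parents is drawn for every offspring.
    """
    offspring = []
    for _ in range(n_offspring):
        # Pick two distinct individuals at random to breed.
        mother, father = rng.sample(population, 2)
        # Each parent passes on one of its two alleles with equal probability.
        offspring.append((rng.choice(mother), rng.choice(father)))
    return offspring

def allele_frequency(population, allele="A"):
    """Fraction of the 2N allele slots occupied by the given allele."""
    copies = sum(genotype.count(allele) for genotype in population)
    return copies / (2 * len(population))

rng = random.Random(1)  # fixed seed so the run is repeatable
parents = [("A", "a")] * 50 + [("a", "a")] * 50  # 50 copies of A in 200 slots
children = next_generation(parents, 100, rng)
print(allele_frequency(parents), allele_frequency(children))
```

Because parents are drawn afresh for each offspring, some pairs end up with several children and most with none, which is one simple way of letting family size vary by chance.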
This process allows for a healthy amount of randomness: Not only are the pairings of parents and their alleles random, but the number of offspring produced by each pairing can fluctuate randomly, too, due to the caprices of fate if nothing else. So the mating process is stochastic, which is just another bit of neat-sounding jargon that means “random”. But randomness in a process doesn’t mean we can’t make any predictions at all about it; a well-behaved stochastic process has an expected value for its outcome and some slop around that expectation. When you’re flipping a fair coin, you expect it to come up heads about half the time, but sometimes you’ll get runs of several heads in a row, even though that’s not “supposed” to happen. But what do we mean by “expectation” and “slop”, and how much of the latter is there?
The early probability theorists of the 18th century answered these questions for several different kinds of stochastic process, and the properties of the particular process we’re interested in are pretty intuitive. In this case, each breeding event can be seen as taking two allele samples from the gene pool of one generation and throwing them into the gene pool of the next. Each individual sample is equivalent to flipping a coin that can be loaded to varying degrees, such that the value of p, the probability of “heads” (or in this case, the probability of drawing some allele A), can be anything between 0 and 1. If you were playing a game with a fair coin in which you won a dollar every time it came up heads and lost a dollar every time it came up tails, over the long run you’d expect to end up no richer or poorer. Alleles are playing the same kind of game, only they’re playing for loci in the next generation, and the payoff for each draw is either one slot or nothing (i.e. 1 or 0). A single draw therefore yields A with probability p, so the expected frequency of A in the next generation is just p; and because the same argument applies to every generation, the expected value of p after an arbitrary number of rounds of mating is still p. Symbolically:
E(p′) = p(1) + (1 - p)(0) = p.
So that’s your expectation, but as anyone who’s had a streak of good or bad luck at a game of chance knows, life often surprises you. This is where the slop comes in, although the more proper term for it is “spread” — the greater the spread, the more all-over-the-map the outcomes of a stochastic process will be; the smaller the spread, the more well-behaved and conformist the outcomes will be. One way to precisely characterize the spread of a stochastic process is by a very well-defined quantity called the variance. If we recall the formula for variance from a previous post, we can substitute some variables to suit our purpose and do a little algebra to get:
Var(X) = p(1 - p)² + (1 - p)(0 - p)² = p³ - 2p² + p + p² - p³ = p - p² = p(1 - p)
Since 1 - p = q, the variance of a single draw is just pq. Assuming that the population size stays constant from one generation to the next, there’ll be a total of 2N draws, where N is the number of individuals in the population. The new allele frequency p′ is the average of those 2N independent draws, and the variance of an average of independent draws is the single-draw variance divided by the number of draws, which gives us:
Var(p′) = pq / (2N)
This metric can be thought of as the sampling error from one generation to the next. The take-home message of this equation is that the amount of fluctuation in allele frequencies due to sampling error, measured as a variance, is inversely proportional to population size. As N tends to infinity, the variance tends to zero; as N shrinks, the variance grows. Which is what we ought to expect: intuitively we know that the larger the sample we draw, the more the effects of random fluctuations will tend to average out.
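If you’d rather check the expectation and variance formulas than take them on faith, a quick simulation will do it. The sketch below is only an illustration: it treats one generation as 2N independent draws of allele A with probability p, and the helper names `sample_next_frequency` and `simulate`, the starting frequency of 0.3, and the replicate counts are arbitrary choices for the example.

```python
import random
from statistics import mean, pvariance

def sample_next_frequency(p, n_individuals, rng):
    """Frequency of A after drawing 2N alleles, each A with probability p."""
    draws = 2 * n_individuals
    copies = sum(1 for _ in range(draws) if rng.random() < p)
    return copies / draws

def simulate(p, n_individuals, replicates, seed=0):
    """Mean and variance of p' across many independent single generations."""
    rng = random.Random(seed)
    freqs = [sample_next_frequency(p, n_individuals, rng)
             for _ in range(replicates)]
    return mean(freqs), pvariance(freqs)

p = 0.3
for n in (10, 100, 1000):
    observed_mean, observed_var = simulate(p, n, replicates=2000)
    predicted_var = p * (1 - p) / (2 * n)
    print(f"N={n:5d}  mean={observed_mean:.4f}  "
          f"var={observed_var:.6f}  pq/2N={predicted_var:.6f}")
```

The observed mean should hover near p for every population size, while the observed variance should shrink roughly tenfold each time N grows tenfold, in line with pq / 2N.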
The point of this brief but painful digression was to introduce, indirectly, the second most famous force in evolution, genetic drift, which is no more or less than the effect of sampling error as outlined above, repeated over many generations. In the game of drift, each allele copy at a locus has an equal chance of becoming the ancestor of all the alleles at that locus at some unspecified future date, at which point it’s said to have swept to fixation. For a brand new mutation whose frequency is affected only by drift, the probability of sweeping to fixation is 1/(2N), like rolling a many-sided die with as many faces as there are genetic slots available within the population (still assuming for simplicity a fixed population size). More generally, we can bring back the quantity cA from Part I to say that in drift conditions, the probability of an allele A sweeping to fixation is cA/(2N). Again we see that the probability of fixation by drift is inversely proportional to population size.
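The fixation claim can also be tested by brute force. The following is a minimal sketch, assuming a fixed population of N diploid individuals, no selection, and a fresh set of 2N draws every generation; the function names `drift_to_fixation` and `fixation_probability` are invented for the example, and the argument `copies` reflects my reading of cA as the current number of copies of allele A.

```python
import random

def drift_to_fixation(copies, n_individuals, rng):
    """Resample 2N alleles each generation until A is fixed or lost.

    Returns True if A reaches fixation, False if it disappears.
    """
    total = 2 * n_individuals
    while 0 < copies < total:
        p = copies / total
        # Next generation: 2N independent draws, each A with probability p.
        copies = sum(1 for _ in range(total) if rng.random() < p)
    return copies == total

def fixation_probability(copies, n_individuals, replicates, seed=0):
    """Fraction of replicate populations in which A sweeps to fixation."""
    rng = random.Random(seed)
    fixed = sum(drift_to_fixation(copies, n_individuals, rng)
                for _ in range(replicates))
    return fixed / replicates

n = 10  # 2N = 20 allele slots
for start in (1, 5, 10):
    estimate = fixation_probability(start, n, replicates=5000)
    print(f"copies={start:2d}  estimated={estimate:.3f}  "
          f"predicted={start / (2 * n):.3f}")
```

With only 20 slots the runs finish quickly, and the estimated fixation rates should land close to the starting frequency of A; larger populations give the same answer but take longer to resolve.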
How important is such randomness in evolution, then? The answer “it depends on the population size” can now be seen to be true, but it is incomplete. What ultimately matters is how strong the stochastic forces of drift are relative to the systematic force of natural selection, which we’ll explore next.