
Now this series on information geometry will take an unexpected turn toward 'green mathematics'. Lately I've been talking about relative entropy. Now I'll say how this concept shows up in the study of evolution!
That's an unexpected turn to me, at least. I learned of this connection just two days ago in a conversation with Marc Harper, a mathematician who is a postdoc in bioinformatics at UCLA, working with my friend Chris Lee. I was visiting Chris for a couple of days after attending the thesis defenses of some grad students of mine who just finished up at U.C. Riverside. Marc came by and told me about this paper:
and now I can't resist telling you.
First of all: what does information theory have to do with biology? Let me start with a very general answer: biology is different from physics because biological systems are packed with information you can't afford to ignore.
Physicists love to think about systems that take only a little information to describe. So when they get a system that takes a lot of information to describe, they use a trick called 'statistical mechanics', where you try to ignore most of this information and focus on a few especially important variables. For example, if you hand a physicist a box of gas, they'll try to avoid thinking about the state of each atom, and instead focus on a few macroscopic quantities like the volume and total energy. Ironically, the mathematical concept of information arose first here—although they didn't call it information back then; they called it 'entropy'. The entropy of a box of gas is precisely the amount of information you've decided to forget when you play this trick of focusing on the macroscopic variables. Amazingly, remembering just this—the sheer amount of information you've forgotten—can be extremely useful... at least for the systems physicists like best.
But biological systems are different. They store lots of information (for example in DNA), transmit lots of information (for example in the form of biochemical signals), and collect a lot of information from their environment. And this information isn't uninteresting 'noise', like the positions of atoms in a gas. The details really matter. Thus, we need to keep track of lots of information to have a chance of understanding any particular biological system.
So, part of doing biology is developing new ways to think about physical systems that contain lots of relevant information. This is why physicists consider biology 'messy'. It's also why biology and computers go hand in hand in the subject called 'bioinformatics'. There's no avoiding this: in fact, it will probably force us to automate the scientific method! That's what Chris Lee and Marc Harper are really working on:
• Chris Lee, General information metrics for automated experiment planning, presentation in the UCLA Chemistry & Biochemistry Department faculty luncheon series, 2 May 2011.
But more about that some other day. Let me instead give another answer to the question of what information theory has to do with biology.
There's an analogy between evolution and the scientific method. Simply put, life is an experiment to see what works; natural selection weeds out the bad guesses, and over time the better guesses predominate. This process transfers information from the world to the 'experimenter': the species that's doing the evolving, or the scientist. Indeed, the only way the experimenter can get information is by making guesses that can be wrong.
All this is simple enough, but the nice thing is that we can make it more precise.
On the one hand, there's a simple model of the scientific method called 'Bayesian inference'. Assume there's a set of mutually exclusive alternatives: possible ways the world can be. And suppose we start with a 'prior probability distribution': a preconceived notion of how probable each alternative is. Say we do an experiment and get a result that depends on which alternative is true. We can work out how likely this result was given our prior, and—using a marvelously simple formula called Bayes' rule—we can use this to update our prior and obtain a new improved probability distribution, called the 'posterior probability distribution'.
On the other hand, suppose we have a species with several different possible genotypes. A population of this species will start with some number of organisms with each genotype. So, we get a probability distribution saying how likely it is that an organism has any given genotype. These genotypes are our 'mutually exclusive alternatives', and this probability distribution is our 'prior'. Suppose each generation the organisms have some expected number of offspring that depends on their genotype. Mathematically, it turns out this is just like updating our prior using Bayes' rule! The result is a new probability distribution of genotypes: the 'posterior'.
I learned about this from Chris Lee on the 19th of December, 2006. In my diary that day, I wrote:
The analogy is mathematically precise, and fascinating. In rough terms, it says that the process of natural selection resembles the process of Bayesian inference. A population of organisms can be thought of as having various 'hypotheses' about how to survive—each hypothesis corresponding to a different allele. (Roughly, an allele is one of several alternative versions of a gene.) In each successive generation, the process of natural selection modifies the proportion of organisms having each hypothesis, according to Bayes' rule!Now let's be more precise:
Bayes' rule says if we start with a 'prior probability' for some hypothesis to be true, divide it by the probability that some observation is made, then multiply by the 'conditional probability' that this observation will be made given that the hypothesis is true, we'll get the 'posterior probability' that the hypothesis is true given that the observation is made.
Formally, the exact same equation shows up in population genetics! In fact, Chris showed it to me—it's equation 9.2 on page 30 of this book:
• R. Bürger, The Mathematical Theory of Selection, Recombination and Mutation, section I.9: Selection at a single locus, Wiley, 2000.
But, now all the terms in the equation have different meanings!
Now, instead of a 'prior probability' for a hypothesis to be true, we have the frequency of occurrence of some allele in some generation of a population. Instead of the probability that we make some observation, we have the expected number of offspring of an organism. Instead of the 'conditional probability' of making the observation, we have the expected number of offspring of an organism given that it has this allele. And, instead of the 'posterior probability' of our hypothesis, we have the frequency of occurrence of that allele in the next generation.
(Here we are assuming, for simplicity, an asexually reproducing 'haploid' population  that is, one with just a single set of chromosomes.)
This is a great idea—Chris felt sure someone must have already had it. A natural context would be research on genetic programming, a machine learning technique that uses an evolutionary algorithm to optimize a population of computer programs according to a fitness landscape determined by their ability to perform a given task. Since there has also been a lot of work on Bayesian approaches to machine learning, surely someone has noticed their mathematical relationship?
I see that Amy Perfors found these ideas as new and exciting as I did. But I still can't believe Chris was the first to clearly formulate them, so I'd still like to know who did.
Marc Harper actually went to work with Chris after reading that diary entry of mine. By now he's gone a lot further with this analogy by focusing on the role of information. As we keep updating our prior using Bayes' rule, we should be gaining information about the real world. This idea has been made very precise in the theory of 'machine learning'. Similarly, as a population evolves through natural selection, it should be gaining information about its environment.
I've been talking about Bayesian updating as a discretetime process: something that happens once each generation for our population. That's fine and dandy, definitely worth studying, but Marc's paper focuses on a continuoustime version called the 'replicator equation'. It goes like this. Let \(X\) be the set of alternative genotypes. For each \(i \in X\), let \( P_i \) be the number of organisms that have the \( i \)th genotype at time \(t\). Say that
$$\displaystyle{ \frac{d P_i}{d t} = f_i P_i } $$where \(f_i\) is the fitness of the \( i\)th genotype. Let \(p_i\) be the probability that at time \(t\), a randomly chosen organism will have the \(i\)th genotype:
$$\displaystyle{ p_i = \frac{P_i}{\sum_{i \in X} P_i } }$$Then a little calculus gives the replicator equation:
$$\displaystyle{\frac{d p_i}{d t} = \left( f_i  \langle f \rangle \right) \, p_i } $$where
$$\langle f \rangle = \sum_{i \in X} f_i p_i $$is the mean fitness of the organisms. So, the fraction of organisms of the \(i\)th type grows at a rate proportional to the fitness of that type minus the mean fitness. It ain't enough to be good: you gotta be better than average.
Note that all this works not just when each fitness \(f_i\) is a mere number, but also when it's a function of the whole list of probabilities \(p_i\). That's good, because in the real world, the fitness of one kind of bug may depend on the fraction of bugs of various kinds.
But what does all this have to do with information?
Marc's paper has a lot to say about this! But just to give you a taste, here's a simple fact involving relative entropy, which was first discovered by Ethan Atkin. Suppose evolution as described by the replicator equation brings the whole list of probabilities \(p_i\)—let's call this list \(p\)—closer and closer to some stable equilibrium, say \(q\). Then if a couple of technical conditions hold, the entropy of \(q\) relative to \(p\) keeps decreasing, and approaches zero.
Remember what I told you about relative entropy. In Bayesian inference, the entropy \(q\) relative to \(p\) is how much information we gain if we start with \(p\) as our prior and then do an experiment that pushes us to the posterior \(q\). So, in simple rough terms: as it approaches a stable equilibrium, the amount of information a species has left to learn keeps dropping, and goes to zero!
I won't fill in the precise details, because I bet you're tired already. You can find them in Section 3.5, which is called "KullbackLeibler Divergence is a Lyapunov function for the Replicator Dynamic". If you know all the buzzwords here, you'll be in buzzword heaven now. 'KullbackLeibler divergence' is just another term for relative entropy. 'Lyapunov function' means that it keeps dropping and goes to zero. And the 'replicator dynamic' is the replicator equation I described above.
Perhaps next time I'll say more about this stuff. For now, I just hope you see why it makes me so happy.
First, it uses information geometry to make precise the sense in which evolution is a process of acquiring information. That's very cool. We're looking at a simplified model—the replicator equation—but doubtless this is just the beginning of a very long story that keeps getting deeper as we move to less simplified models.
Second, if you read my summary of Chris Canning's talks on evolutionary game theory, you'll see everything I just said meshes nicely with that. He was taking the fitness \(f_i\) to be
$$f_i = \sum_{j \in X} A_{i j} p_j $$where the payoff matrix \(A_{i j}\) describes the 'winnings' of an organism with the \(i\)th genotype when it meets an organism with the \(j\)th genotype. This gives a particularly nice special case of the replicator equation.
Third, this particularly nice special case happens to be the rate equation for a certain stochastic Petri net. So, we've succeeded in connecting the 'diagram theory' discussion to the 'information geometry' discussion! This has all sort of implications, which will take quite a while to explore.
As the saying goes, in mathematics:
Everything sufficiently beautiful is connected to all other beautiful things.
You can read a discussion of this article on Azimuth, and make your own comments or ask questions there!
