
It's time to continue this information geometry series, because I've promised to give the following talk at a conference on the mathematics of biodiversity in early July... and I still need to do some of the research!
Diversity, information geometry and learning
As is well known, some measures of biodiversity are formally identical to measures of information developed by Shannon and others. Furthermore, Marc Harper has shown that the replicator equation in evolutionary game theory is formally identical to a process of Bayesian inference, which is studied in the field of machine learning using ideas from information geometry. Thus, in this simple model, a population of organisms can be thought of as a 'hypothesis' about how to survive, and natural selection acts to update this hypothesis according to Bayes' rule. The question thus arises to what extent natural changes in biodiversity can be usefully seen as analogous to a form of learning. However, some of the same mathematical structures arise in the study of chemical reaction networks, where the increase of entropy, or more precisely decrease of free energy, is not usually considered a form of 'learning'. We report on some preliminary work on these issues.
So, let's dive in! To some extent I'll be explaining these two papers:
However, I hope to bring in some more ideas from physics, the study of biodiversity, and the theory of stochastic Petri nets, also known as chemical reaction networks. So, this series may start to overlap with my network theory posts. We'll see. We won't get far today: for now, I just want to review and expand on what we did last time.
The replicator equation is a simplified model of how populations change. Suppose we have $n$ types of self-replicating entity. I'll call these entities replicators. I'll call the types of replicators species, but they don't need to be species in the biological sense. For example, the replicators could be genes, and the types could be alleles. Or the replicators could be restaurants, and the types could be restaurant chains.
Let $P_i(t),$ or just $P_i$ for short, be the population of the $i$th species at time $t.$ Then the replicator equation says
$$ \displaystyle{ \frac{d P_i}{d t} = f_i(P_1, \dots, P_n) \, P_i } $$
So, the population $P_i$ changes at a rate proportional to $P_i,$ but the 'constant of proportionality' need not be constant: it can be any smooth function $f_i$ of the populations of all the species. We call $f_i(P_1, \dots, P_n)$ the fitness of the $i$th species.
Of course this model is absurdly general, while still leaving out lots of important effects, like the spatial variation of populations, or the ability for the population of some species to start at zero and become nonzero—which happens thanks to mutation. Nonetheless this model is worth taking a good look at.
Using the magic of vectors we can write
$$ P = (P_1, \dots , P_n)$$
and
$$ f(P) = (f_1(P), \dots, f_n(P))$$
This lets us write the replicator equation a wee bit more tersely as
$$ \displaystyle{ \frac{d P}{d t} = f(P) P} $$
where on the right I'm multiplying vectors componentwise, the way your teachers tried to brainwash you into never doing:
$$ f(P) P = (f(P)_1 P_1, \dots, f(P)_n P_n) $$
In other words, I'm thinking of $P$ and $f(P)$ as functions on the set $\{1, \dots, n\}$ and multiplying them pointwise. This will be a nice way of thinking if we want to replace this finite set by some more general space.
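To make the componentwise product concrete, here's a minimal Python sketch of one way to step the replicator equation forward numerically. The forward-Euler method, the step size, and the two-species `example_fitness` function are all assumptions I've made up for illustration; any smooth fitness function would do.

```python
from typing import Callable, List

Vector = List[float]


def pointwise_product(a: Vector, b: Vector) -> Vector:
    """Multiply two vectors componentwise: (a_1 b_1, ..., a_n b_n)."""
    return [x * y for x, y in zip(a, b)]


def euler_step(P: Vector, fitness: Callable[[Vector], Vector], dt: float) -> Vector:
    """Advance dP/dt = f(P) P by one forward-Euler step of size dt."""
    dP = pointwise_product(fitness(P), P)
    return [p + dt * d for p, d in zip(P, dP)]


def example_fitness(P: Vector) -> Vector:
    """Hypothetical fitness: both species are hurt by the total population."""
    total = sum(P)
    return [1.0 - 0.01 * total, 0.5 - 0.01 * total]


# Illustrative starting populations and step size.
P = [10.0, 20.0]
for _ in range(1000):
    P = euler_step(P, example_fitness, dt=0.01)
print(P)
```

Note that a species whose population starts at zero stays at zero under this update, which is the limitation mentioned earlier: the model has no mutation.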
Why would we want to do that? Well, we might be studying lizards with different length tails, and we might find it convenient to think of the set of possible tail lengths as the half-line $[0,\infty)$ instead of a finite set.
Or, just to get started, we might want to study the pathetically simple case where $f(P)$ doesn't depend on $P.$ Then we just have a fixed function $f$ and a time-dependent function $P$ obeying
$$ \displaystyle{ \frac{d P}{d t} = f P} $$
If we're physicists, we might write $P$ more suggestively as $\psi$ and write the operator multiplying by $f$ as $-H.$ Then our equation becomes
$$ \displaystyle{ \frac{d \psi}{d t} = - H \psi } $$
This looks a lot like Schrödinger's equation, but since there's no factor of $\sqrt{-1},$ and $\psi$ is real-valued, it's more like the heat equation or the 'master equation', the basic equation of stochastic mechanics.
For an explanation of Schrödinger's equation and the master equation, try Part 12 of the network theory series. In that post I didn't include a minus sign in front of the $H.$ That's no big deal: it's just a different convention than the one I want today. A more serious issue is that in stochastic mechanics, $\psi$ stands for a probability distribution. This suggests that we should get probabilities into the game somehow.
Luckily, that's exactly what people usually do! Instead of talking about the population $P_i$ of the $i$th species, they talk about the probability $p_i$ that one of our organisms will belong to the $i$th species. This amounts to normalizing our populations:
$$ \displaystyle{ p_i = \frac{P_i}{\sum_j P_j} } $$
Don't you love it when notations work out well? Our big Population $P_i$ has gotten normalized to give little probability $p_i.$
How do these probabilities $p_i$ change with time? Now is the moment for that least loved rule of elementary calculus to come out and take a bow: the quotient rule for derivatives!
$$ \displaystyle{ \frac{d p_i}{d t} = \left(\frac{d P_i}{d t} \sum_j P_j \quad - \quad P_i \sum_j \frac{d P_j}{d t}\right) / \left( \sum_j P_j \right)^2 }$$
Using our earlier version of the replicator equation, this gives:
$$ \displaystyle{ \frac{d p_i}{d t} = \left(f_i(P) P_i \sum_j P_j \quad - \quad P_i \sum_j f_j(P) P_j \right) / \left( \sum_j P_j \right)^2 }$$
Using the definition of $p_i,$ this simplifies to:
$$ \displaystyle{ \frac{d p_i}{d t} = f_i(P) p_i \quad - \quad \left( \sum_j f_j(P) p_j \right) p_i }$$
The stuff in parentheses actually has a nice meaning: it's just the mean fitness. In other words, it's the average, or expected, fitness of an organism chosen at random from the whole population. Let's write it like this:
$$ \displaystyle{ \langle f(P) \rangle = \sum_j f_j(P) p_j } $$
So, we get the replicator equation in its classic form:
$$ \displaystyle{ \frac{d p_i}{d t} = \Big( f_i(P) - \langle f(P) \rangle \Big) \, p_i }$$
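If you'd like to check the algebra numerically, here's a small Python sketch that compares the quotient-rule expression for $d p_i / d t$ with the classic form, fitness minus mean fitness times $p_i.$ The populations and fitness values are made-up numbers, and the function names are my own, chosen just for this illustration.

```python
from typing import List


def normalize(P: List[float]) -> List[float]:
    """Turn populations P_i into probabilities p_i = P_i / sum_j P_j."""
    total = sum(P)
    return [x / total for x in P]


def mean_fitness(f: List[float], p: List[float]) -> float:
    """Expected fitness <f> = sum_j f_j p_j."""
    return sum(fj * pj for fj, pj in zip(f, p))


def dp_dt_classic(f: List[float], p: List[float]) -> List[float]:
    """Right-hand side of the classic form: (f_i - <f>) p_i."""
    avg = mean_fitness(f, p)
    return [(fi - avg) * pi for fi, pi in zip(f, p)]


def dp_dt_quotient_rule(f: List[float], P: List[float]) -> List[float]:
    """Quotient rule applied to p_i = P_i / sum_j P_j, with dP_i/dt = f_i P_i."""
    total = sum(P)
    dP = [fi * Pi for fi, Pi in zip(f, P)]
    d_total = sum(dP)
    return [(dPi * total - Pi * d_total) / total**2 for dPi, Pi in zip(dP, P)]


# Illustrative numbers only: f holds the fitness values at the current P.
P = [30.0, 50.0, 20.0]
f = [1.0, 0.2, -0.5]

lhs = dp_dt_quotient_rule(f, P)
rhs = dp_dt_classic(f, normalize(P))
print(lhs)
print(rhs)
```

The two lists should agree up to rounding, and the entries of each should sum to zero, since the probabilities must keep summing to 1.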
This has a nice meaning: for the fraction of organisms of the $i$th type to increase, their fitness must exceed the mean fitness. If you're trying to increase market share, what matters is not how good you are, but how much better than average you are. If everyone else is lousy, you're in luck.
Now for something a bit new. Once we've gotten a probability distribution into the game, its entropy is sure to follow:
$$ \displaystyle{ S(p) = - \sum_i p_i \, \ln(p_i) } $$
This says how 'smeared-out' the overall population is among the various different species. Alternatively, it says how much information it takes, on average, to say which species a randomly chosen organism belongs to. For example, if there are $2^N$ species, all with equal populations, the entropy $S$ works out to $N \ln 2.$ So in this case, it takes $N$ bits of information to say which species a randomly chosen organism belongs to.
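Here's a short Python sketch of this entropy, checking the $2^N$ species example. The choice $N = 3$ is arbitrary, and I'm using the usual convention that a species with $p_i = 0$ contributes nothing to the sum.

```python
import math
from typing import List


def entropy(p: List[float]) -> float:
    """Shannon entropy S(p) = -sum_i p_i ln(p_i), in nats; 0 ln 0 counts as 0."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)


N = 3  # arbitrary choice for illustration
uniform = [1.0 / 2**N] * 2**N  # 2^N species with equal populations

S = entropy(uniform)
print(S, N * math.log(2))  # these two numbers should agree
print(S / math.log(2))     # the same entropy measured in bits
```

Dividing by $\ln 2$ converts from nats to bits, which is where the $N$ bits come from.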
In biology, entropy is one of many ways people measure biodiversity. For a quick intro to some of the issues involved, try:
But we don't need to understand this stuff to see how entropy is connected to the replicator equation. Marc Harper's paper explains this in detail:
and I hope to go through quite a bit of it here. But not today! Today I just want to look at a pathetically simple, yet still interesting, example.
Suppose the fitness of each species is independent of the populations of all the species. In other words, suppose each fitness $f_i(P)$ is actually a constant, say $f_i.$ Then the replicator equation reduces to
$$ \displaystyle{ \frac{d P_i}{d t} = f_i \, P_i } $$
so it's easy to solve:
$$ P_i(t) = e^{t f_i} P_i(0)$$
You don't need a detailed calculation to see what's going to happen to the probabilities
$$ \displaystyle{ p_i(t) = \frac{P_i(t)}{\sum_j P_j(t)}} $$
The most fit species present will eventually take over! If one species, say the $i$th one, has a fitness greater than the rest, then the population of this species will eventually grow faster than all the rest, at least if its population starts out greater than zero. So as $t \to +\infty,$ we'll have
$$ p_i(t) \to 1$$
and
$$ p_j(t) \to 0 \quad \mathrm{for} \quad j \ne i$$
Thus the probability distribution $p$ will become more sharply peaked, and its entropy will eventually approach zero.
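If you do want the calculation, here's one way it could go, assuming $P_i(0) > 0$ and $f_i > f_j$ for all $j \ne i.$ Dividing the solution for species $j$ by the one for species $i$ gives

$$ \displaystyle{ \frac{p_j(t)}{p_i(t)} = \frac{P_j(t)}{P_i(t)} = e^{t (f_j - f_i)} \, \frac{P_j(0)}{P_i(0)} } $$

Since $f_j - f_i < 0,$ this ratio goes to zero as $t \to +\infty.$ The probabilities sum to 1, so

$$ \displaystyle{ p_i(t) = \frac{1}{1 + \sum_{j \ne i} p_j(t)/p_i(t)} \to 1 } $$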
With a bit more thought you can see that even if more than one species shares the maximum possible fitness, the entropy will eventually decrease, though not approach zero.
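Here's a Python sketch of the constant-fitness case, using the exact solution $P_i(t) = e^{t f_i} P_i(0)$ rather than a numerical integrator. The fitness values, initial populations and sample times are invented for illustration, and subtracting the largest fitness before exponentiating is just my own trick to avoid overflow: it rescales every population by the same factor, which cancels when we normalize.

```python
import math
from typing import List


def probabilities(f: List[float], P0: List[float], t: float) -> List[float]:
    """p_i(t) for constant fitness, from P_i(t) = exp(t f_i) P_i(0)."""
    # Largest fitness among species actually present at time 0.
    f_max = max(fi for fi, Pi in zip(f, P0) if Pi > 0.0)
    P = [math.exp(t * (fi - f_max)) * Pi for fi, Pi in zip(f, P0)]
    total = sum(P)
    return [x / total for x in P]


def entropy(p: List[float]) -> float:
    """Shannon entropy S(p) = -sum_i p_i ln(p_i), in nats; 0 ln 0 counts as 0."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)


# Made-up example: the fittest species starts out rarest.
f = [1.0, 0.5, 0.2]
P0 = [1.0, 10.0, 100.0]
for t in [0.0, 5.0, 10.0, 20.0, 40.0]:
    p = probabilities(f, P0, t)
    print(t, [round(x, 4) for x in p], round(entropy(p), 4))

# Two species tied for the maximum fitness: entropy tends to ln 2, not 0.
tied = probabilities([1.0, 1.0, 0.0], [1.0, 1.0, 1.0], 100.0)
print(entropy(tied), math.log(2))
```

In this particular example the entropy rises before it falls, because the population starts out concentrated on the least fit species. That's why the word 'eventually' matters.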
In other words, the biodiversity will eventually drop as all but the most fit species are overwhelmed. Of course, this is only true in our simple idealization. In reality, biodiversity behaves in more complex ways—in part because species interact, and in part because mutation tends to smear out the probability distribution $p_i.$ We're not looking at these effects yet. They're extremely important... in ways we can only fully understand if we start by looking at what happens when they're not present.
In still other words, the population will absorb information from its environment. This should make intuitive sense: the process of natural selection resembles 'learning'. As fitter organisms become more common and less fit ones die out, the environment puts its stamp on the probability distribution $p.$ So, this probability distribution should gain information.
While intuitively clear, this last claim also follows more rigorously from thinking of entropy as negative information. Admittedly, it's always easy to get confused by minus signs when relating entropy and information. A while back I said the entropy
$$ \displaystyle{ S(p) = - \sum_i p_i \, \ln(p_i) } $$
was the average information required to say which species a randomly chosen organism belongs to. If this entropy is going down, isn't the population losing information?
No, this is a classic sign error. It's like the concept of 'work' in physics. We can talk about the work some system does on its environment, or the work done by the environment on the system, and these are almost the same... except one is minus the other!
When you are very ignorant about some system (say, some rolled dice), your estimated probabilities $p_i$ for its various possible states are very smeared-out, so the entropy $S(p)$ is large. As you gain information, you revise your probabilities and they typically become more sharply peaked, so $S(p)$ goes down. When you know as much as you possibly can, $S(p)$ equals zero.
So, the entropy $S(p)$ is the amount of information you have left to learn: the amount of information you lack, not the amount you have. As you gain information, this goes down. There's no paradox here.
It works the same way with our population of replicators, at least in the special case where the fitness of each species is independent of its population. The probability distribution $p$ is like a 'hypothesis' assigning to each species $i$ the probability $p_i$ that it's the best at self-replicating. As some replicators die off while others prosper, they gather information about their environment, and this hypothesis gets refined. So, the entropy $S(p)$ eventually drops.
Of course, to make closer contact to reality, we need to go beyond the special case where the fitness of each species is a constant! Marc Harper does this, and I want to talk about his work someday, but first I have a few more remarks to make about the pathetically simple special case I've been focusing on. I'll save these for next time, since I've probably strained your patience already.
You can read a discussion of this article on Azimuth, and make your own comments or ask questions there!
