Information Geometry (Part 6)

January 21, 2011

Information Geometry (Part 6)

John Baez

So far, my thread on information geometry hasn't said much about information. It's time to remedy that.

I've been telling you about the Fisher information metric. In statistics this is nice a way to define a 'distance' between two probability distributions. But it also has a quantum version.

So far I've showed you how to define the Fisher information metric in three equivalent ways. I also showed that in the quantum case, the Fisher information metric is the real part of a complex-valued thing. The imaginary part is related to the uncertainty principle.

But there's yet another way to define the Fisher information metric, which really involves information.

To explain this, I need to start with the idea of 'information gain', or 'relative entropy'. And it looks like I should do a whole post on this.

So:

Suppose that $\Omega$ is a measure space — that is, a space you can do integrals over. By a probability distribution on $\Omega$, I'll mean a nonnegative function

$$p : \Omega \to \mathbb{R}$$

whose integral is 1. Here $d \omega$ is my name for the measure on $\Omega$. Physicists might call $\Omega$ the 'phase space' of some classical system, but probability theorists might call it a space of 'events'. Today I'll use the probability theorist's language. The idea here is that

$$\int_A \; p(\omega) \; d \omega $$

gives the probability that when an event happens, it'll be one in the subset $A \subseteq \Omega$. That's why we want

$$p \ge 0$$

Probabilities are supposed to be nonnegative. And that's also why we want

$$\int_\Omega \; p(\omega) \; d \omega = 1 $$

This says that the probability of some event happening is 1.

Now, suppose we have two probability distributions on $\Omega$, say $p$ and $q$. The information gain as we go from $q$ to $p$ is

$$S(p,q) = \int_\Omega \; p(\omega) \log(\frac{p(\omega)}{q(\omega)}) \; d \omega $$

We also call this the entropy of $p$ relative to $q$. It says how much information you learn if you discover that the probability distribution of an event is $p$, if before you had thought it was $q$.

I like relative entropy because it's related to the Bayesian interpretation of probability. The idea here is that you can't really 'observe' probabilities as frequencies of events, except in some unattainable limit where you repeat an experiment over and over infinitely many times. Instead, you start with some hypothesis about how likely things are: a probability distribution called the prior. Then you update this using Bayes' rule when you gain new information. The updated probability distribution — your new improved hypothesis — is called the posterior.

And if you don't do the updating right, you need a swift kick in the posterior!

So, we can think of $q$ as the prior probability distribution, and $p$ as the posterior. Then $S(p,q)$ measures the amount of information that caused you to change your views.

For example, suppose you're flipping a coin, so your set of events is just

$$\Omega = \{ \mathrm{heads}, \mathrm{tails} \}$$

In this case all the integrals are just sums with two terms. Suppose your prior assumption is that the coin is fair. Then

$$q(\mathrm{heads}) = 1/2, \; q(\mathrm{tails}) = 1/2$$

But then suppose someone you trust comes up and says "Sorry, that's a trick coin: it always comes up heads!" So you update our probability distribution and get this posterior:

$$p(\mathrm{heads}) = 1, \; p(\mathrm{tails}) = 0 $$

How much information have you gained? Or in other words, what's the relative entropy? It's this:

$$S(p,q) = \int_\Omega \; p(\omega) \log(\frac{p(\omega)}{q(\omega)}) \; d \omega = 1 \cdot \log(\frac{1}{1/2}) + 0 \cdot \log(\frac{0}{1/2}) = 1 $$

Here I'm doing the logarithm in base 2, and you're supposed to know that in this game $0 \log 0 = 0$.

So: you've learned one bit of information!

That's supposed to make perfect sense. On the other hand, the reverse scenario takes a bit more thought.

You start out feeling sure that the coin always lands heads up. Then someone you trust says "No, that's a perfectly fair coin." If you work out the amount of information you learned this time, you'll see it's infinite.

Why is that?

The reason is that something that you thought was impossible — the coin landing tails up — turned out to be possible. In this game, it counts as infinitely shocking to learn something like that, so the information gain is infinite. If you hadn't been so darn sure of yourself — if you had just believed that the coin almost always landed heads up — your information gain would be large but finite.

The Bayesian philosophy is built into the concept of information gain, because information gain depends on two things: the prior and the posterior. And that's just as it should be: you can only say how much you learned if you know what you believed beforehand!

You might say that information gain depends on three things: $p$, $q$ and the measure $d \omega$. And you'd be right! Unfortunately, the notation $S(p,q)$ is a bit misleading. Information gain really does depend on just two things, but these things are not $p$ and $q$: they're $p(\omega) d\omega$ and $q(\omega) d\omega$. These are called probability measures, and they're ultimately more important than the probability distributions $p$ and $q$.

To see this, take our information gain:

$$\int_\Omega \; p(\omega) \log(\frac{p(\omega)}{q(\omega)}) \; d \omega $$

and juggle it ever so slightly to get this:

$$\int_\Omega \; \log(\frac{p(\omega) d\omega}{q(\omega)d \omega}) \; p(\omega) d \omega $$

Clearly this depends only on $p(\omega) d\omega$ and $q(\omega) d\omega$. Indeed, it's good to work directly with these probability measures and give them short names, like

$$d\mu = p(\omega) d \omega $$ $$d\nu = q(\omega) d \omega$$

Then the formula for information gain looks more slick:

$$ \int_\Omega \; \log(\frac{d\mu}{d\nu}) \; d\mu $$

And by the way, in case you're wondering, the $d$ here doesn't actually mean much: we're just so brainwashed into wanting a $d x$ in our integrals that people often use $d \mu$ for a measure even though the simpler notation $\mu$ might be more logical. So, the function

$$ \frac{d\mu}{d\nu} $$

is really just a ratio of probability measures, but people call it a Radon-Nikodym derivative, because it looks like a derivative (and in some important examples it actually is). So, if I were talking to myself, I could have shortened this blog entry immensely by working with directly probability measures, leaving out the $d$'s, and saying:

Suppose $\mu$ and $\nu$ are probability measures; then the entropy of $\mu$ relative to $\nu$, or information gain, is $ S(\mu, \nu) = \int_\Omega \; \log(\frac{\mu}{\nu}) \; \mu. $

But I'm under the impression that people are actually reading this stuff, and that most of you are happier with functions than measures. So, I decided to start with

$$S(p,q) = \int_\Omega \; p(\omega) \log(\frac{p(\omega)}{q(\omega)}) \; d \omega $$

and then gradually work my way up to the more sophisticated way to think about relative entropy! But having gotten that off my chest, now I'll revert to the original naive way.

As a warmup for next time, let me pose a question. How much is this quantity

$$S(p,q) = \int_\Omega \; p(\omega) \log(\frac{p(\omega)}{q(\omega)}) \; d \omega $$

like a distance between probability distributions? A distance function, or metric, is supposed to satisfy some axioms. Alas, relative entropy satisfies some of these, but not the most interesting one!

• If you've got a metric, the distance between points should always be nonnegative. Indeed, this holds:

$$S(p,q) \ge 0$$

So, we never learn a negative amount when we update our prior, at least according to this definition. It's a fun exercise to prove this inequality, at least if you know some tricks involving inequalities and convex functions — otherwise it might be hard.

• If you've got a metric, the distance between two points should only be zero if they're really the same point. In fact,

$$S(p,q) = 0$$

if and only if

$$p d\omega = q d \omega $$

It's possible to have $p d\omega = q d \omega $ even if $p \ne q$, because $d \omega$ can be zero somewhere. But this is just more evidence that we should really be talking about the probability measure $p d \omega$ instead of the probability distribution $p$. If we do that, we're okay so far!

• If you've got a metric, the distance from your first point to your second point is the same as the distance from the second to the first. Alas,

$$S(p,q) \ne S(q,p)$$

in general. We already saw this in our example of the flipped coin. This is a slight bummer, but I could live with it, since Lawvere has already shown that it's wise to generalize the concept of metric by dropping this axiom.

• If you've got a metric, it obeys the triangle inequality. This is the really interesting axiom, and alas, this too fails. Later we'll see why.

So, relative entropy does a fairly miserable job of acting like a distance function. People call it a divergence. In fact, they often call it the Kullback-Leibler divergence. I don't like that, because 'the Kullback-Leibler divergence' doesn't really explain the idea: it sounds more like the title of a bad spy novel, sort of like The Eiger Sanction only worse. 'Relative entropy', on the other hand, makes a lot of sense if you understand entropy. And 'information gain' makes sense if you understand information.

Anyway: how can we save this miserable attempt to get a distance function on the space of probability distributions? Simple: take its matrix of second derivatives and use that to define a Riemannian metric $g_{ij}$. This Riemannian metric in turn defines a metric of the more elementary sort we've been discussing today.

And this Riemannian metric is the Fisher information metric I've been talking about all along!

I'll give you the details next time.

You can read a discussion of this article on Azimuth, and make your own comments or ask questions there!