Information Geometry (Part 3)

October 25, 2010

John Baez

So far in this series of posts I've been explaining a paper by Gavin Crooks. Now I want to go ahead and explain a little research of my own.

I'm not claiming my results are new—indeed I have no idea whether they are, and I'd like to hear from any experts who might know. I'm just claiming that this is some work I did last weekend.

People sometimes worry that if they explain their ideas before publishing them, someone will 'steal' them. But I think this overestimates the value of ideas, at least in esoteric fields like mathematical physics. The problem is not people stealing your ideas: the hard part is giving them away. And let's face it, people in love with math and physics will do research unless you actively stop them. I'm reminded of this scene from the Marx Brothers movie where Harpo and Chico, playing wandering musicians, walk into a hotel and offer to play:

Groucho: What do you fellows get an hour?

Chico: Oh, for playing we getta ten dollars an hour.

Groucho: I see...What do you get for not playing?

Chico: Twelve dollars an hour.

Groucho: Well, clip me off a piece of that.

Chico: Now, for rehearsing we make special rate. Thatsa fifteen dollars an hour.

Groucho: That's for rehearsing?

Chico: Thatsa for rehearsing.

Groucho: And what do you get for not rehearsing?

Chico: You couldn't afford it.

So, I'm just rehearsing in public here, but of course I hope to write a paper about this stuff someday, once I get enough material.

Remember where we were. We had considered a manifold—let's finally give it a name, say $M$—that parametrizes Gibbs states of some physical system. By Gibbs state, I mean a state that maximizes entropy subject to constraints on the expected values of some observables. And we had seen that in favorable cases, we get a Riemannian metric on $M$! It looks like this:

$$g_{ij} = \mathrm{Re} \langle (X_i - \langle X_i \rangle) (X_j - \langle X_j \rangle) \rangle $$

where $X_i$ are our observables, and the angle bracket means 'expected value'.

All this applies to both classical and quantum mechanics. Crooks wrote down a beautiful formula for this metric in the classical case. But since I'm at the Centre for Quantum Technologies, not the Centre for Classical Technologies, I redid his calculation in the quantum case. The big difference is that in quantum mechanics, observables don't commute! But in the calculations I did, that didn't seem to matter much, mainly because I took a lot of traces, which imposes a kind of commutativity:

$$ \mathrm{tr}(AB) = \mathrm{tr}(BA) $$

In fact, if I'd wanted to show off, I could have done the classical and quantum cases simultaneously by replacing all operators by elements of any von Neumann algebra equipped with a trace. Don't worry about this much: it's just a general formalism for treating classical and quantum mechanics on an equal footing. One example is the algebra of bounded operators on a Hilbert space, with the usual concept of trace. Then we're doing quantum mechanics as usual. But another example is the algebra of suitably nice functions on a suitably nice space, where taking the trace of a function means integrating it. And then we're doing classical mechanics!
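To make the 'trace imposes a kind of commutativity' remark concrete, here is a small numerical sketch. The matrices and functions are arbitrary illustrations (assumptions, not anything from Crooks' paper): two random self-adjoint matrices fail to commute, yet the traces of their two products agree, and in the classical analogue, where 'trace' means summing a function over a finite set, commutativity is automatic.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_hermitian(n):
    """Random self-adjoint n x n matrix (illustrative only)."""
    M = rng.normal(size=(n, n)) + 1j * rng.normal(size=(n, n))
    return (M + M.conj().T) / 2

A = random_hermitian(3)
B = random_hermitian(3)

# Quantum case: the operators do not commute, but tr(AB) = tr(BA).
commute = np.allclose(A @ B, B @ A)
same_trace = np.isclose(np.trace(A @ B), np.trace(B @ A))

# Classical analogue: functions on a 4-point space, with "trace" = sum over points.
f = np.array([1.0, 2.0, 3.0, 4.0])
g = np.array([0.5, -1.0, 2.0, 0.0])
classical_same = np.isclose(np.sum(f * g), np.sum(g * f))

print(commute, same_trace, classical_same)
```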

For example, I showed you how to derive a beautiful formula for the metric I wrote down a minute ago:

$$ g_{ij} = \mathrm{Re} \, \mathrm{tr}(\rho \; \frac{\partial \mathrm{ln} \rho }{\partial \lambda^i} \; \frac{\partial \mathrm{ln} \rho }{\partial \lambda^j} ) $$

But if we want to do the classical version, we can say Hey, presto! and write it down like this:

$$ g_{ij} = \int_\Omega p(\omega) \; \frac{\partial \mathrm{ln} p(\omega) }{\partial \lambda^i} \; \frac{\partial \mathrm{ln} p(\omega) }{\partial \lambda^j} \; d \omega $$

What did I do just now? I changed the trace to an integral over some space $\Omega$. I rewrote $\rho$ as $p$ to make you think 'probability distribution'. And I don't need to take the real part anymore, since everything is already real when we're doing classical mechanics. Now this metric is the Fisher information metric that statisticians know and love!
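Here is a minimal numerical sketch of the classical formula, under some assumptions of my choosing: $\Omega$ is a 5-point set (so the integral becomes a sum), the observables `X` have arbitrary made-up values, and $p$ is the classical Gibbs distribution $p = e^{-\lambda^i X_i}/Z$. The derivatives of $\ln p$ are taken by central differences, and the result is compared with the covariance matrix of the observables.

```python
import numpy as np

# Hypothetical classical system: 5 outcomes, 2 observables (values chosen arbitrarily).
X = np.array([[0.0, 1.0, 2.0, 3.0, 4.0],
              [1.0, 0.0, 1.0, 0.0, 2.0]])

def gibbs(lam):
    """Classical Gibbs distribution p = exp(-lam^i X_i) / Z on the finite set."""
    w = np.exp(-lam @ X)
    return w / w.sum()

def fisher_metric(lam, h=1e-5):
    """g_ij = sum over outcomes of p * d_i ln p * d_j ln p (central differences)."""
    p = gibbs(lam)
    n = len(lam)
    dlogp = np.empty((n, p.size))
    for i in range(n):
        e = np.zeros(n)
        e[i] = h
        dlogp[i] = (np.log(gibbs(lam + e)) - np.log(gibbs(lam - e))) / (2 * h)
    return (dlogp * p) @ dlogp.T

def covariance(lam):
    """Covariance matrix <(X_i - <X_i>)(X_j - <X_j>)> in the state gibbs(lam)."""
    p = gibbs(lam)
    mean = X @ p
    dX = X - mean[:, None]
    return (dX * p) @ dX.T

lam = np.array([0.3, -0.2])
g = fisher_metric(lam)
c = covariance(lam)
print(np.allclose(g, c, atol=1e-6))
```

The two matrices should agree up to finite-difference error, which is the classical version of the statement that the metric is the covariance matrix of the observables.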

In what follows, I'll keep talking about the quantum case, but in the back of my mind I'll be using von Neumann algebras, so everything will apply to the classical case too.

So what am I going to do? I'm going to fix a big problem with the story I've told so far.

Here's the problem: so far we've only studied a special case of the Fisher information metric. We've been assuming our states are Gibbs states, parametrized by the expectation values of some observables $X_1, \dots, X_n$. Our manifold $M$ was really just some open subset of $\mathbb{R}^n$: a point in here was a list of expectation values.

But people like to work a lot more generally. We could look at any smooth function $\rho$ from a smooth manifold $M$ to the set of density matrices for some quantum system. We can still write down the metric

$$ g_{ij} = \mathrm{Re} \, \mathrm{tr}(\rho \; \frac{\partial \mathrm{ln} \rho }{\partial \lambda^i} \; \frac{\partial \mathrm{ln} \rho }{\partial \lambda^j} ) $$

in this more general situation. Nobody can stop us! But it would be better if we could derive this formula, as before, starting from a formula like the one we had before:

$$ g_{ij} = \mathrm{Re} \langle \, (X_i - \langle X_i \rangle) \, (X_j - \langle X_j \rangle) \, \rangle $$

The challenge is that now we don't have observables $X_i$ to start with. All we have is a smooth function $\rho$ from some manifold to some set of states. How can we pull observables out of thin air?

Well, you may remember that last time we had

$$ \rho = \frac{1}{Z} e^{-\lambda^i X_i}$$

where $\lambda^i$ were some functions on our manifold and

$$ Z = \mathrm{tr}(e^{-\lambda^i X_i})$$

was the partition function. Let's copy this idea.

So, we'll start with our density matrix $\rho$, but then write it as

$$ \rho = \frac{1}{Z} e^{-A}$$

where $A$ is some self-adjoint operator and

$$ Z = \mathrm{tr} (e^{-A})$$

(Note that $A$, like $\rho$, is really an operator-valued function on $M$. So, I should write something like $A(x)$ to denote its value at a particular point $x \in M$, but I won't usually do that. As usual, I expect some intelligence on your part!)

Now we can repeat some calculations I did last time. As before, let's take the logarithm of $\rho$:

$$ \mathrm{ln} \, \rho = -A - \mathrm{ln}\, Z$$

and then differentiate it. Suppose $\lambda^i$ are local coordinates near some point of $M$. Then

$$ \frac{\partial}{\partial \lambda^i} \mathrm{ln}\, \rho = - \frac{\partial}{\partial \lambda^i} A - \frac{1}{Z} \frac{\partial }{\partial \lambda^i} Z$$

Last time we had nice formulas for both terms on the right-hand side above. To get similar formulas now, let's define operators

$$ X_i = \frac{\partial}{\partial \lambda^i} A$$

This gives a nice name to the first term on the right-hand side above. What about the second term? We can calculate it out:

$$ \frac{1}{Z} \frac{\partial }{\partial \lambda^i} Z = \frac{1}{Z} \frac{\partial }{\partial \lambda^i} \mathrm{tr}(e^{-A}) = \frac{1}{Z} \mathrm{tr}(\frac{\partial }{\partial \lambda^i} e^{-A}) = - \frac{1}{Z} \mathrm{tr}(e^{-A} \frac{\partial}{\partial \lambda^i} A)$$

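where in the last step we use the chain rule. (Since $A$ need not commute with its derivative, this step relies on the cyclic property of the trace: the naive chain rule for $e^{-A}$ is only guaranteed to work once we take the trace.) Next, use the definition of $\rho$ and $X_i$, and get: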

$$ \frac{1}{Z} \frac{\partial }{\partial \lambda^i} Z = - \mathrm{tr}(\rho X_i) = - \langle X_i \rangle$$

This is just what we got last time! Ain't it fun to calculate when it all works out so nicely?
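As a sanity check, here is a numerical sketch for a single qubit. The operator-valued function `A` below is a hypothetical example I made up (a nonlinear combination of Pauli matrices on $M = \mathbb{R}^2$), chosen so that $A$ does not commute with its own derivatives. All derivatives are central differences, and the matrix exponential is computed from an eigendecomposition.

```python
import numpy as np

SX = np.array([[0, 1], [1, 0]], dtype=complex)
SY = np.array([[0, -1j], [1j, 0]], dtype=complex)
SZ = np.array([[1, 0], [0, -1]], dtype=complex)

def A(lam):
    """Hypothetical self-adjoint operator-valued function on M = R^2."""
    x, y = lam
    return x * SX + y**2 * SZ + np.sin(x * y) * SY

def expm_herm(H):
    """exp(H) for a Hermitian matrix H, via its eigendecomposition."""
    w, V = np.linalg.eigh(H)
    return (V * np.exp(w)) @ V.conj().T

def Z(lam):
    """Partition function Z = tr(exp(-A))."""
    return np.trace(expm_herm(-A(lam))).real

def rho(lam):
    """Density matrix rho = exp(-A) / Z."""
    return expm_herm(-A(lam)) / Z(lam)

def partial(F, lam, i, h=1e-5):
    """Central-difference derivative of F along coordinate i."""
    e = np.zeros(len(lam))
    e[i] = h
    return (F(lam + e) - F(lam - e)) / (2 * h)

lam = np.array([0.4, 0.7])
checks = []
for i in range(2):
    X_i = partial(A, lam, i)                 # X_i = dA/dlambda^i
    lhs = partial(Z, lam, i) / Z(lam)        # (1/Z) dZ/dlambda^i
    rhs = -np.trace(rho(lam) @ X_i).real     # -<X_i>
    checks.append((lhs, rhs))
    print(i, lhs, rhs)
```

Each printed pair should agree up to finite-difference error, even though here $A$ and $X_i$ do not commute.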

So, putting both terms together, we see

$$ \frac{\partial}{\partial \lambda^i} \mathrm{ln} \rho = - X_i + \langle X_i \rangle $$

or better:

$$ X_i - \langle X_i \rangle = -\frac{\partial}{\partial \lambda^i} \mathrm{ln} \rho$$

This is a nice formula for the 'fluctuation' of the observables $X_i$, meaning how much they differ from their expected values. And it looks exactly like the formula we had last time! The difference is that last time we started out assuming we had a bunch of observables, $X_i$, and defined $\rho$ to be the state maximizing the entropy subject to constraints on the expectation values of all these observables. Now we're starting with $\rho$ and working backwards.

From here on out, it's easy. As before, we can define $g_{ij}$ to be the real part of the covariance matrix:

$$ g_{ij} = \mathrm{Re} \langle (X_i - \langle X_i \rangle) (X_j - \langle X_j \rangle) \rangle $$

Using the formula

$$ X_i - \langle X_i \rangle = -\frac{\partial}{\partial \lambda^i} \mathrm{ln} \rho$$

we get

$$ g_{ij} = \mathrm{Re} \langle \frac{\partial \mathrm{ln} \rho}{\partial \lambda^i} \; \frac{\partial \mathrm{ln} \rho}{\partial \lambda^j} \rangle $$

$$ g_{ij} = \mathrm{Re}\,\mathrm{tr}(\rho \; \frac{\partial \mathrm{ln} \rho}{\partial \lambda^i} \; \frac{\partial \mathrm{ln} \rho}{\partial \lambda^j}) $$

Voilà!
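The two expressions for $g_{ij}$ can also be compared numerically. This is only a sketch under assumptions of my own: the same kind of made-up qubit example as one might use to test the partition function formula (a hypothetical `A` built from Pauli matrices), finite-difference derivatives, and a matrix logarithm computed from the eigendecomposition of $\rho$.

```python
import numpy as np

SX = np.array([[0, 1], [1, 0]], dtype=complex)
SY = np.array([[0, -1j], [1j, 0]], dtype=complex)
SZ = np.array([[1, 0], [0, -1]], dtype=complex)

def A(lam):
    """Hypothetical self-adjoint operator-valued function on M = R^2."""
    x, y = lam
    return x * SX + y**2 * SZ + np.sin(x * y) * SY

def herm_fun(H, f):
    """Apply a real function f to a Hermitian matrix H via its eigenvalues."""
    w, V = np.linalg.eigh(H)
    return (V * f(w)) @ V.conj().T

def rho(lam):
    """Density matrix rho = exp(-A) / tr(exp(-A))."""
    E = herm_fun(-A(lam), np.exp)
    return E / np.trace(E).real

def log_rho(lam):
    return herm_fun(rho(lam), np.log)

def partial(F, lam, i, h=1e-5):
    """Central-difference derivative of F along coordinate i."""
    e = np.zeros(len(lam))
    e[i] = h
    return (F(lam + e) - F(lam - e)) / (2 * h)

def metric_from_log_rho(lam):
    """g_ij = Re tr(rho d_i ln rho d_j ln rho)."""
    r = rho(lam)
    d = [partial(log_rho, lam, i) for i in range(len(lam))]
    return np.array([[np.trace(r @ a @ b).real for b in d] for a in d])

def metric_from_covariance(lam):
    """g_ij = Re <(X_i - <X_i>)(X_j - <X_j>)> with X_i = dA/dlambda^i."""
    r = rho(lam)
    identity = np.eye(r.shape[0])
    dX = []
    for i in range(len(lam)):
        X_i = partial(A, lam, i)
        dX.append(X_i - np.trace(r @ X_i).real * identity)
    return np.array([[np.trace(r @ a @ b).real for b in dX] for a in dX])

lam = np.array([0.4, 0.7])
g1 = metric_from_log_rho(lam)
g2 = metric_from_covariance(lam)
print(np.allclose(g1, g2, atol=1e-6))
```

Taking the real part is what makes the matrix symmetric: $\mathrm{tr}(\rho \, a \, b)$ and $\mathrm{tr}(\rho \, b \, a)$ are complex conjugates of each other when $\rho$, $a$ and $b$ are self-adjoint.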

When this matrix is positive definite at every point, we get a Riemannian metric on $M$. Last time I said this is what people call the 'Bures metric', though frankly, now that I examine the formulas, I'm not so sure. But in the classical case, it's called the Fisher information metric.

Differential geometers like to use $\partial_i$ as a shorthand for $\frac{\partial}{\partial \lambda^i}$, so they'd write down our metric in a prettier way:

$$ g_{ij} = \mathrm{Re} \, \mathrm{tr}(\rho \; \partial_i (\mathrm{ln} \, \rho) \; \partial_j (\mathrm{ln} \, \rho) )$$

Differential geometers like coordinate-free formulas, so let's also give a coordinate-free formula for our metric. Suppose $x \in M$ is a point in our manifold, and suppose $v,w$ are tangent vectors at this point. Then

$$ g(v,w) = \mathrm{Re} \, \langle v(\mathrm{ln}\, \rho) \; w(\mathrm{ln} \,\rho) \rangle \; = \; \mathrm{Re} \,\mathrm{tr}(\rho \; v(\mathrm{ln}\, \rho) \; w(\mathrm{ln}\, \rho)) $$

Here $\mathrm{ln}\, \rho$ is a smooth operator-valued function on $M$, and $v(\mathrm{ln}\, \rho)$ means the derivative of this function in the $v$ direction at the point $x$.

So, this is all very nice. To conclude, two more points: a technical one, and a more important philosophical one.

First, the technical point. When I said $\rho$ could be any smooth function from a smooth manifold to some set of states, I was actually lying. That's an important pedagogical technique: the brazen lie.

We can't really take the logarithm of every density matrix. Remember, we take the log of a density matrix by taking the log of all its eigenvalues. These eigenvalues are ≥ 0, but if one of them is zero, we're in trouble! The logarithm of zero is undefined.

On the other hand, there's no problem taking the logarithm of our density-matrix-valued function $\rho$ when it's positive definite at each point of $M$. You see, a density matrix is positive definite iff its eigenvalues are all > 0. In this case it has a unique self-adjoint logarithm.

So, we must assume $\rho$ is positive definite. But what's the physical significance of this 'positive definiteness' condition? Well, any density matrix can be diagonalized using some orthonormal basis. It can then be seen as a probabilistic mixture (not a quantum superposition!) of pure states taken from this basis. Its eigenvalues are the probabilities of finding the mixed state to be in one of these pure states. So, saying that its eigenvalues are all > 0 amounts to saying that all the pure states in this orthonormal basis show up with nonzero probability! Intuitively, this means our mixed state is 'really mixed'. For example, it can't be a pure state. In math jargon, it means our mixed state is in the interior of the convex set of mixed states.
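A short sketch of this condition in code, with two made-up $2 \times 2$ density matrices: one 'really mixed', one pure. The helper names `is_positive_definite` and `log_density_matrix` are hypothetical, and the tolerance is an arbitrary choice.

```python
import numpy as np

def is_positive_definite(rho, tol=1e-12):
    """True if every eigenvalue of the Hermitian matrix rho exceeds tol."""
    return bool(np.all(np.linalg.eigvalsh(rho) > tol))

def log_density_matrix(rho):
    """Self-adjoint logarithm: take the log of each eigenvalue of rho."""
    if not is_positive_definite(rho):
        raise ValueError("rho has a zero eigenvalue; ln(rho) is undefined")
    w, V = np.linalg.eigh(rho)
    return (V * np.log(w)) @ V.conj().T

# A 'really mixed' state: both eigenvalues are strictly positive.
mixed = np.array([[0.7, 0.1],
                  [0.1, 0.3]])

# A pure state: one eigenvalue is zero, so it sits on the boundary.
pure = np.array([[1.0, 0.0],
                 [0.0, 0.0]])

print(is_positive_definite(mixed))   # interior of the convex set of mixed states
print(is_positive_definite(pure))    # boundary: the logarithm is not available
log_mixed = log_density_matrix(mixed)
```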

Second, the philosophical point. Instead of starting with the density matrix $\rho$, I took $A$ as fundamental. But different choices of $A$ give the same $\rho$. After all,

$$ \rho = \frac{1}{Z} e^{-A}$$

where we cleverly divide by the normalization factor

$$ Z = \mathrm{tr} (e^{-A})$$

to get $\mathrm{tr} \, \rho = 1$. So, if we multiply $e^{-A}$ by any positive constant, or indeed any positive function on our manifold $M$, $\rho$ will remain unchanged!

So we have added a little extra information when switching from $\rho$ to $A$. You can think of this as 'gauge freedom', because I'm saying we can do any transformation like

$$ A \mapsto A + f $$

where

$$ f: M \to \mathbb{R}$$

is a smooth function. This doesn't change $\rho$, so arguably it doesn't change the 'physics' of what I'm doing. It does change $Z$. It also changes the observables

$$ X_i = \frac{\partial}{\partial \lambda^i} A$$

But it doesn't change their 'fluctuations'

$$ X_i - \langle X_i \rangle$$

so it doesn't change the metric $g_{ij}$.
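To spell these claims out, here is a short worked check. The primed symbols are notation introduced just for this check, and $f$ is understood as $f$ times the identity operator, so it commutes with everything. Write $A' = A + f$. Then

$$ e^{-A'} = e^{-f} e^{-A}, \qquad Z' = \mathrm{tr}(e^{-A'}) = e^{-f} Z, \qquad \rho' = \frac{1}{Z'} e^{-A'} = \frac{1}{Z} e^{-A} = \rho $$

The observables pick up a multiple of the identity:

$$ X_i' = \frac{\partial}{\partial \lambda^i} (A + f) = X_i + \frac{\partial f}{\partial \lambda^i} $$

and since $\mathrm{tr}\, \rho = 1$, their expected values shift by the same amount:

$$ \langle X_i' \rangle = \mathrm{tr}(\rho X_i') = \langle X_i \rangle + \frac{\partial f}{\partial \lambda^i} $$

Subtracting, the shift cancels:

$$ X_i' - \langle X_i' \rangle = X_i - \langle X_i \rangle $$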

This gauge freedom is interesting, and I want to understand it better. It's related to something very simple yet mysterious. In statistical mechanics the partition function $Z$ begins life as 'just a normalizing factor'. If you change the physics so that $Z$ gets multiplied by some number, the Gibbs state doesn't change. But then the partition function takes on an incredibly significant role as something whose logarithm you differentiate to get lots of physically interesting information! So in some sense the partition function doesn't matter much... but changes in the partition function matter a lot.

This is just like the split personality of phases in quantum mechanics. On the one hand they 'don't matter': you can multiply a unit vector by any phase and the pure state it defines doesn't change. But on the other hand, changes in phase matter a lot.

Indeed the analogy here is quite deep: it's the analogy between probabilities in statistical mechanics and amplitudes in quantum mechanics, the analogy between $\mathrm{exp}(-\beta H)$ in statistical mechanics and $\mathrm{exp}(-i t H / \hbar)$ in quantum mechanics, and so on. This is part of a bigger story about 'rigs' which I told back in the Winter 2007 quantum gravity seminar, especially in week13. So, it's fun to see it showing up yet again... even though I don't completely understand it here.

You can read a discussion of this article on Azimuth, and make your own comments or ask questions there!