August 14, 2021

Information Geometry (Part 20)

John Baez

Last time we worked out an analogy between classical mechanics, thermodynamics and probability theory. The latter two look suspiciously similar:

      Classical Mechanics   Thermodynamics        Probability Theory
 q    position              extensive variables   probabilities
 p    momentum              intensive variables   surprisals
 S    action                entropy               Shannon entropy

This is no coincidence. After all, in the subject of statistical mechanics we explain classical thermodynamics using probability theory — and entropy is revealed to be Shannon entropy (or its quantum analogue).

Now I want to make this precise.

To connect classical thermodynamics to probability theory, I'll start by discussing 'statistical manifolds'. I introduced the idea of a statistical manifold in Part 7: it's a manifold \(Q\) equipped with a map sending each point \(q \in Q\) to a probability distribution \(\pi_q\). Now I'll say how these fit into the second column of the above chart.

Then I'll talk about statistical manifolds of a special sort used in thermodynamics, which I'll call 'Gibbsian', since they really go back to Josiah Willard Gibbs.

In a Gibbsian statistical manifold, for each \(q \in Q\) the probability distribution \(\pi_q\) is a 'Gibbs distribution'. Physically, these Gibbs distributions describe thermodynamic equilibria. For example, if you specify the volume, energy and number of particles in a box of gas, there will be a Gibbs distribution describing what the particles do in thermodynamic equilibrium under these conditions. Mathematically, Gibbs distributions maximize entropy subject to some constraints specified by the point \(q \in Q\).

More precisely: in a Gibbsian statistical manifold we have a list of observables \(A_1, \dots , A_n\) whose expected values serve as coordinates \(q_1, \dots, q_n\) for points \(q \in Q\), and \(\pi_q\) is the probability distribution that maximizes entropy subject to the constraint that the expected value of \(A_i\) is \(q_i\). We can derive most of the interesting formulas of thermodynamics starting from this!

Statistical manifolds

Let's fix a measure space \(\Omega\) with measure \(\mu\). A statistical manifold is then a manifold \(Q\) equipped with a smooth map \(\pi\) assigning to each point \(q \in Q\) a probability distribution on \(\Omega\), which I'll call \(\pi_q\). So, \(\pi_q\) is a function on \(\Omega\) with

$$ \displaystyle{ \int_\Omega \pi_q \, d\mu = 1 }$$

and

$$ \pi_q(x) \ge 0$$

for all \(x \in \Omega\).

The idea here is that the space of all probability distributions on \(\Omega\) may be too huge to understand in as much detail as we'd like, so instead we describe some of these probability distributions — a family parametrized by points of some manifold \(Q\) — using the map \(\pi\). This is the basic idea behind parametric statistics.

Information geometry is the geometry of statistical manifolds. Any statistical manifold comes with a bunch of interesting geometrical structures. One is the 'Fisher information metric', a Riemannian metric I explained in Part 7. Another is a 1-parameter family of connections on the tangent bundle \(T Q\), which is important in Amari's approach to information geometry.

I don't want to talk about it now — I just wanted to reassure you that I'm not completely ignorant of it!

I want to focus on the story I've been telling, which is about entropy. Our statistical manifold \(Q\) comes with a smooth entropy function

$$ f \colon Q \to \mathbb{R}$$

namely

$$ \displaystyle{ f(q) = -\int_\Omega \pi_q(x) \, \ln \pi_q(x) \, d\mu(x) } $$
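To have something concrete to point at later, here is a tiny example of my own (not from the original post): take \(\Omega = \{0, 1\}\) with counting measure, let \(Q\) be the open interval \((0,1)\), and let \(\pi_q\) be the Bernoulli distribution with mean \(q\). A minimal Python sketch of this statistical manifold and its entropy function:

    import numpy as np

    # Toy statistical manifold (an illustration, not Baez's notation):
    # Omega = {0, 1} with counting measure, Q = (0, 1), and pi_q the
    # Bernoulli distribution with mean q.

    def pi(q):
        """The probability distribution pi_q on Omega = {0, 1}."""
        return np.array([1.0 - q, q])

    def f(q):
        """The entropy f(q) = -sum_x pi_q(x) ln pi_q(x)."""
        probs = pi(q)
        return -np.sum(probs * np.log(probs))

    for q in (0.1, 0.5, 0.9):
        print(f"q = {q:.1f}   f(q) = {f(q):.4f}")

Here \(f(q) = -q \ln q - (1-q) \ln (1-q)\), the familiar entropy of a biased coin.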

We can use this entropy function to do many of the things we usually do in thermodynamics! For example, at any point \(q \in Q\) where this function is differentiable, its differential gives a cotangent vector

$$ p = (df)_q $$

which has an important physical meaning. In coordinates we have

$$ \displaystyle{ p_i = \frac{\partial f}{\partial q_i} } $$

and we call \(p_i\) the intensive variable conjugate to \(q_i\). For example if \(q_i\) is energy, \(p_i\) will be 'coolness': the reciprocal of temperature.
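In the Bernoulli toy example above, for instance, the intensive variable conjugate to \(q\) is

$$ \displaystyle{ p = \frac{df}{dq} = \ln \left( \frac{1-q}{q} \right) } $$

which is positive for \(q < 1/2\) and negative for \(q > 1/2\).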

Defining \(p\) this way gives a Lagrangian submanifold

$$ \Lambda = \{ (q,p) \in T^\ast Q : \; p = (df)_q \} $$

of the cotangent bundle \(T^\ast Q\). We can also get contact geometry into the game by defining a contact manifold \(T^\ast Q \times \mathbb{R}\) and a Legendrian submanifold

$$ \Sigma = \{ (q,p,S) \in T^\ast Q \times \mathbb{R} : \; p = (df)_q , \; S = f(q) \}$$
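In the Bernoulli toy example these work out to

$$ \Lambda = \left\{ \left( q, \ln \tfrac{1-q}{q} \right) : 0 < q < 1 \right\}, \qquad \Sigma = \left\{ \left( q, \ln \tfrac{1-q}{q}, f(q) \right) : 0 < q < 1 \right\} $$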

But I've been talking about these ideas for the last three episodes, so I won't say more just now! Instead, I want to throw a new idea into the pot.

Gibbsian statistical manifolds

Thermodynamics, and statistical mechanics, spend a lot of time dealing with statistical manifolds of a special sort I'll call 'Gibbsian'. In these, each probability distribution \(\pi_q\) is a 'Gibbs distribution', meaning that it maximizes entropy subject to certain constraints specified by the point \(q \in Q\).

How does this work? For starters, an integrable function

$$ A \colon \Omega \to \mathbb{R}$$

is called a random variable, or in physics perhaps an observable. The expected value of an observable is a smooth real-valued function on our statistical manifold

$$ \langle A \rangle \colon Q \to \mathbb{R} $$

given by

$$ \displaystyle{ \langle A \rangle(q) = \int_\Omega A(x) \pi_q(x) \, d\mu(x) } $$

In other words, \(\langle A \rangle\) is a function whose value at any point \(q \in Q\) is the expected value of \(A\) with respect to the probability distribution \(\pi_q\).

Now, suppose our statistical manifold is \(n\)-dimensional and we have \(n\) observables \(A_1, \dots, A_n\). Their expected values will be smooth functions on our manifold — and sometimes these functions will be a coordinate system!

This may sound rather unlikely, but it's really not so outlandish. Indeed, if there's a point \(q\) such that the differentials of the functions \(\langle A_i \rangle\) are linearly independent at this point, these functions will be a coordinate system in some neighborhood of this point, by the inverse function theorem. So, we can take this neighborhood, use it as our statistical manifold, and the functions \(\langle A_i \rangle\) will be coordinates.

So, let's assume the expected values of our observables give a coordinate system on \(Q\). Let's call these coordinates \(q_1, \dots, q_n\), so that

$$ \langle A_i \rangle(q) = q_i $$

Now for the kicker: we say our statistical manifold is Gibbsian if for each point \(q \in Q\), \(\pi_q\) is the probability distribution that maximizes entropy subject to the above condition!

Which condition? The condition saying that

$$ \displaystyle{ \int_\Omega A_i(x) \pi_q(x) \, d\mu(x) = q_i } $$

for all \(i\). This is just the previous equation spelled out so that you can see it's a condition on \(\pi_q\).

This assumption of the entropy-maximizing nature of \(\pi_q\) is very powerful, because it implies a useful and nontrivial formula for \(\pi_q\). It's called the Gibbs distribution:

$$ \displaystyle{ \pi_q(x) = \frac{1}{Z(q)} \exp\left(-\sum_{i = 1}^n p_i A_i(x)\right) }$$

for all \(x \in \Omega\).

Here \(p_i\) is the intensive variable conjugate to \(q_i\), while \(Z(q)\) is the partition function: the thing we must divide by to make sure \(\pi_q\) integrates to 1. In other words:

$$ \displaystyle{ Z(q) = \int_\Omega \exp\left(-\sum_{i = 1}^n p_i A_i(x) \right) \, d\mu(x) } $$
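To see this formula in action, here is a minimal numerical sketch of my own (not from the post), using NumPy and SciPy on a finite \(\Omega\). It maximizes entropy directly, subject to the constraint \(\langle A \rangle = q\), and checks that the result agrees with the Gibbs formula above. To find the conjugate variable \(p\) it uses a standard convex-duality trick, matching the constraint directly rather than differentiating the entropy; all the specific numbers and names are my own choices.

    import numpy as np
    from scipy.optimize import minimize

    # Toy check of the Gibbs distribution formula on a finite set:
    # Omega = {0, ..., 5} with counting measure, one observable A(x) = x,
    # and the constraint <A> = q.

    A = np.arange(6, dtype=float)   # the observable A(x) = x
    q = 2.0                         # the coordinate of our point: we want <A> = 2

    # 1) Maximize entropy directly over the probability simplex,
    #    subject to  sum(pi) = 1  and  <A> = q.
    def neg_entropy(pi):
        pi = np.clip(pi, 1e-12, None)
        return np.sum(pi * np.log(pi))

    constraints = [{'type': 'eq', 'fun': lambda pi: pi.sum() - 1.0},
                   {'type': 'eq', 'fun': lambda pi: pi @ A - q}]
    direct = minimize(neg_entropy, x0=np.full(6, 1/6), bounds=[(0, 1)] * 6,
                      constraints=constraints, method='SLSQP').x

    # 2) Use the Gibbs formula pi_q(x) = exp(-p A(x)) / Z(q), finding the
    #    conjugate variable p by minimizing the convex function ln Z(p) + p q,
    #    whose critical point is exactly where <A> = q.
    def dual(p):
        return np.log(np.sum(np.exp(-p[0] * A))) + p[0] * q

    p_star = minimize(dual, x0=[0.0]).x[0]
    gibbs = np.exp(-p_star * A)
    gibbs /= gibbs.sum()            # divide by Z(q) so the result sums to 1

    print("entropy maximizer :", np.round(direct, 4))
    print("Gibbs distribution:", np.round(gibbs, 4))
    print("conjugate variable p =", round(p_star, 4))
    print("check <A> =", round(gibbs @ A, 4))

Both computations give the same distribution, which is the whole point: the constrained entropy maximizer really does have the exponential form above.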

By the way, this formula may look confusing at first, since the left side depends on the point \(q\) in our statistical manifold, while there's no \(q\) visible on the right side! Do you see what's going on?

I'll tell you: the conjugate variable \(p_i\), sitting on the right side of the above formula, depends on \(q\). Remember, we got it by taking the partial derivative of the entropy in the \(q_i\) direction

$$ \displaystyle{ p_i = \frac{\partial f}{\partial q_i} } $$

and then evaluating this derivative at the point \(q\).

But wait a minute! \(f\) here is the entropy — but the entropy of what?

The entropy of \(\pi_q\), of course!

So there's something circular about our formula for \(\pi_q\). To know \(\pi_q\), you need to know the conjugate variables \(p_i\), but to compute these you need to know the entropy of \(\pi_q\).

This is actually okay. While circular, the formula for \(\pi_q\) is still true. It's harder to work with than you might hope. But it's still extremely useful.
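To see how this plays out in the Bernoulli toy example: there \(f(q) = -q \ln q - (1-q) \ln (1-q)\), so \(p = \ln \frac{1-q}{q}\), and the Gibbs formula gives

$$ \displaystyle{ Z(q) = e^{-p \cdot 0} + e^{-p \cdot 1} = 1 + \frac{q}{1-q} = \frac{1}{1-q} } $$

and therefore

$$ \displaystyle{ \pi_q(0) = \frac{1}{Z(q)} = 1 - q, \qquad \pi_q(1) = \frac{e^{-p}}{Z(q)} = q } $$

which is exactly the Bernoulli distribution we started with. The circle closes consistently.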

Next time I'll prove that this formula for \(\pi_q\) is true, and do a few things with it. All this material was discovered by Josiah Willard Gibbs in the late 1800s, and it's lurking in any good book on statistical mechanics — but not phrased in the language of statistical manifolds. The physics textbooks usually consider special cases, like a box of gas.

While these special cases are important and interesting, I'd rather be general!

Technical comments

I said "Any statistical manifold comes with a bunch of interesting geometrical structures", but in fact some conditions are required. For example, the Fisher information metric is only well-defined and nondegenerate under some conditions on the map \(\pi\). For example, if \(\pi\) maps every point of \(Q\) to the same probability distribution, the Fisher information metric will vanish.

Similarly, the entropy function \(f\) is only smooth under some conditions on \(\pi\).

Furthermore, the integral

$$ \displaystyle{ \int_\Omega \exp\left(-\sum_{i = 1}^n p_i A_i(x) \right) \, d\mu(x) } $$

may not converge for all values of the numbers \(p_1, \dots, p_n\). But in my discussion of Gibbsian statistical manifolds, I was assuming that an entropy-maximizing probability distribution \(\pi_q\) with

$$ \displaystyle{ \int_\Omega A_i(x) \pi_q(x) \, d\mu(x) = q_i } $$

actually exists. In this case the probability distribution is also unique (almost everywhere).
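As a concrete illustration of the convergence issue (mine, not from the post): if \(\Omega = [0, \infty)\) with Lebesgue measure and the single observable is \(A(x) = x\), then

$$ \displaystyle{ \int_0^\infty e^{-p x} \, dx } $$

converges only when \(p > 0\), so only part of the space of conjugate variables actually corresponds to Gibbs distributions.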


You can read a discussion of this article on Azimuth, and make your own comments or ask questions there!


© 2021 John Baez
baez@math.removethis.ucr.andthis.edu