August 7, 2021

Information Geometry (Part 19)

John Baez

Last time I figured out the analogue of momentum in probability theory, but I didn't say what it's called. Now I will tell you — thanks to some help from Abel Jansma and Toby Bartels.

SURPRISE: it's called SURPRISAL!

This is a well-known concept in information theory. It's also called 'information content'.

Let's see why. First, let's remember the setup. We have a manifold

$$ \displaystyle{ Q = \{ q \in \mathbb{R}^n : \; q_i > 0, \; \sum_{i=1}^n q_i = 1 \} } $$

whose points \(q\) are nowhere vanishing probability distributions on the set \( \{1, \dots, n\}\). We have a function

$$ f \colon Q \to \mathbb{R} $$

called the Shannon entropy, defined by

$$ \displaystyle{ f(q) = - \sum_{j = 1}^n q_j \ln q_j } $$
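
For concreteness, here is a minimal Python sketch that evaluates this entropy in nats; the function name shannon_entropy is just illustrative, not from any particular library:

```python
import math

def shannon_entropy(q):
    """Shannon entropy f(q) = -sum_j q_j ln q_j, in nats.
    Assumes q is a list of strictly positive probabilities summing to 1."""
    return -sum(q_j * math.log(q_j) for q_j in q)

# Example: the uniform distribution on {1, ..., 4} has entropy ln 4.
q = [0.25, 0.25, 0.25, 0.25]
print(shannon_entropy(q), math.log(4))   # both are about 1.386
```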

For each point \(q \in Q\) we define a cotangent vector \( p \in T^\ast_q Q\) by

$$ p = (df)_q $$

As mentioned last time, this is the analogue of momentum in probability theory. In the second half of this post I'll say more about exactly why. But first let's compute it and see what it actually equals!

Let's start with a naive calculation, acting as if the probabilities \( q_1, \dots, q_n\) were a coordinate system on the manifold \( Q\). We get

$$ \displaystyle{ p_i = \frac{\partial f}{\partial q_i} }$$

so using the definition of the Shannon entropy we have

$$ \begin{array}{ccl} p_i &=& \displaystyle{ -\frac{\partial}{\partial q_i} \sum_{j = 1}^n q_j \ln q_j }\\ \\ &=& \displaystyle{ -\frac{\partial}{\partial q_i} \left( q_i \ln q_i \right) } \\ \\ &=& -\ln(q_i) - 1 \end{array} $$

Now, the quantity \( -\ln q_i\) is called the surprisal of the probability distribution at \( i\). Intuitively, it's a measure of how surprised you should be if an event of probability \( q_i\) occurs. For example, if you flip a fair coin and it lands heads up, your surprisal is ln 2. If you flip 100 fair coins and they all land heads up, your surprisal is 100 times ln 2.
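
Here is a quick numerical sketch of the coin example, using the additivity of surprisal over independent events; the helper name surprisal is just illustrative:

```python
import math

def surprisal(q):
    """Surprisal -ln q of an event with probability q, in nats."""
    return -math.log(q)

print(surprisal(0.5))           # one fair coin landing heads: ln 2 ≈ 0.693
print(surprisal(0.5 ** 100))    # 100 fair coins all landing heads
print(100 * math.log(2))        # agrees: 100 ln 2 ≈ 69.3
```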

Of course 'surprise' is a psychological term, not a term from math or physics, so we shouldn't take it too seriously here. We can derive the concept of surprisal from three axioms:

  1. The surprisal of an event of probability \( q\) is some function of \(q\), say \(F(q)\).
  2. The less probable an event is, the larger its surprisal is: \( q_1 \le q_2 \implies F(q_1) \ge F(q_2)\).
  3. The surprisal of two independent events is the sum of their surprisals: \( F(q_1 q_2) = F(q_1) + F(q_2)\).

It follows from work on Cauchy's functional equation that \( F\) must be of this form:

$$ F(q) = - \log_b q $$

for some constant \( b > 1\). We shall choose \( b\), the base of our logarithms, to be \( e\). We had a similar freedom of choice in defining the Shannon entropy, and we will use base \( e\) for both to be consistent. If we chose something else, it would change the surprisal and the Shannon entropy by the same constant factor.
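
A tiny numerical sketch of that last point: changing the base only rescales the surprisal by a constant factor, since \( \log_b q = \ln q / \ln b \).

```python
import math

q = 0.3
nats = -math.log(q)               # surprisal in base e
bits = -math.log2(q)              # surprisal in base 2
print(bits, nats / math.log(2))   # equal: changing base rescales by 1/ln 2
```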

So far, so good. But what about the irksome "-1" in our formula?

$$ p_i = -\ln(q_i) - 1 $$

Luckily it turns out we can just get rid of this! The reason is that the probabilities \( q_i\) are not really coordinates on the manifold \( Q\). They're not independent: they must sum to 1. So, when we change them a little, the sum of their changes must vanish. Putting it more technically, the tangent space \( T_q Q\) is not all of \( \mathbb{R}^n\), but just the subspace consisting of vectors whose components sum to zero:

$$ \displaystyle{ T_q Q = \{ v \in \mathbb{R}^n : \; \sum_{j = 1}^n v_j = 0 \} }$$
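
To see this concretely: any smooth curve that stays in \( Q\) has probabilities summing to 1 at all times, so its velocity has components summing to zero. Here is a small numerical sketch of that fact; the particular curve is just an illustrative choice:

```python
import numpy as np

# A curve q(t) in Q: perturb a base point along a direction and renormalize.
q0 = np.array([0.5, 0.3, 0.2])
direction = np.array([1.0, -2.0, 0.5])

def curve(t):
    unnormalized = q0 * np.exp(t * direction)   # stays positive
    return unnormalized / unnormalized.sum()    # stays on the simplex

# Finite-difference velocity at t = 0: a tangent vector at q0.
h = 1e-6
v = (curve(h) - curve(-h)) / (2 * h)
print(v.sum())   # ≈ 0, as every tangent vector to Q must satisfy
```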

The cotangent space is the dual of the tangent space. The dual of a subspace

$$ S \subseteq V$$

is the quotient space

$$ V^\ast/\{ \ell \colon V \to \mathbb{R} : \; \forall v \in S \; \, \ell(v) = 0 \} $$

The cotangent space \( T_q^\ast Q\) thus consists of linear functionals \( \ell \colon \mathbb{R}^n \to \mathbb{R}\) modulo those that vanish on vectors \( v\) obeying the equation

$$ \displaystyle{ \sum_{j = 1}^n v_j = 0 } $$

Of course, we can identify the dual of \( \mathbb{R}^n\) with \( \mathbb{R}^n\) in the usual way, using the Euclidean inner product: a vector \( u \in \mathbb{R}^n\) corresponds to the linear functional

$$ \displaystyle{ \ell(v) = \sum_{j = 1}^n u_j v_j } $$

From this, you can see that a linear functional \( \ell\) vanishes on all vectors \( v\) obeying the equation

$$ \displaystyle{ \sum_{j = 1}^n v_j = 0 } $$

if and only if its corresponding vector \( u\) has

$$ u_1 = \cdots = u_n $$

So, we get

$$ T^\ast_q Q \cong \mathbb{R}^n/\{ u \in \mathbb{R}^n : \; u_1 = \cdots = u_n \}$$

In words: we can describe cotangent vectors to \( Q\) as lists of n numbers if we want, but we have to remember that adding the same constant to each number in the list doesn't change the cotangent vector!
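
In coordinates, this says that two lists differing by a constant pair the same way with every tangent vector, since the constant multiplies \( \sum_j v_j = 0\). A quick sketch:

```python
import numpy as np

v = np.array([0.2, -0.5, 0.3])    # a tangent vector: components sum to 0
u = np.array([1.7, -0.4, 2.2])    # one representative of a cotangent vector
u_shifted = u + 5.0               # another representative: add a constant

print(np.dot(u, v), np.dot(u_shifted, v))   # the same pairing ℓ(v)
```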

This suggests that our naive formula

$$ p_i = -\ln(q_i) - 1 $$

is on the right track, but we're free to get rid of the constant 1 if we want! And that's true.

To check this rigorously, we need to show

$$ \displaystyle{ p(v) = -\sum_{j=1}^n \ln(q_j) \, v_j} $$

for all \( v \in T_q Q\). We compute:

$$ \begin{array}{ccl} p(v) &=& df(v) \\ \\ &=& v(f) \\ \\ &=& \displaystyle{ \sum_{j=1}^n v_j \, \frac{\partial f}{\partial q_j} } \\ \\ &=& \displaystyle{ \sum_{j=1}^n v_j (-\ln(q_j) - 1) } \\ \\ &=& \displaystyle{ -\sum_{j=1}^n \ln(q_j) \, v_j } \end{array} $$

where in the second to last step we used our earlier calculation:

$$ \displaystyle{ \frac{\partial f}{\partial q_i} = -\frac{\partial}{\partial q_i} \sum_{j = 1}^n q_j \ln q_j = -\ln(q_i) - 1 } $$

and in the last step we used

$$ \displaystyle{ \sum_j v_j = 0 } $$
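
As a numerical sanity check on this computation, one can compare a finite-difference directional derivative of the entropy with the formula \( -\sum_j \ln(q_j) \, v_j\); the "−1" drops out because \( \sum_j v_j = 0\). A sketch:

```python
import numpy as np

def entropy(q):
    return -np.sum(q * np.log(q))

q = np.array([0.5, 0.3, 0.2])
v = np.array([0.1, -0.15, 0.05])      # tangent vector: components sum to 0

h = 1e-6
numerical = (entropy(q + h * v) - entropy(q - h * v)) / (2 * h)  # df(v)
formula   = -np.sum(np.log(q) * v)                               # -sum ln(q_j) v_j

print(numerical, formula)   # agree to several decimal places
```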

Back to the big picture

Now let's take stock of where we are. We can fill in the question marks in the charts from last time, and combine those charts while we're at it.

        Classical Mechanics    Thermodynamics         Probability Theory
 q      position               extensive variables    probabilities
 p      momentum               intensive variables    surprisals
 S      action                 entropy                Shannon entropy

What's going on here? In classical mechanics, action is minimized (or at least the system finds a critical point of the action). In thermodynamics, entropy is maximized. In the maximum entropy approach to probability, Shannon entropy is maximized. This leads to a mathematical analogy that's quite precise. For classical mechanics and thermodynamics, I explained it in some earlier posts.

Those posts may give a more approachable introduction to what I'm doing now, since here I'm bringing probability theory into the analogy, with a big emphasis on symplectic and contact geometry.

Let me spell out a bit of the analogy more carefully:

In all three cases, \( T^\ast Q\) is a symplectic manifold and imposing the constraint \( p = (df)_q\) picks out a Lagrangian submanifold

$$ \Lambda = \{ (q,p) \in T^\ast Q: \; p = (df)_q \} $$

There is also a contact manifold \( T^\ast Q \times \mathbb{R}\), where the extra dimension comes with an extra coordinate \( S\). In the three columns of the table, \( S\) means the action, the entropy, and the Shannon entropy, respectively.

We can then decree that \( S = f(q)\) along with \( p = (df)_q\), and these constraints pick out a Legendrian submanifold

$$ \Sigma = \{(q,p,S) \in T^\ast Q \times \mathbb{R} : \; S = f(q), \; p = (df)_q \} $$
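
In the probability theory column, a point of \( \Sigma\) consists of a distribution \( q\), its surprisals (one representative of the cotangent vector \( p\)), and its Shannon entropy \( S\). Here is a minimal sketch building such a point; the function name point_of_sigma is just illustrative:

```python
import numpy as np

def point_of_sigma(q):
    """Given q in Q, return (q, p, S) with p = (df)_q and S = f(q).
    The components of p are the surprisals -ln q_i; remember that adding
    the same constant to all of them gives the same cotangent vector."""
    q = np.asarray(q, dtype=float)
    p = -np.log(q)        # surprisals: one representative of (df)_q
    S = np.sum(q * p)     # Shannon entropy f(q) = sum_i q_i (-ln q_i)
    return q, p, S

q, p, S = point_of_sigma([0.5, 0.3, 0.2])
print(p)   # surprisals
print(S)   # Shannon entropy
```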

There's a lot more to do with these ideas, and I'll continue next time.


You can read a discussion of this article on Azimuth, and make your own comments or ask questions there!


© 2021 John Baez
baez@math.removethis.ucr.andthis.edu