August 17, 2021

Information Geometry (Part 21)

John Baez

Last time I ended with a formula for the 'Gibbs distribution': the probability distribution that maximizes entropy subject to constraints on the expected values of some observables.

This formula is well-known, but I'd like to derive it here. My argument won't be up to the highest standards of rigor: I'll do a bunch of computations, and it would take more work to state conditions under which these computations are justified. But even a nonrigorous approach is worthwhile, since the computations will give us more than the mere formula for the Gibbs distribution.

I'll start by reminding you of what I claimed last time. I'll state it in a way that removes all unnecessary distractions, so go back to Part 20 if you want more explanation.

The Gibbs distribution

Take a measure space Ω with measure μ. Suppose there is a probability distribution π on Ω that maximizes the entropy

\[ -\int_\Omega \pi(x) \ln \pi(x) \, d\mu(x) \]

subject to the requirement that some integrable functions $A^1, \dots, A^n$ on Ω have expected values equal to some chosen list of numbers $q^1, \dots, q^n$.

(Unlike last time, now I'm writing $A^i$ and $q^i$ with superscripts rather than subscripts, because I'll be using the Einstein summation convention: I'll sum over any repeated index that appears once as a superscript and once as a subscript.)

Furthermore, suppose π depends smoothly on $q \in \mathbb{R}^n$. I'll call it $\pi_q$ to indicate its dependence on $q$. Then, I claim $\pi_q$ is the so-called Gibbs distribution

\[ \pi_q(x) = \frac{e^{-p_i A^i(x)}}{\int_\Omega e^{-p_i A^i(x)} \, d\mu(x)} \]

where

\[ p_i = \frac{\partial f(q)}{\partial q^i} \]

and

\[ f(q) = -\int_\Omega \pi_q(x) \ln \pi_q(x) \, d\mu(x) \]

is the entropy of $\pi_q$.
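Before we dive into the derivation, here is a minimal numerical sketch of this formula, assuming a finite Ω with counting measure so that the integrals become sums. The observable and the value of $p$ are made up purely for illustration:

```python
import numpy as np

def gibbs(p, A):
    """Gibbs distribution pi(x) = exp(-p_i A^i(x)) / Z on a finite set.

    p has shape (n,); A has shape (n, N), where row i lists the values
    of the observable A^i at the N points of Omega.
    """
    weights = np.exp(-p @ A)        # e^{-p_i A^i(x)} for each point x
    return weights / weights.sum()  # normalize by the partition function

# Example: one observable (the face value of a die) on a 6-point space.
A = np.arange(1.0, 7.0).reshape(1, 6)
pi = gibbs(np.array([0.1]), A)
print(pi.sum())   # 1.0: a genuine probability distribution
print(pi @ A[0])  # the expected value q^1 that this choice of p realizes
```

Note that the code takes the $p_i$ as given; finding the $p_i$ that realize prescribed expected values $q^i$ is exactly what the analysis below is about.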

Let's show this is true!

Finding the Gibbs distribution

So, we are trying to find a probability distribution π that maximizes entropy subject to these constraints:

\[ \int_\Omega \pi(x) A^i(x) \, d\mu(x) = q^i \]

We can solve this problem using Lagrange multipliers. We need one Lagrange multiplier, say $\beta_i$, for each of the above constraints. But it's easiest if we start by letting π range over all of $L^1(\Omega)$, that is, the space of all integrable functions on Ω. Then, because we want π to be a probability distribution, we need to impose one extra constraint

\[ \int_\Omega \pi(x) \, d\mu(x) = 1 \]

To do this we need an extra Lagrange multiplier, say γ.

So, that's what we'll do! We'll look for critical points of this function on $L^1(\Omega)$:

\[ -\int \pi \ln \pi \, d\mu - \beta_i \int \pi A^i \, d\mu - \gamma \int \pi \, d\mu \]

Here I'm using some tricks to keep things short. First, I'm dropping the dummy variable x which appeared in all of the integrals we had: I'm leaving it implicit. Second, all my integrals are over Ω so I won't say that. And third, I'm using the Einstein summation convention, so there's a sum over i implicit here.

Okay, now let's do the variational derivative required to find a critical point of this function. When I was a math major taking physics classes, the way physicists did variational derivatives seemed like black magic to me. Then I spent months reading how mathematicians rigorously justified these techniques. I don't feel like making a massive digression into all that right now, so I'll just do the calculations — and if they seem like black magic, I'm sorry!

We need to find π obeying

\[ \frac{\delta}{\delta \pi(x)} \left( -\int \pi \ln \pi \, d\mu - \beta_i \int \pi A^i \, d\mu - \gamma \int \pi \, d\mu \right) = 0 \]

or in other words

\[ \frac{\delta}{\delta \pi(x)} \left( \int \pi \ln \pi \, d\mu + \beta_i \int \pi A^i \, d\mu + \gamma \int \pi \, d\mu \right) = 0 \]

First we need to simplify this expression. The only part that takes any work, if you know how to do variational derivatives, is the first term. Since the derivative of $z \ln z$ is $1 + \ln z$, we have

\[ \frac{\delta}{\delta \pi(x)} \int \pi \ln \pi \, d\mu = 1 + \ln \pi(x) \]

The second and third terms are easy, so we get

\[ \frac{\delta}{\delta \pi(x)} \left( \int \pi \ln \pi \, d\mu + \beta_i \int \pi A^i \, d\mu + \gamma \int \pi \, d\mu \right) = 1 + \ln \pi(x) + \beta_i A^i(x) + \gamma \]

Thus, we need to solve this equation:

\[ 1 + \ln \pi(x) + \beta_i A^i(x) + \gamma = 0 \]

That's easy to do:

\[ \pi(x) = e^{-1 - \gamma - \beta_i A^i(x)} \]

Good! It's starting to look like the Gibbs distribution!

We now need to choose the Lagrange multipliers $\beta_i$ and γ to make the constraints hold. To satisfy this constraint

\[ \int \pi \, d\mu = 1 \]

we must choose γ so that

\[ \int e^{-1 - \gamma - \beta_i A^i} \, d\mu = 1 \]

or in other words

\[ e^{1 + \gamma} = \int e^{-\beta_i A^i} \, d\mu \]

Plugging this into our earlier formula

\[ \pi(x) = e^{-1 - \gamma - \beta_i A^i(x)} \]

we get this:

\[ \pi(x) = \frac{e^{-\beta_i A^i(x)}}{\int e^{-\beta_i A^i} \, d\mu} \]

Great! Even more like the Gibbs distribution!
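As a quick sanity check, here is a hedged numerical experiment (finite Ω with counting measure, and scipy assumed available): maximize the entropy directly, subject to the constraints, and confirm that the maximizer has the exponential form we just derived.

```python
import numpy as np
from scipy.optimize import minimize

A = np.arange(1.0, 7.0)  # one observable A^1 on a 6-point space
q = 3.0                  # its prescribed expected value q^1

def neg_entropy(pi):
    pi = np.clip(pi, 1e-12, None)  # keep the logarithm finite
    return np.sum(pi * np.log(pi))

constraints = [
    {'type': 'eq', 'fun': lambda pi: pi.sum() - 1.0},  # normalization
    {'type': 'eq', 'fun': lambda pi: pi @ A - q},      # expected value
]
res = minimize(neg_entropy, np.full(6, 1/6),
               bounds=[(0.0, 1.0)] * 6, constraints=constraints)

# If pi(x) = e^{-beta A(x)} / Z, then ln pi(x) is affine in A(x).
slope, intercept = np.polyfit(A, np.log(res.x), 1)
print(-slope)  # the recovered Lagrange multiplier beta
print(np.max(np.abs(np.log(res.x) - (slope * A + intercept))))  # ~ 0
```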

By the way, you must have noticed the "1" that showed up here:

\[ \frac{\delta}{\delta \pi(x)} \int \pi \ln \pi \, d\mu = 1 + \ln \pi(x) \]

It buzzed around like an annoying fly in the otherwise beautiful calculation, but eventually went away. This is the same irksome "1" that showed up in Part 19. Someday I'd like to say a bit more about it.

Now, where were we? We were trying to show that

\[ \pi_q(x) = \frac{e^{-p_i A^i(x)}}{\int e^{-p_i A^i} \, d\mu} \]

maximizes entropy subject to our constraints. So far we've shown

\[ \pi(x) = \frac{e^{-\beta_i A^i(x)}}{\int e^{-\beta_i A^i} \, d\mu} \]

is a critical point. It's clear that

\[ \pi(x) \ge 0 \]

so π really is a probability distribution. We should show it actually maximizes entropy subject to our constraints, but I will skip that. Given that, π will be our claimed Gibbs distribution $\pi_q$ if we can show

\[ p_i = \beta_i \]

This is interesting! It's saying our Lagrange multipliers $\beta_i$ actually equal the so-called conjugate variables $p_i$ given by

\[ p_i = \frac{\partial f}{\partial q^i} \]

where $f(q)$ is the entropy of $\pi_q$:

\[ f(q) = -\int_\Omega \pi_q(x) \ln \pi_q(x) \, d\mu(x) \]

There are two ways to show this: the easy way and the hard way. The easy way is to reflect on the meaning of Lagrange multipliers, and I'll sketch that way first. The hard way is to use brute force: just compute $p_i$ and show it equals $\beta_i$. This is a good test of our computational muscle — but more importantly, it will help us rediscover some interesting facts about the Gibbs distribution.

The easy way

Consider a simple Lagrange multiplier problem where you're trying to find a critical point of a smooth function

\[ f \colon \mathbb{R}^2 \to \mathbb{R} \]

subject to the constraint

\[ g = c \]

for some smooth function

\[ g \colon \mathbb{R}^2 \to \mathbb{R} \]

and constant c. (The function f here has nothing to do with the f in the previous sections, which stood for entropy.) To answer this we introduce a Lagrange multiplier λ and seek points where

\[ \nabla (f - \lambda g) = 0 \]

This works because the above equation says

\[ \nabla f = \lambda \nabla g \]

Geometrically this means we're at a point where the gradient of f points at right angles to the level surface of g.

Thus, to first order we can't change f by moving along the level surface of g.

But also, if we start at a point where

\[ \nabla f = \lambda \nabla g \]

and we begin moving in any direction, the function f will change at a rate equal to λ times the rate of change of g. That's just what the equation says! And this fact gives a conceptual meaning to the Lagrange multiplier λ.

Our situation is more complicated, since our functions are defined on the infinite-dimensional space $L^1(\Omega)$, and we have an n-tuple of constraints with an n-tuple of Lagrange multipliers. But the same principle holds.

So, when we are at a solution $\pi_q$ of our constrained entropy-maximization problem, and we start moving the point $\pi_q$ by changing the value of the ith constraint, namely $q^i$, the rate at which the entropy changes will be $\beta_i$ times the rate of change of $q^i$. So, we have

\[ \frac{\partial f}{\partial q^i} = \beta_i \]

But this is just what we needed to show!
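To see this conceptual meaning in action, here is a toy finite-dimensional check with a made-up $f$ and $g$: maximize $f(x,y) = xy$ subject to $x + y = c$. The optimum is at $x = y = c/2$, where $\nabla f = \lambda \nabla g$ gives $\lambda = c/2$, so the optimal value $f^*(c) = c^2/4$ should change at rate $c/2$ as we vary the constraint:

```python
import numpy as np
from scipy.optimize import minimize

# Maximize f(x, y) = x*y subject to g(x, y) = x + y = c,
# by minimizing -f with an equality constraint.
def f_star(c):
    res = minimize(lambda v: -v[0] * v[1], x0=[0.3, 0.7],
                   constraints=[{'type': 'eq',
                                 'fun': lambda v: v[0] + v[1] - c}])
    return -res.fun

c, h = 2.0, 1e-5
rate = (f_star(c + h) - f_star(c - h)) / (2 * h)
print(rate, c / 2)  # d f*/dc matches lambda = c/2
```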

The hard way

Here's another way to show

\[ \frac{\partial f}{\partial q^i} = \beta_i \]

We start by solving our constrained entropy-maximization problem using Lagrange multipliers. As already shown, we get

\[ \pi_q(x) = \frac{e^{-\beta_i A^i(x)}}{\int e^{-\beta_i A^i} \, d\mu} \]

Then we'll compute the entropy

\[ f(q) = -\int \pi_q \ln \pi_q \, d\mu \]

Then we'll differentiate this with respect to $q^i$ and show we get $\beta_i$.

Let's try it! The calculation is a bit heavy, so let's write Z(q) for the so-called partition function

\[ Z(q) = \int e^{-\beta_i A^i} \, d\mu \]

so that

\[ \pi_q(x) = \frac{e^{-\beta_i A^i(x)}}{Z(q)} \]

and the entropy is

\[ f(q) = -\int \pi_q \ln \left( \frac{e^{-\beta_k A^k}}{Z(q)} \right) d\mu = \int \pi_q \left( \beta_k A^k + \ln Z(q) \right) d\mu \]

This is the sum of two terms. The first term

\[ \int \pi_q \, \beta_k A^k \, d\mu = \beta_k \int \pi_q A^k \, d\mu \]

is $\beta_k$ times the expected value of $A^k$ with respect to the probability distribution $\pi_q$, all summed over k. But the expected value of $A^k$ is $q^k$, so we get

\[ \int \pi_q \, \beta_k A^k \, d\mu = \beta_k q^k \]

The second term is easier:

\[ \int_\Omega \pi_q \ln Z(q) \, d\mu = \ln Z(q) \]

since $\pi_q(x)$ integrates to 1 and the partition function $Z(q)$ doesn't depend on $x \in \Omega$.

Putting together these two terms we get an interesting formula for the entropy:

\[ f(q) = \beta_k q^k + \ln Z(q) \]

This formula is one reason this brute-force approach is actually worthwhile! I'll say more about it later.
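For instance, here is a quick numerical check of this formula in a finite toy setting (counting measure on six points, with a made-up observable and multiplier):

```python
import numpy as np

A = np.arange(1.0, 7.0)  # one observable on a 6-point space
beta = 0.3               # a chosen Lagrange multiplier

Z = np.exp(-beta * A).sum()  # partition function Z(q)
pi = np.exp(-beta * A) / Z   # the Gibbs distribution pi_q
q = pi @ A                   # the expected value it realizes

f_direct = -np.sum(pi * np.log(pi))  # entropy, straight from the definition
f_formula = beta * q + np.log(Z)     # entropy, from beta_k q^k + ln Z(q)
print(f_direct, f_formula)           # the two agree
```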

But for now, let's use this formula to show what we're trying to show, namely

\[ \frac{\partial f}{\partial q^i} = \beta_i \]

For starters,

\[
\begin{aligned}
\frac{\partial f}{\partial q^i} &= \frac{\partial}{\partial q^i} \left( \beta_k q^k + \ln Z(q) \right) \\
&= \frac{\partial \beta_k}{\partial q^i} \, q^k + \beta_k \frac{\partial q^k}{\partial q^i} + \frac{\partial}{\partial q^i} \ln Z(q) \\
&= \frac{\partial \beta_k}{\partial q^i} \, q^k + \beta_k \delta^k_i + \frac{\partial}{\partial q^i} \ln Z(q) \\
&= \frac{\partial \beta_k}{\partial q^i} \, q^k + \beta_i + \frac{\partial}{\partial q^i} \ln Z(q)
\end{aligned}
\]

where we played a little Kronecker delta game with the second term.

Now we just need to compute the third term:

\[
\begin{aligned}
\frac{\partial}{\partial q^i} \ln Z(q) &= \frac{1}{Z(q)} \frac{\partial}{\partial q^i} Z(q) \\
&= \frac{1}{Z(q)} \frac{\partial}{\partial q^i} \int e^{-\beta_j A^j} \, d\mu \\
&= \frac{1}{Z(q)} \int \frac{\partial}{\partial q^i} \left( e^{-\beta_j A^j} \right) d\mu \\
&= -\frac{1}{Z(q)} \int \frac{\partial}{\partial q^i} \left( \beta_k A^k \right) e^{-\beta_j A^j} \, d\mu \\
&= -\frac{1}{Z(q)} \int \frac{\partial \beta_k}{\partial q^i} \, A^k e^{-\beta_j A^j} \, d\mu \\
&= -\frac{\partial \beta_k}{\partial q^i} \frac{1}{Z(q)} \int A^k e^{-\beta_j A^j} \, d\mu \\
&= -\frac{\partial \beta_k}{\partial q^i} \, q^k
\end{aligned}
\]

In the last step we used the fact that $\frac{1}{Z(q)} \int A^k e^{-\beta_j A^j} \, d\mu = \int \pi_q A^k \, d\mu = q^k$, which is just our constraint.

Ah, you don't know how good it feels, after years of category theory, to be doing calculations like this again!
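The fact doing the real work in that chain of equalities is that differentiating $\ln Z$ with respect to the multipliers reproduces minus the expected values. Here is a finite-difference check of that fact, in the same toy setting as above:

```python
import numpy as np

A = np.arange(1.0, 7.0)
beta, h = 0.3, 1e-6

def ln_Z(b):
    return np.log(np.exp(-b * A).sum())

pi = np.exp(-beta * A) / np.exp(-beta * A).sum()
q = pi @ A  # expected value of A under the Gibbs distribution

# d(ln Z)/d(beta) should equal -q: the step used at the end of the chain.
print((ln_Z(beta + h) - ln_Z(beta - h)) / (2 * h), -q)
```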

Now we can finish the job we started:

\[
\begin{aligned}
\frac{\partial f}{\partial q^i} &= \frac{\partial \beta_k}{\partial q^i} \, q^k + \beta_i + \frac{\partial}{\partial q^i} \ln Z(q) \\
&= \frac{\partial \beta_k}{\partial q^i} \, q^k + \beta_i - \frac{\partial \beta_k}{\partial q^i} \, q^k \\
&= \beta_i
\end{aligned}
\]

Voilà!

Conclusions

We've learned the formula for the probability distribution that maximizes entropy subject to some constraints on the expected values of observables. But more importantly, we've seen that the anonymous Lagrange multipliers $\beta_i$ that show up in this problem are actually the partial derivatives of entropy! They equal

\[ p_i = \frac{\partial f}{\partial q^i} \]

Thus, they are rich in meaning. From what we've seen earlier, they are 'surprisals'. They are analogous to momentum in classical mechanics and have the meaning of intensive variables in thermodynamics:

       Classical Mechanics    Thermodynamics          Probability Theory
  q    position               extensive variables     probabilities
  p    momentum               intensive variables     surprisals
  S    action                 entropy                 Shannon entropy

Furthermore, by showing $\beta_i = p_i$ the hard way we discovered an interesting fact. There's a relation between the entropy and the logarithm of the partition function:

\[ f(q) = p_i q^i + \ln Z(q) \]

(We proved this formula with $\beta_i$ replacing $p_i$, but now we know those are equal.)

This formula suggests that the logarithm of the partition function is important — and it is! It's closely related to the concept of free energy — even though 'energy', free or otherwise, doesn't show up at the level of generality we're working at now.

This formula should also remind you of the tautological 1-form on the cotangent bundle $T^\ast Q$, namely

\[ \theta = p_i \, dq^i \]

It should remind you even more of the contact 1-form on the contact manifold $T^\ast Q \times \mathbb{R}$, namely

\[ \alpha = -dS + p_i \, dq^i \]

Here S is a coordinate on the contact manifold that's a kind of abstract stand-in for our entropy function f.
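Indeed, a one-line computation shows why the analogy is apt: along the submanifold where $S = f(q)$ and $p_i = \partial f / \partial q^i$, the contact form vanishes:

\[ \alpha = -dS + p_i \, dq^i = -\frac{\partial f}{\partial q^i} \, dq^i + \frac{\partial f}{\partial q^i} \, dq^i = 0 \]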

So, it's clear there's a lot more to say: we're seeing hints of things here and there, but not yet the full picture.


You can read a discussion of this article on Azimuth, and make your own comments or ask questions there!


© 2021 John Baez