The prevalence of Kullback-Leibler
How's this for an explanation of the prevalence of the Kullback-Leibler divergence:
Much statistical inference takes the form of finding an optimal distribution satisfying some set of constraints. Very often these constraints are such that if any two distributions P and Q satisfy them, then so does every mixture of the form bP + (1 - b)Q, with 0 <= b <= 1. This is what Amari calls m-flatness (m for mixture), i.e., these paths of mixtures are geodesics with respect to the m-connection. Now, the dual affine connection to the m-connection is the e-connection (e for exponential), and e-flat manifolds of distributions are the ubiquitous exponential families. (To see that e-flatness is not the same as m-flatness, consider that mixtures of Gaussians are not generally Gaussian.) Minimizing the KL divergence (relative entropy) over the distributions satisfying the constraints is equivalent to finding where the exponential family meets the space of constrained distributions.
So my question is whether presenting the m-flatness idea first, as arising out of common-or-garden linear constraints, such as fixing the values of moments, is a good way to motivate the KL-divergence. Then it would be interesting to think about other types of constraints which would lead to flatness with other connections, and what the generalized exponential families, flat according to the dual affine connection, would look like.
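Here is a minimal numerical sketch of that last equivalence (assuming NumPy and SciPy; the reference distribution q, the statistic f, and the constraint level c are made up for illustration). It minimizes KL(p || q) over a moment-constrained set - whose linearity in p is exactly the m-flatness above - and checks that the minimizer has the exponential-family form p(x) proportional to q(x) exp(lambda f(x)).

# Sketch: minimizing KL over a moment constraint lands in an exponential family.
# q, f and c below are made-up illustrative choices, not anything from the post.
import numpy as np
from scipy.optimize import minimize
from scipy.special import rel_entr

rng = np.random.default_rng(0)
N = 6
q = rng.random(N); q /= q.sum()            # reference distribution
f = np.arange(N, dtype=float)              # statistic whose mean we constrain
c = 2.0                                    # required value of E_p[f]

def kl(p, q):
    return rel_entr(p, q).sum()            # sum_i p_i ln(p_i / q_i)

# The constraint set {p : p >= 0, sum_i p_i = 1, E_p[f] = c} is linear in p,
# so it is closed under mixtures bP + (1 - b)Q: this is the m-flatness.
cons = [{"type": "eq", "fun": lambda p: p.sum() - 1.0},
        {"type": "eq", "fun": lambda p: p @ f - c}]
res = minimize(lambda p: kl(p, q), x0=np.full(N, 1.0 / N),
               bounds=[(1e-12, 1.0)] * N, constraints=cons)
p_star = res.x

# Exponential-family check: ln(p*/q) should be affine in f,
# i.e. ln p*_i - ln q_i = lambda * f_i + const.
lam, const = np.polyfit(f, np.log(p_star / q), 1)
dev = np.max(np.abs(np.log(p_star / q) - (lam * f + const)))
print("KL minimizer:", np.round(p_star, 4))
print("natural parameter lambda ~", round(lam, 4), "; deviation from affine form:", dev)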
7 Comments:
Personally, I don't see much difficulty motivating the "Kullback-Leibler divergence". I'm shocked, though, to hear that this notion I've known since college is called the "Kullback-Leibler divergence"! I've always heard it called "relative entropy". I first read about it in Everett's famous 1957 thesis, "On the Foundations of Quantum Mechanics", which started the "relative state" or "many-worlds" interpretation of quantum mechanics. I doubt Everett was the first to notice it. Were Kullback and Leibler the first to invent it? Who are these guys?
You've probably explained the entropy distance before, but let me do it again - just to show why such a beautiful simple concept shouldn't be saddled with a scary name like "Kullback-Leibler divergence". Important concepts deserve snappy, descriptive names.
There's a famous concept called entropy. All physicists know the formula:
S(p) = - sum_i p(i) ln(p(i))
where p is a probability measure on some finite set of states, and we sum over all the states i. It tells us how surprised we should be to find a system distributed according to the probabilities p, if we originally thought all the states were equally likely.
I've written down the "discrete" version involving sums, but there's also a "continuous" version involving integrals, and then it becomes clearer that entropy depends on two measures, often written dx and p(x)dx. The measure dx is given from the start - it's actually a "prior" in the Bayesian sense. The entropy tells us how surprised we should be when we find a system distributed according to the probability measure p(x)dx. We normally write it like this:
S(p) = - integral p(x) ln(p(x)) dx
But to clarify the role of the two measures, let's write
dm = p(x)dx
dn = dx
so that
p(x) = dm/dn
is the so-called "Radon-Nikodym derivative" of dm with respect to dn.
Then
S(p) = - integral ln(dm/dn) dm
The integral on the right-hand side deserves a name of its own, "the entropy of m relative to n":
S(m|n) = integral ln(dm/dn) dm
so the ordinary entropy is just minus the entropy of m = p(x)dx relative to n = dx.
This tells you how much information you get, or how surprised you should be, when you thought the probability measure describing some event was n, and you find out it is m. Written out against the base measure n, it's given by
S(m|n) = integral (dm/dn)(x) ln((dm/dn)(x)) dn(x)
This is the correct way to think about entropy, in that it fully recognizes that two measures are involved in the definition. Often the measure dn is a "bland" sort of measure like Lebesgue measure or counting measure, and then we can easily ignore its role - but it's there!
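Here is a tiny numerical sketch of these formulas in the discrete case (assuming NumPy and SciPy; the distributions m and n are made up for illustration):

# Relative entropy S(m|n) = sum_i m_i ln(m_i / n_i) for discrete distributions,
# and ordinary entropy as minus the entropy relative to counting measure.
# The particular m and n are made-up examples.
import numpy as np
from scipy.special import rel_entr

m = np.array([0.5, 0.3, 0.2])
n = np.array([0.25, 0.25, 0.5])

print("S(m|n) =", rel_entr(m, n).sum())    # >= 0, and 0 only when m = n

# With dn = counting measure (weight 1 on each state), dm/dn = m_i, so
# S(m|counting) = sum_i m_i ln m_i, and the ordinary entropy is minus that.
counting = np.ones_like(m)
print("S(m) =", -(m * np.log(m)).sum(),
      "= -S(m|counting) =", -rel_entr(m, counting).sum())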
Everett went further, and noticed a wonderful fact: you can use this concept to define a "distance" between probability measures, say d(m,n), which satisfies all the usual axioms:
d(m,m) = 0
d(m,n) >= 0
d(m,n) = d(n,m)
d(m,n) <= d(m,p) + d(p,n)
He calls this the "entropy metric". I was thrilled when my college pal Bruce Smith and I read Everett's thesis and checked these properties. Distance from entropy!
I wrote:
You've probably explained the entropy distance before [...]
I meant to write:
You've probably explained relative entropy before [...]
It would be great if you could change that and delete this post. Too bad there's no way to correct posts here like there is on Physics Forums. (I've been hanging out there a bit lately.)
I think your options are:
a) post the corrected version and I delete the first one.
b) This page claims you can delete any of your own comments if you're registered. Then repost the correct version.
Presumably Everett symmetrises S to derive d. Of course, we know from Lawvere that there's really no need to introduce this symmetrisation too early. In a world of one-way systems it really can be further from A to B than from B to A.
I'm not an expert, but I think Kullback and Leibler defined the concept. At the very least, they introduced it to the wider statistical audience. (I thought they actually called it "relative entropy", though.)
I guess I won't bother deleting my post and posting a corrected version, since that would then show up after posts commenting on it.... grrr!
Did Kullback and Leibler really do their thing before 1957? I'm glad they had the sense to call it "relative entropy" - it's really just the realization that entropy is always defined relative to a prior. This was concealed by the fact that the prior is often a bland one like normalized counting measure on a finite set, or a bland and infinite one like Lebesgue measure on the real line.
In statistics, infinite measures are called improper priors, which conjures up images of hanky panky in the abbey. Improper priors show up when you tackle such chestnuts as "can you pick a random real number in a uniformly distributed way?" And the paradoxes you get into before you understand this stuff....
In quantum statistical mechanics, improper priors are called weights - a generalization of the (pure or mixed) states we know and love. When we compute the entropy of a density matrix
S(D) = - tr(D ln D)
we are really computing (minus) its entropy relative to a weight corresponding to the (non-normalizable) density matrix I - the identity matrix: since ln I = 0, the entropy of D relative to I is tr(D ln D - D ln I) = tr(D ln D). The identity matrix corresponds to maximal ignorance.
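A small numerical sketch of that last point (assuming NumPy and SciPy; the density matrix D is an arbitrary made-up example):

# von Neumann entropy S(D) = -tr(D ln D) and the entropy of D relative to the
# identity-matrix weight, tr(D(ln D - ln I)); since ln I = 0 they differ only in sign.
import numpy as np
from scipy.linalg import logm

rng = np.random.default_rng(1)
A = rng.random((3, 3))
D = A @ A.T
D /= np.trace(D)                 # a valid density matrix: positive, trace 1

I = np.eye(3)
S_vN = -np.trace(D @ logm(D)).real
S_rel_to_I = np.trace(D @ (logm(D) - logm(I))).real

print("S(D) =", S_vN)
print("entropy of D relative to I =", S_rel_to_I, "= -S(D)")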
David writes:
Presumably Everett symmetrises S to derive d. Of course, we know from Lawvere that there's really no need to introduce this symmetrisation too early. In a world of one-way systems it really can be further from A to B than from B to A.
I screwed up my first post because I originally thought "Kullback-Leibler divergence" was a name for entropy distance - I only noticed midstream, after a little browsing, that it referred to relative entropy.
Yes, I guess we get a notion of "entropy distance" by symmetrizing the relative entropy:
d(m,n) = S(m|n) + S(n|m)
And yes, it's a good point that non-symmetric "metrics" are common in real life. Watching people drive in and out of LA during the morning rush hour makes this very vivid.
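For what it's worth, here is a tiny numerical sketch of that symmetrized quantity for discrete distributions (assuming NumPy and SciPy; the distributions are made up):

# Symmetrized relative entropy d(m, n) = S(m|n) + S(n|m) for discrete distributions.
# By construction it is symmetric, non-negative, and zero exactly when m = n.
import numpy as np
from scipy.special import rel_entr

def d(m, n):
    return rel_entr(m, n).sum() + rel_entr(n, m).sum()

m = np.array([0.5, 0.3, 0.2])
n = np.array([0.2, 0.5, 0.3])

print("d(m, n) =", d(m, n))
print("d(n, m) =", d(n, m))      # same value: the symmetrization at work
print("d(m, m) =", d(m, m))      # 0.0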
But right now, I seem to recall that Everett's entropy distance was for random variables, not probability measures. He defined it in terms of relative entropy for probability measures, though.
John Baez said:
"Did Kullback and Leibler really do their thing before 1957?"
Wikipedia ("Kullback-Leibler Divergence") cites the following paper:
S. Kullback and R. A. Leibler. On information and sufficiency. Annals of Mathematical Statistics, 22(1):79–86, March 1951.