## Friday, June 09, 2006

### Tsallis entropy

My research in the theory of machine learning has got me thinking hard about maximum entropy (MaxEnt). It transpires that a whole host of machine learning algorithms are applications of this idea - maximising the Shannon entropy of a distribution subject to various constraints. See, e.g., Y. Altun and A. Smola, Unifying Divergence Minimization and Statistical Inference via Convex Duality. Not only do we find a duality between maximum likelihood estimation and MaxEnt with point constraints, but also one between Maximum a posteriori probability (MAP) estimation and MaxEnt with approximate constraints on expectations.

This has led me into the terrifying vast literature on maximum entropy. But it doesn't stop there. There are many who hold that the Shannon entropy is just one entropy, and the corresponding Kullback-Leibler divergence just one divergence. An important new entropy on the scene is the so-called Tsallis entropy, see here for an introduction by Tsallis himself.

Now, readers of this blog will know I'm interested in so-called q-deformation of mathematics (here and here). So, I'm intrigued to find that this Tsallis entropy can be thought of a q-deformation of the Shannon entropy. Presumably, then, just as so many of the standard distributions (Gaussian, Poisson, etc.) can be expressed in MaxEnt terms , there are corresponding q-deformations. And presumably there are q-deformations of Gaussian processes, and, who knows, these may find their uses in machine learning.

Walt said...

I've spent the last 24 hours obsessed with MaxEnt, so I found it disturbing to see you mention it today. :-)

June 10, 2006 3:10 AM
david said...

Just a single point on the path to the whole world contemplating MaxEnt.

There does seem to be the sense abroad that Shannon entropy is to entropy as Euclidean geometry is to geometry. Just as one can depart from Euclidean geometry by the steady removal of assumptions (constant curvature other than 1, variable curvature, differential topology, topology, set theory), there seem to be many variants of the notion of entropy. There's a nice geometric interpretation of the distributions that maximise Tsallis entropy here. If you place a uniform distribution on the n-sphere of radius root n, and project onto m of the co-ordinates, Poincare had observed that in the limit as n tends to infinity, this m-vector is distributed according to an m-variate Gaussian distribution with unit covariance matrix. Now these Gaussian distributions are maximisers of the Shannon entropy with this covariance matrix. Some of the Tsallis MaxEnt distributions come from projecting m dimensions from a uniform distribution over an n-sphere. You can pick up distributions corresponding to different covariance matrices by looking at uniform distributions over corresponding ellipsoids.

That gets me thinking, how to generate an m-variate Gaussian with non-identity covariance matrix. You kinda need an infinite dimensional matrix. Is this where Gaussian processes appear, projecting out from infinite dimensional ellipsoids?

June 10, 2006 10:32 AM
walt said...

I don't know much about Tsallis entropy, but Cosma Shalizi seems negative about the idea here.

June 12, 2006 4:48 AM
david said...

It certainly is controversial. I see Tsallis has teamed up with Murray Gell-Mann - useful backing. There's some interesting work by Flemming Topsoe on a game theoretic interpretation of entropy. MaxEnt distributions are stable points in a game against an adversary. It's a broad enough framework to take in other entropies. Tsallis entropies have one nice feature in that their complexity functions factorise into an escort function and a surprise factor.

June 12, 2006 7:32 AM
Deane said...

Maximizing Tsallis entropy is equivalent to maximizing Renyi entropy, but Renyi entropy fits in much more elegantly with other mathematical ideas. See paper #13 on my web page.

June 15, 2006 2:58 AM
david said...

Deane,

Interesting. Google shows plenty of returns for "generalized Gaussian process". I wonder if these would correspond to your sense of generalized Gaussian (p. 2).

June 15, 2006 2:01 PM
david said...

The Student t distribution appears to be a form of maximum Renyi entropy distribution, and there are such things as Student t processes (section 9.9 of this book).

June 20, 2006 10:39 AM
John Baez said...

Now, readers of this blog will know I'm interested in so-called q-deformation of mathematics (here and here). So, I'm intrigued to find that this Tsallis entropy can be thought of a q-deformation of the Shannon entropy.

I don't think so. His "q-derivative" looks completely different from the q-derivative used in quantum group theory and other branches of "q-mathematics". So his "q-exponential", which has q-derivative equal to itself, is not the same as the q-exponential used in these other areas.

The usual q-exponential is defined by taking the Taylor series for the exponential and replacing factorials by q-factorials. Tsallis defines his to be the limit as q -> 1 of

(1 + (1-q)x)^{1/(1-q)}

I think they're just different.

For a feeble attempt at synthesizing Tsallis' q-exponential and the ordinary one, see Jaganathan and Sinha. This paper supports my suspicion that the main relationships between Tsallis' q-deformation and the usual one are:

1) it's a deformation

2) the deformation parameter is called "q".

So far I have not seen any principled derivation of this Tsallis entropy business, so
I'm tempted to agree with Cosma Shalizi's negative assessment. But, if someone could explain to me what assumptions lead to Tsallis entropy, I might change my tune.

June 28, 2006 7:45 PM
John Baez said...

There's a quantity called
"q-Renyi entropy" which generalizes the ordinary entropy as follows: given a probability measure space (M,dx) and another probability measure dp on this space, the q-Renyi entropy of dp with respect to dx is:

[integral_X (dp/dx)^q dx]^{-1/q}

There is one q-Renyi entropy for each reasonable choice of the number q. As q -> 0, the q-Renyi entropy approaches the exponential of the usual Shannon-Boltzmann entropy of dp relative to dx.

Renyi entropy looks more sensible to me than Tsallis entropy, because I don't know any principles that lead to Tsallis entropy, but there is a nice theorem due to Shore and Johnson which justifies the use of Renyi entropy. Their theorem says that a recipe for updating probability distributions given new data must amount to choosing the distibution that maximizes some q-Renyi entropy - at least if the recipe satisfies 5 plausible axioms.

So, the usual "maximum entropy principle" is one of a manageable continuum of principles for statistical inference, namely the "maximum q-Renyi entropy principles".

Deane wrote:

Maximizing Tsallis entropy is equivalent to maximizing Renyi entropy, but Renyi entropy fits in much more elegantly with other mathematical ideas. See paper #13 on my web page.

This would be nice, since it would allow me to ignore Tsallis entropy to a certain extent, and just learn about Renyi entropy.

Is the Renyi entropy just the -1/qth power of the Tsallis entropy???

June 29, 2006 5:33 AM
John Baez said...

John wrote:

Is the Renyi entropy just the -1/qth power of the Tsallis entropy???

No, and I got the definition of Renyi entropy slightly wrong: it's really the logarithm of what I wrote down.

But, they're close enough that it's easy to see that maximizing one is the same as maximizing the other. There's an paper by Grendar and Grendar which points this out and says other interesting stuff...

So, I can now ponder the axioms which uniquely pick out maximization of q-Renyi entropy for various possible values of q as the only sensible principles for updating ones probability distributions. I want to see how these are related to Bayes' theorem.

June 29, 2006 6:01 AM
david said...

Yes, their deformed exp seems to be designed to satisfy:
exp[a].exp[b] = exp[a + b] with a deformed '.' and '+', whereas you have written about the deformation of the coefficients of a series.

Speaking of deformed probability distributions, there is the idea that the q = 0 version of the Gaussian is the Wigner semicircular distribution (see, eg., this). This is the kind of deformation you're happy with, and points to the vast subject of free probability (see Speicher and Voiculescu).

Their theorem says that a recipe for updating probability distributions given new data must amount to choosing the distibution that maximizes some q-Renyi entropy - at least if the recipe satisfies 5 plausible axioms.

It's always good to bear in mind just how plausible a requirement it appeared to be once that for any triangle there is a similar, non-congruent one. But sure, this is useful work.

But as I have said in other posts and comments, I think the big show in town is information geometry, where a host of divergence functions flourish. All we need is for you to explain to us papers like this one. It's fascinating that so many techniques in machine learning are just projections along geodesics.

Could there be scope for categorified differential geometry to liven up information geometry?

June 29, 2006 10:51 AM
david said...

Something to reassure one about a new entropy is whether maximising it with respect to some constraint gives a distribution we already know about. We know that all our favourite distributions (normal, Poisson, geometric, Binomial) can be seen to arise by maximising Shannon entropy.

So, it's interesting to see here then that the Student's t- and Student's r- distributions arise by maximising the Renyi entropy. A variant of that q-deformed exponential discussed above appears in equation (2).

July 11, 2006 11:22 AM
Deane said...

I view Tsallis entropy as a linear approximation to Renyi entropy.

Renyi entropy is definitely much cleaner to work with. For example, the classical notion of relative entropy extends very nicely to a notion of relative Renyi entropy. Also, all the known inequalities relating Shannon entropy, the second moment, and Fisher information (e.g., Cramer-Rao) extend very nicely to Renyi entropy, p-th moments, and a generalized notion of Fisher information with a single unified proof (inspired by beautiful work of Cordero-Nazaret-Villani on sharp Sobolev inequalities).

All of this is in the paper I cited in my first post.

July 18, 2006 5:06 PM
Philip Dawid said...

I just came across this discussion by chance.

There is in fact a very wide range of functions that can reasonably be described as entropies. Any statistical decision problem can be used to define an associated entropy, as well as a whole battery of other associated concepts, such as proper scoring rule, divergence function, decision geometry,... Many nice properties (including fo example Pythagorean equalities and inequalities) can be shown at this high level of generality.

For further details see:
Gr¨unwald, P. D. and Dawid, A. P. (2004). Game theory, maximum entropy, minimum discrepancy,
and robust Bayesian decision theory. Ann. Statist. 32, 1367–1433
-- which extends the game-theoretic framwork of Topsoe;

and, for specifically geometrical aspects, my research report 268
"The geometry of proper scoring rules", available at http://tinyurl.com/j94vk,
which is to be published in the Annals of the Institute of Statistical Mathematics
(since recently discovering Tsallis I have now renamed the "power entropy" that appears in this after him).

August 18, 2006 4:00 PM
Anonymous said...

I think that Tsallis entropy could be regarded as an appropriate measure for fractal phase space in Statistical Mechanics. The following article on the microcanonical expression for this entropy might be helpful:
http://arxiv.org/abs/cond-mat/0508606

September 17, 2006 11:16 PM