Conditionalization as I-projection
First Greenspan, “Risk and Uncertainty in Monetary Policy,” American Economic Review, May 2004, 33-40.
“In essence, the risk management approach to monetary policy-making is an application of Bayesian decision-making.” (p. 37)
“Our problem is not, as is sometimes alleged, the complexity of our policy-making process, but the far greater complexity of a world economy whose underlying linkages appear to be continuously evolving. Our response to that continuous evolution has been disciplined by the Bayesian type decision-making in which we have engaged.” (p. 39)
Now it's the turn of the US Food and Drug Administration to come out in favour of Bayesianism.
Thanks to Yet Another Machine Learning Blog for this. An earlier post on this interesting blog - Maximum entropy and Bayesian updating - presents the following example from Kass of a possible clash between maximising entropy and conditionalization:
Consider a die (6 sides), and consider the prior knowledge E[X] = 3.5.
Maximum entropy leads to P(X)= (1/6, 1/6, 1/6, 1/6, 1/6, 1/6).
Now consider a new piece of evidence A="X is an odd number"
Bayesian posterior: P(X|A) = P(A|X)P(X)/P(A) = (1/3, 0, 1/3, 0, 1/3, 0).
But MaxEnt with the constraints E[X] = 3.5 and E[indicator function of A] = 1 leads to (.22, 0, .32, 0, .47, 0)! (Note that E[indicator function of A] = P(A).)
Indeed, for MaxEnt, because the '6' is no longer available, the large numbers must become more probable to keep the average at 3.5. For Bayesian updating, P(X|A) doesn't have to have expectation 3.5: P(X) and P(X|A) are different distributions. Conclusion? MaxEnt and Bayesian updating are two different principles leading to different belief distributions. Am I right?
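First, a quick numerical check of both figures (a sketch in Python using scipy, not part of the quoted post; the odd-support constraint is imposed by optimising over the odd faces only):

```python
import numpy as np
from scipy.optimize import minimize

prior = np.full(6, 1 / 6)                        # MaxEnt under E[X] = 3.5 alone

# Conditionalization on A = "X is odd": zero out the even faces and renormalize.
posterior = prior * np.array([1, 0, 1, 0, 1, 0])
posterior /= posterior.sum()
print(np.round(posterior, 2))                    # [0.33 0.   0.33 0.   0.33 0.  ]

# MaxEnt under both constraints: support {1, 3, 5} and mean 3.5.
odd_faces = np.array([1.0, 3.0, 5.0])

def neg_entropy(q):
    q = np.clip(q, 1e-12, 1.0)                   # avoid log(0) at the boundary
    return np.sum(q * np.log(q))

constraints = [{"type": "eq", "fun": lambda q: q.sum() - 1.0},
               {"type": "eq", "fun": lambda q: q @ odd_faces - 3.5}]
res = minimize(neg_entropy, np.full(3, 1 / 3),
               bounds=[(0.0, 1.0)] * 3, constraints=constraints)
print(np.round(res.x, 2))                        # [0.22 0.32 0.47]
```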
Example 3 on p. 4 of Information topologies with applications by Peter Harremoes provides the answer here. Passing from a distribution P(X) to P(X|A) is just one simple case of a general process of projection from a point to a subspace of a space of distributions. Let P(X) be a distribution and A an event such that P(A) > 0. Let C(A) be the set of distributions Q with Q(A) = 1. Then P(·|A) is the closest element of C(A) to P in the sense of Kullback-Leibler distance (relative entropy). It is a robust Bayes act to update thus.
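To spell out why (my unpacking, for the discrete case): for any Q with Q(A) = 1, using P(x|A) = P(x)/P(A) on A,

```latex
\begin{align*}
D(Q \,\|\, P)
  &= \sum_{x \in A} Q(x)\,\log\frac{Q(x)}{P(x)} \\
  &= \sum_{x \in A} Q(x)\,\log\frac{Q(x)}{P(x \mid A)}
   + \sum_{x \in A} Q(x)\,\log\frac{P(x \mid A)}{P(x)} \\
  &= D\big(Q \,\|\, P(\cdot \mid A)\big) - \log P(A)
   \;\geq\; -\log P(A),
\end{align*}
```

with equality exactly when Q = P(·|A). So the conditional distribution is the I-projection of P onto C(A), and the minimal divergence is -log P(A).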
More technically, C(A) is 'm-flat' in the sense of Amari, i.e., if Q and R are in C(A) then so is bQ + (1 - b)R for b in [0, 1]. The projection of P onto C(A) along the dual e-connection is P(·|A). Forming the conditional distribution is but one small example of Csiszar's I-projection, which may use divergences other than the Kullback-Leibler divergence.
Back to Kass's example: the MaxEnt formulation projects onto the manifold of distributions satisfying both of the constraints, rather than just the one, as in the case of conditionalization.
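For the record, the two-constraint projection can be found in closed form (a quick derivation of the figures quoted above): the MaxEnt distribution supported on {1, 3, 5} with mean 3.5 has the exponential form q_i proportional to e^{λi}, so writing r = e^{2λ},

```latex
\frac{1 + 3r + 5r^{2}}{1 + r + r^{2}} = 3.5
\quad\Longrightarrow\quad 3r^{2} - r - 5 = 0
\quad\Longrightarrow\quad r = \frac{1 + \sqrt{61}}{6} \approx 1.468,
```

giving q ≈ (0.216, 0.318, 0.466) on the odd faces, i.e. the (.22, .32, .47) above.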
5 Comments:
Nice puzzle! Your analysis of it is a bit scary and technical. I don't mind technical math, but I always like to understand what's going on in simple terms, so let me try that here.
I figure that whenever two sensible procedures give you different answers to a question, what you've really got is two different questions masquerading as one. Let me try to see what these questions are in this example.
In the first question, you start with a die where, for whatever reason, you feel sure the probability for any side to come up is 1/6. Someone rolls it and says "an even number came up!" Now you say that the chance of a 2, 4, or 6 is 1/3.
In the second question you've got a die of dubious fairness. Someone tells you that the mean of the numbers that come up when you roll this die is 3.5; they also tell you that it always comes up 2, 4, or 6. So, you dream up the maximum entropy distribution meeting both these constraints.
If you phrase the puzzle this way, it doesn't sound surprising that the two questions have different answers.
Of course I'm trying to make the questions sound as different as possible, by not even mentioning maximum entropy in the first version: you're just sure, for some mysterious reason, that the die is fair!
Actually, what I'd really like to do is downplay the battle between MaxEnt and Bayes, and focus on some other niggly issues.
For example, you might say I'm being tendentious by saying that you're "sure" the die is fair in the first question. But, I think something is different in the second question: you are using the information about "mean 3.5" and "only even sides come up" at the same time, instead of taking the first as a god-given prior and only then taking into account the other piece of information.
This makes me think of noncommuting projections. If statistical reasoning amounts to starting with a point in the space of probability distributions and repeatedly projecting it onto subspaces as you get new information, these projections might not commute, in which case your final probability distribution would depend on the order in which you received the information.
PQ needn't equal QP. There's also the commutative product of two projections P and Q formed by taking the limit of (PQ)^n as n -> infinity. The resulting projection projects onto the intersection of the ranges of P and Q. Could something like this be involved in the second question you pose?
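Here's a tiny numerical illustration (a sketch in NumPy; the two planes are just an example I picked):

```python
import numpy as np

def proj(B):
    """Orthogonal projection onto the column span of B."""
    B = np.asarray(B, dtype=float)
    return B @ np.linalg.inv(B.T @ B) @ B.T

P = proj([[1, 0], [0, 1], [0, 0]])   # projection onto the x-y plane
Q = proj([[1, 0], [0, 1], [0, 1]])   # projection onto the plane spanned by e1 and e2 + e3

print(np.allclose(P @ Q, Q @ P))     # False: the two projections don't commute

# Powers of PQ converge to the projection onto the intersection of the
# ranges, here the line spanned by e1.
M = np.linalg.matrix_power(P @ Q, 60)
print(np.round(M, 6))                # diag(1, 0, 0)
```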
Isn't it bad if your statistical reasoning depends on the order in which you got your data? I'm sure this actually happens with people in real life. You think differently about someone if he's been your friendly neighbor for years and you then hear he might have murdered his wife, than if you first hear that someone might have murdered his wife and then have him as your friendly neighbor for years. Early assumptions tend to get "locked in". But it's not clear that this effect is "good".
The business of noncommuting projections and order-dependence also makes me think of curved connections... you've been hinting that these are important too, right?
Yes, part of the problem here is mixing up information about a single throw (that it is odd) with information about the whole throwing situation (that the mean is 3.5). But things are not so different between being told, of the single throw we're interested in, that it's odd, and being told that the die will only ever show odd numbers.
I think the best justification for MaxEnt here is that by using it you're guaranteed a certain gain. Let's look at the easier case of a game played with points in R^3. You start with a point A. I now choose another point B, and I tell you a plane P it's sitting on. You've now got to choose a point C in P. Your gain will be AB^2 - CB^2. You can assure yourself of the gain AC^2 by choosing C to be the perpendicular projection of A onto P: by Pythagoras, AB^2 = AC^2 + CB^2 for every B in P. Any other choice of C and you risk making a loss.
MaxEnt is very much like this, except that the game is played with a different loss function, linked to the Kullback-Leibler divergence. Other loss functions correspond to other divergences.
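Concretely, the analogue of Euclidean Pythagoras is Csiszar's Pythagorean inequality: if P* is the I-projection of P onto a convex set C of distributions, then

```latex
D(Q \,\|\, P) \;\geq\; D\big(Q \,\|\, P^{*}\big) + D\big(P^{*} \,\|\, P\big)
\qquad \text{for all } Q \in \mathcal{C},
```

so announcing P* rather than sticking with P guarantees an expected log-loss gain of at least D(P*||P), whichever Q in C turns out to be the truth, just as choosing the perpendicular foot guaranteed the gain AC^2 above.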
To the extent that we stick with KL-divergence, successive projections onto subspaces commute. The EM-algorithm, which I keep meaning to look at, uses different types of projection, indeed, the e- and m- of my post.
Thinking a bit more about this, perhaps things are not so simple. Here are two more situations:
A) You're told the mean is 5.9, and that the next throw is either a 1 or a 6.
B) You're told the mean is 5.9, and that the next throw is either a 1 or a 5.
Clearly you can't treat case B as though it concerned two constraints (the second being that the die can only show 1 or 5), since they are incompatible. On the other hand, knowing the mean is so high has to make you favour the 6 in A, and surely likewise the 5 in B.
I think what you have to do is consider two spaces. Take the probabilities of sequences of throws to be exchangeable, so by de Finetti's result your degrees of belief are representable as a density over the simplex {(p_1, ..., p_6) : p_i >= 0, sum_i p_i = 1}. Given no information, this density should be symmetric about the uniform distribution. Either you run this as movement within this simplex starting from the central point, or else as movement within the space of densities over the simplex. Information such as that about the mean applies to these spaces, forcing you to restrict (either point or density) to the relevant hyperplane. On the other hand, information about possible values of the next throw has no effect.
Meanwhile, there's also a space of degrees of belief for the result of the next throw, again represented by the simplex. But here we're going to end up at one of the vertices when we find out the result. Information such as that the mean of the die is 5.9 has us scuttling over to the proximity of the 6 vertex. Movement in the first space has been mirrored in the second. But then we may hear as in case B that a 1 or 5 is thrown and have to project ourselves onto the line joining the 1 and 5 vertices, which we didn't do in the first space.
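To make that concrete, here is a small sketch (my own construction, collapsing the density over the simplex to a single representative point): take as long-run beliefs the I-projection of the uniform die onto the mean-5.9 hyperplane, an exponential tilt, and then condition the next-throw beliefs on "it was a 1 or a 5" without touching the long-run beliefs.

```python
import numpy as np
from scipy.optimize import brentq

faces = np.arange(1, 7)

def tilt(lam):
    """Exponential tilt of the uniform die: p_i proportional to exp(lam * i)."""
    w = np.exp(lam * faces)
    return w / w.sum()

# I-projection of the uniform distribution onto the mean-5.9 hyperplane.
lam = brentq(lambda l: tilt(l) @ faces - 5.9, 0.0, 20.0)
long_run = tilt(lam)
print(np.round(long_run, 4))         # p_6 comes out above 0.9, as it must

# Case B: the next throw is a 1 or a 5. Project the next-throw beliefs onto
# the edge joining the 1 and 5 vertices; the long-run beliefs stay put.
next_throw = np.zeros(6)
next_throw[[0, 4]] = long_run[[0, 4]]
next_throw /= next_throw.sum()
print(np.round(next_throw, 4))       # nearly all the weight is on 5
```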
Interesting discussion, especially the "projection" interpretation of inference!
John-> You introduce the idea of non-commutativity, but I think this issue doesn't arise in the current problem. Coming back to the original puzzle, the first method (conditionalization) just projects the uniform prior onto the space of "odd-only" distributions (the resulting mean is 3). There is just one projection.
The second method performs two (commuting) projections, one onto the "E = 3.5" space, the other onto the "odd-only" space. The two results differ because they don't satisfy the same constraints.
Moreover, Bayesian conditioning is also commutative (P(X|A,B) = P(X|B,A)). There is no data-order sensitivity in Bayes' rule (unless explicitly modelled). But I'm sure you know this, so I probably misunderstood your views.
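A small check of this with the die (a sketch):

```python
import numpy as np

p = np.full(6, 1 / 6)
A = np.array([1, 0, 1, 0, 1, 0], dtype=float)   # "X is odd"
B = np.array([1, 1, 1, 0, 0, 0], dtype=float)   # "X <= 3"

cond = lambda p, E: (p * E) / (p * E).sum()     # conditioning on event E
print(cond(cond(p, A), B))                       # [0.5 0.  0.5 0.  0.  0. ]
print(cond(cond(p, B), A))                       # identical: order doesn't matter
```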
So the source of the "paradox" seems to me to be a bad interpretation of the conditioning result. We naively think that, because the uniform prior satisfies E = 3.5, the posterior still satisfies this constraint, and thus must match the "two constraints in one shot" result. This is the same explanation as the one at hunch.net.
David-> I'm still confused about the [B] issue, mainly for two different reasons:
1) You seem to consider that MaxEnt gives a distribution over distributions, do you? Is *the* MaxEnt distribution a kind of MAP? Is a distribution's entropy linked to its probability in the space of distributions?
2) In the die example, what does the mean mean? If the die faces were not labelled with integers, but with colours or, more generally, elements of a set without addition and division, what would be the definition of expectation, if any?
With this question, I wonder if the paradox could come from the fact that E=5.9 makes no sense for integers. Could we build a related puzzle on the real line?
Cheers,
Pierre
Let's begin with 2). What does "the expectation of the throws of the die = 5.9" mean? We're imagining we're in a situation in which our degree of belief in any given sequence equals that for the same sequence permuted, i.e., an exchangeable scenario. According to de Finetti this can be represented as a distribution over possible values of the p_i. After thousands of throws, as long as we started with a broad enough distribution, we should find our posterior distribution tending to concentrate around a particular set of values for p_1, ..., p_6.
The information that the expectation = 5.9 is telling us that if we were to see enough throws then these limiting values would be such that sum_i i.p_i = 5.9. From this we must have p_6 at least 0.9, since the mean is at most 5(1 - p_6) + 6p_6 = 5 + p_6. Anyway, we are forced to adjust our degrees of belief so that all of the distribution over the set of p_i lies on the hyperplane of distributions with that mean.
Of course, we should always reserve some of our prior so that we can ditch exchangeability in case we see, say, 1000 6s followed by 1000 1s.
Now, if asked for the probability that a 1 will show on the next throw, we'll integrate the first co-ordinate over our current distribution of the p_i. On the other hand, in case B, where we're also told that the next throw is a 1 or a 5, we shouldn't use this information to adjust our degrees of belief in the (long-term) p_i, but clearly we should use it to form our degrees of belief in the outcome of the next throw, so that the probabilities we give to 1 and 5 now sum to one.
Being told that the die can only ever show 1 or 5 is different information and does apply to the long run degrees of belief. Only in this case it's inconsistent with the expectation information.
Now for 1). I haven't been talking about MaxEnt yet. A Bayesian can happily introduce more and more layers of parameters to produce a sufficiently finessed representation of their degrees of belief. In the case of exchangeable degrees of belief, however, we know we only have to go as far as a distribution over the simplex I mentioned. This in turn could be seen as a point in the infinite-dimensional space of all such distributions.
Hmm, this MaxEnt needs a little thinking about. We have three spaces on the go, and relations between them. We can find the centre of mass of the distribution over the simplex, which gives us a single point in the simplex, and then think about how this moves. Then where we are in this simplex will dictate our degrees of belief for the next throw, so long as we haven't had any extra short-term information.