Conditionalization as I-projection
“In essence, the risk management approach to monetary policy-making is an application of Bayesian decision-making.” (p. 37)
“Our problem is not, as is sometimes alleged, the complexity of our policy-making process, but the far greater complexity of a world economy whose underlying linkages appear to be continuously evolving. Our response to that continuous evolution has been disciplined by the Bayesian type decision-making in which we have engaged.” (p. 39)
Alan Greenspan, “Risk and Uncertainty in Monetary Policy,” American Economic Review 94(2), May 2004, pp. 33-40.
Now it's the turn of the US Food and Drug Administration to come out in favour of Bayesianism.
Thanks to Yet Another Machine Learning Blog for this. An earlier post - Maximum entropy and bayesian updating - on this interesting blog presents the following example from Kass of a possible clash between maximising entropy and conditionalization:
Consider a die (6 sides), and suppose the prior knowledge is E[X] = 3.5.
Maximum entropy leads to P(X)= (1/6, 1/6, 1/6, 1/6, 1/6, 1/6).
Now consider a new piece of evidence, A = "X is an odd number".
The Bayesian posterior is P(X|A) ∝ P(A|X) P(X), which gives (1/3, 0, 1/3, 0, 1/3, 0).
But MaxEnt with the constraints E[X] = 3.5 and E[indicator of A] = 1 leads to approximately (0.22, 0, 0.32, 0, 0.47, 0)!! (Note that E[indicator of A] = P(A).)
Indeed, for MaxEnt, because '6' is no longer available, the larger odd numbers must become more probable to keep the average at 3.5. Under Bayesian updating, P(X|A) need not have expectation 3.5: P(X) and P(X|A) are simply different distributions. Conclusion? MaxEnt and Bayesian updating are two different principles leading to different belief distributions. Am I right?
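A minimal numerical sketch of the two calculations, assuming NumPy (helper names such as mean_for are illustrative, not from the original post): it conditionalizes the uniform prior on A, then finds the MaxEnt distribution under both constraints by bisecting on the exponential-family tilt.

```python
# A minimal sketch, assuming NumPy; names like mean_for are illustrative.
import numpy as np

faces = np.array([1, 2, 3, 4, 5, 6])
prior = np.full(6, 1 / 6)          # MaxEnt given only E[X] = 3.5
odd = faces % 2 == 1               # the event A = "X is an odd number"

# Bayesian conditionalization on A: zero out the even faces and renormalize.
posterior = prior * odd
posterior /= posterior.sum()
print("Bayes posterior:", posterior.round(3))   # (1/3, 0, 1/3, 0, 1/3, 0)

# MaxEnt with both constraints: support {1, 3, 5}, mean 3.5. The solution is
# exponential-family, p_i proportional to exp(lam * x_i); find lam by
# bisection, since the tilted mean is increasing in lam.
support = faces[odd]

def mean_for(lam):
    w = np.exp(lam * support)
    return (support * w).sum() / w.sum()

lo, hi = -10.0, 10.0
for _ in range(100):
    mid = (lo + hi) / 2
    if mean_for(mid) < 3.5:
        lo = mid
    else:
        hi = mid

w = np.exp(lo * support)
maxent = np.zeros(6)
maxent[odd] = w / w.sum()
print("MaxEnt:", maxent.round(3))   # roughly (0.22, 0, 0.32, 0, 0.47, 0)
```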
Example 3 on p. 4 of Information topologies with applications by Peter Harremoës provides the answer here. Passing from a distribution P(X) to P(X|A) is just one simple instance of a general process of projection from a point to a subspace of a space of distributions. Let P(X) be a distribution and A an event with P(A) > 0. Let C(A) be the set of distributions Q with Q(A) = 1. Then P(·|A) is the closest element of C(A) to P in the sense of Kullback-Leibler distance (relative entropy). Updating in this way is also a robust Bayes act.
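A quick numerical illustration of that claim, again assuming NumPy (the prior below is an arbitrary choice for the sake of the example, not from Harremoës's paper): no distribution supported on A gets closer to P, in the Kullback-Leibler sense, than P(·|A), whose divergence from P is −log P(A).

```python
# A numerical spot-check, assuming NumPy; the prior P is an arbitrary
# illustrative choice.
import numpy as np

rng = np.random.default_rng(0)
P = np.array([0.1, 0.25, 0.05, 0.3, 0.2, 0.1])          # some prior on faces 1..6
A = np.array([True, False, True, False, True, False])   # the event "X is odd"

def kl(q, p):
    """Relative entropy D(q || p), with the convention 0 log 0 = 0."""
    mask = q > 0
    return float(np.sum(q[mask] * np.log(q[mask] / p[mask])))

# The conditional distribution P(.|A); its divergence from P equals -log P(A).
P_given_A = np.where(A, P, 0.0)
P_given_A /= P_given_A.sum()
print("KL(P(.|A) || P) =", kl(P_given_A, P), "  -log P(A) =", -np.log(P[A].sum()))

# Random members of C(A) never beat the conditional distribution.
best = np.inf
for _ in range(10_000):
    q = np.zeros(6)
    q[A] = rng.dirichlet(np.ones(A.sum()))
    best = min(best, kl(q, P))
print("smallest KL over random Q in C(A):", best)        # stays above -log P(A)
```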
More technically, C(A) is 'm-flat' in the sense of Amari, i.e., if Q and R are in C(A) then so is bQ + (1 − b)R for any b in [0, 1]. The projection of P onto C(A) along the dual e-connection is P(·|A). Forming the conditional distribution is thus but one small example of Csiszár's I-projection, which may also use divergences other than the Kullback-Leibler.
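For the record, a short sketch (in standard notation, not taken from Harremoës's paper verbatim) of why the projection lands on P(·|A): for any Q in C(A),

$$
D(Q \,\|\, P) \;=\; \sum_{x \in A} Q(x)\,\log\frac{Q(x)}{P(x)}
\;=\; D\bigl(Q \,\|\, P(\cdot\mid A)\bigr) \;+\; \log\frac{1}{P(A)},
$$

since P(x|A) = P(x)/P(A) for x in A. The second term does not depend on Q, so the divergence is minimised exactly at Q = P(·|A), with minimum value −log P(A). Equivalently, D(Q‖P) = D(Q‖P(·|A)) + D(P(·|A)‖P) for all Q in C(A), the Pythagorean identity one expects for projection onto an m-flat set.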
Returning to Kass's example, the MaxEnt formulation projects onto the manifold of distributions satisfying both constraints, rather than just the single constraint Q(A) = 1 used by conditionalization.