### Ruminating

If some blog posts record the results of the author's digesting some body of thought, what follows are, at best, some half-chewed reflections on my latest wanderings in machine learning.

First, something which seems inescapable if you're looking to impose a geometry on a statistical manifold is the Fisher information metric. It appears that a good justification for this was given by Censov in 1982: apparently, it is the only Riemannian metric (up to a constant factor) invariant under congruent embeddings by a Markov morphism. What this amounts to is requiring that re-partitioning an event space have a sensible effect on a probability distribution. I found this out from Guy Lebanon's very interesting thesis, where he extends the result to conditional spaces (chapter 6). These are useful for modelling the conditional distribution of output data given input data, rather than the joint distribution of the two. Campbell had already extended Censov's results to non-normalized positive measures, dropping the category-theoretic apparatus on the way. (It's never too late to reintroduce it.)
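To make the invariance a little more concrete, here is a minimal sketch of the simplest case I can think of: a finite event space where each outcome is split into sub-outcomes with fixed proportions. Treating this splitting as a stand-in for a congruent embedding is my own simplification, and the names `fisher_inner` and `markov_embed` are made up for illustration. The Fisher inner product of two tangent vectors survives the refinement, while the plain Euclidean one does not.

```python
def fisher_inner(p, u, v):
    """Fisher inner product of tangent vectors u, v at the point p of the simplex."""
    return sum(ui * vi / pi for pi, ui, vi in zip(p, u, v))


def euclid_inner(u, v):
    """Ordinary Euclidean inner product, for contrast."""
    return sum(ui * vi for ui, vi in zip(u, v))


def markov_embed(x, splits):
    """Split outcome i into sub-outcomes with fixed proportions splits[i].

    The map is linear, so it applies to points and tangent vectors alike.
    """
    return [xi * q for xi, qs in zip(x, splits) for q in qs]


# A point on the 3-outcome simplex and two tangent vectors (each sums to zero).
p = [0.5, 0.3, 0.2]
u = [0.1, -0.04, -0.06]
v = [-0.02, 0.05, -0.03]

# Hypothetical refinement: 3 outcomes become 2 + 1 + 3 = 6 outcomes.
splits = [[0.25, 0.75], [1.0], [0.2, 0.3, 0.5]]

before = fisher_inner(p, u, v)
after = fisher_inner(
    markov_embed(p, splits), markov_embed(u, splits), markov_embed(v, splits)
)

euclid_before = euclid_inner(u, v)
euclid_after = euclid_inner(markov_embed(u, splits), markov_embed(v, splits))

print("Fisher:   ", before, after)                # agree up to rounding
print("Euclidean:", euclid_before, euclid_after)  # differ
```

The reason is visible in the algebra: each proportion appears twice in the numerator and once in the denominator, and the proportions for a given outcome sum to one.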

Now the distances that fit neatly with the Fisher information metric are the δ-divergences (p. 5 of this), which include the Kullback-Leibler and reverse Kullback-Leibler divergences. This opens up the glorious world of information geometry (see this list), convex optimization, Legendre transforms between δ-coordinates and (1 - δ)-coordinates, etc. The Zhu and Rohwer articles argue for the advantages of working within the space of all positive measures, which is δ-flat for all δ (i.e., the Christoffel symbols vanish in suitable coordinates), rather than the space of normalized probability distributions.
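For the record, here is a small sketch of the δ-divergence on a finite set, in the form I remember from Zhu and Rohwer; treat the exact normalization as my recollection rather than gospel. The function names are mine. It works for positive measures that need not sum to one, and the two Kullback-Leibler divergences appear as the limits δ → 1 and δ → 0.

```python
import math


def delta_divergence(p, q, delta):
    """D_delta(p, q) for positive measures on a finite set, with 0 < delta < 1."""
    total = sum(
        delta * pi + (1 - delta) * qi - pi**delta * qi ** (1 - delta)
        for pi, qi in zip(p, q)
    )
    return total / (delta * (1 - delta))


def generalized_kl(p, q):
    """Kullback-Leibler divergence extended to unnormalized positive measures."""
    return sum(qi - pi + pi * math.log(pi / qi) for pi, qi in zip(p, q))


# Two positive measures, deliberately not normalized.
p = [0.5, 1.2, 0.3]
q = [0.4, 0.9, 0.8]

eps = 1e-6
near_one = delta_divergence(p, q, 1 - eps)   # approaches KL(p, q)
near_zero = delta_divergence(p, q, eps)      # approaches KL(q, p)
half = delta_divergence(p, q, 0.5)           # a Hellinger-type quantity

print(near_one, generalized_kl(p, q))
print(near_zero, generalized_kl(q, p))
print(half)
```

Swapping the arguments while replacing δ by 1 - δ leaves the value unchanged, which is the duality that the δ-coordinates and (1 - δ)-coordinates reflect.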

All is going swimmingly, except that with some spaces of models you're interested in, like multi-layered neural nets and other graphical models, there's no one-to-one mapping between the model parameters and the space of distributions, which messes up the geometry in parameter space. Now, there was a trend to move away from neural nets, but they have never quite disappeared. Some, like Geoffrey Hinton, still hope that we can learn something about the brain from studying plausible neural net algorithms (see *What kind of a Graphical Model is the Brain?*), perhaps discovering some conceptual representations in the higher layers of a trained net.
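A toy illustration of the failure of one-to-one-ness, assuming a one-hidden-layer tanh network with scalar input and output (my choice of example, not anything taken from the papers mentioned). Distinct parameter vectors give the same function, and hence the same conditional distribution once a noise model is attached. Worse, when an output weight is zero, a whole line of parameter values collapses to a single function.

```python
import math


def net(x, w_in, w_out):
    """One-hidden-layer tanh network: sum_k w_out[k] * tanh(w_in[k] * x)."""
    return sum(a * math.tanh(w * x) for w, a in zip(w_in, w_out))


xs = [-2.0, -0.5, 0.0, 0.7, 3.0]

theta = ([0.8, -1.5], [2.0, 0.3])
swapped = ([-1.5, 0.8], [0.3, 2.0])     # hidden units permuted
flipped = ([-0.8, -1.5], [-2.0, 0.3])   # tanh is odd: flip both signs on unit 0

# With a zero output weight, the matching input weight is irrelevant.
dead_a = ([0.8, 5.0], [2.0, 0.0])
dead_b = ([0.8, -9.0], [2.0, 0.0])


def same_function(t1, t2):
    """Compare two parameter settings on the sample inputs."""
    return all(abs(net(x, *t1) - net(x, *t2)) < 1e-12 for x in xs)


print(same_function(theta, swapped))
print(same_function(theta, flipped))
print(same_function(dead_a, dead_b))
```

The first two cases are discrete symmetries; the third is the kind of degenerate locus where the Fisher information matrix loses rank.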

This runs against the idea that we'd be better off simplifying our task by producing a machine which can merely discriminate between inputs, such as images of 4s and images of 8s, rather than a model which aims to **generate** the data. But Hinton claims to be able to produce more accurate generative models than the best discriminative classifiers.

A second trend, especially if you were a Bayesian neural net person, was to notice that in some kind of limit of the number of hidden nodes in a layer, what emerged was a Gaussian process. (For the life of me I can't see why information geometry hasn't invaded Gaussian process theory.)

Perhaps, then, layered models are worth sticking with. So is there anything we can do with the non-smooth mapping between parameter space and distribution space? Yes, we turn to algebraic geometry. First, we can follow Watanabe and use Hironaka's resolution of singularities. Second, we follow Pachter and Sturmfels, and say that:

(a) Statistical models are algebraic varieties.

(b) Every algebraic variety can be tropicalized.

(c) Tropicalized statistical models are fundamental for parametric inference.

An easy example of (a), concerning a distribution of two binary variables, expresses the independence of these variables as requiring the distribution to satisfy an equation in $\mathbb{R}^{4}$, namely, $p_{00} p_{11} - p_{01} p_{10} = 0$. But what are the *tropics* doing here? Well, tropical maths is what John Baez and I were discussing here, and Sturmfels has a gentle introduction here. I have a sneaking feeling it would be worth trying to understand whether the tropical/ordinary = Legendre/Laplace transform analogy has anything to do with the appearance of the Legendre transform earlier.

Well, I did say it was half-chewed.
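As a postscript, the independence example for two binary variables is easy to check numerically. This is only a sketch, with function names of my own invention. The residual p00·p11 - p01·p10 vanishes for a product distribution and not for a correlated one. In -log coordinates the same constraint becomes linear, which at least has the flavour of trading multiplication for addition; I make no claim that this is the tropicalization Pachter and Sturmfels have in mind.

```python
import math


def product_distribution(a, b):
    """Joint law of independent binary X, Y with P(X=1) = a and P(Y=1) = b."""
    px = [1 - a, a]
    py = [1 - b, b]
    return {(i, j): px[i] * py[j] for i in (0, 1) for j in (0, 1)}


def independence_residual(p):
    """The quadric p00*p11 - p01*p10; zero exactly on the independence model."""
    return p[0, 0] * p[1, 1] - p[0, 1] * p[1, 0]


def log_residual(p):
    """The same constraint in u = -log p coordinates, where it is linear."""
    u = {k: -math.log(val) for k, val in p.items()}
    return u[0, 0] + u[1, 1] - u[0, 1] - u[1, 0]


indep = product_distribution(0.3, 0.6)
dep = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}  # correlated

print(independence_residual(indep), log_residual(indep))  # both about 0
print(independence_residual(dep), log_residual(dep))      # both nonzero
```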
