## Learning Bayes [part 2]

In part 1, I talked about the Bayesian way of dealing with, well, noise, in a certain sense. How do I figure out that I “should not” conditionalise on a person’s astrological sign when predicting the cost of the bridge they’ll build, but that I “should” conditionalise on the bridge’s material without arbitrarily choosing the former to have zero influence and the latter to have “some” influence. This was brought up because a friend of mine was talking about stuff to me. And stuff.

And then that friend commented on the same post explaining that that did not quite get to the heart of what he was looking for. The best way I could find to phrase that was one of differently-parametrised Bayesian model selection. And like I said, I have no training in statistics, so I talked to my friend raginrayguns about it, and after a long time during which we both discussed this and talked and had loads of fun, we (he) sort of reached a conclusion.

Maybe.

I mean, there’s a high probability that we (he) reached a conclusion.

So, suppose we have two possible models, $M_1(\textbf a)$ and $M_2(\textbf a, \textbf b)$ that could explain the data, and these models have a different number of parameters. $\textbf a$ is a vector of the parameters both models have in common, and $\textbf b$ is the vector of the parameters that are present only in the second model.

My friend’s example is the following: he has a scatter plot of some data showing an apparently linear relationship between two variables: Y = α + βX + ε where I suppose ε is normally distributed. Upon closer inspection, however, it looks like there are actually two lines instead of only one! We had two interns, each of whom collected 50 of the samples.

So the common parameters are $\textbf a = (\alpha, \beta)^T$, and the parameters only the second model has are $\textbf b = (\lambda_{\alpha 1}, \lambda_{\beta 1}, \lambda_{\alpha 2}, \lambda_{\beta 2})^T$ which we’ll call the intern effect. In that case, then, the αs and βs of the second model are going to be seen as the same α and β from the first plus this intern effect.

So, nothing changes. To figure out the posterior probability of the parameters, we’d just use Bayes’ Theorem; same goes for the posterior of the models. But the old frequentist problem of “models with more parameters always have better fit” still remains. How do we get rid of it?

The trick is not judging a model based on its best set of parameters, but rather averaging over all of them. Let’s try this. Suppose the data is represented by $d$. Then we want the posteriors $p(M_1|d, X)$ and $p(M_2|d, X)$. Or maybe we just want the posterior odds for them. Whichever may be the case, we have:

$p(M_1|d, X) = \frac{p(d|M_1, X)}{P(d|X)}p(M_1|X)$

And then we can find the probability of the data given a model using the law of total probability:

$p(d|M_1, X) = \int p(d|\textbf a, M_1, X)p(\textbf a|M_1, X)d\textbf a$

And of course, the same applies for Model 2:

$p(d|M_2, X) = \int\int p(d|\textbf a, \textbf b, M_2, X)p(\textbf a|\textbf b, M_2, X)p(\textbf b|M_2, X)d\textbf ad\textbf b$

And in these, $p(d|\textbf a, M_1, X)$ and $p(d|\textbf a, \textbf b, M_2, X)$ are just the likelihood functions of traditional statistics. Then, the posterior odds – which are in general much more useful since they just define how much the evidence supports a hypothesis when compared to another instead of in absolute terms – are given by:

$O(M_1:M_2|D, X) = \mathcal{LR}(M_1:M_2;D)O(M_1:M_2|X)$

Where the likelihood ratio is just:

$\mathcal{LR}(M_1:M_2;D) = \frac{\int P(D|\textbf a, M_1, X)p(\textbf a|M_1, X)d\textbf a}{\int\int P(D|\textbf a, \textbf b, M_2, X)p(\textbf a|\textbf b, M_2, X)p(\textbf b|M_2, X)d\textbf ad\textbf b}$

Here I’m using capitals for the probability because I’m no longer talking about the specific sampling distribution but rather its value on the observed data $D$.

And there’s the thing that’s used here and that never ever shows up in frequentist statistics, as usual, which is the prior distribution for… well, everything. We have the prior odds ratio between the two models, and if we’re just interested in which of the two models the data supports better, we still need the priors for the parameters themselves. And this method is, of course, more general than just two models with a core of the same parameters and where one of the models has more parameters than the other; they can have two complete different sets of parameters, and you just do that.

What would a prior for the parameters look like? Of course, it depends on the case. One would expect, perhaps, that in the linear case described by my friend, they’d possibly be normally distributed with mean $0$ (no idea which variance), or something. Usually at this point we just throw our arms up and do a non-Bayesian thing: just choose something convenient. Note how I said that this is non-Bayesian. Picking and choosing priors that seem “reasonable” isn’t justifiable in the ideal case, because then the prior would be uniquely determined, but it’s the best we can do in practice, lacking perfect insight into our own states of knowledge.

Alright. So now we have the posterior probability – or at least posterior odds – for the models. So do we just pick the one with the highest posterior and go with it?

Not quite. There are a few problems with this approach. First and foremost, this is a Decision Theory problem, and unless you care only about strict accuracy, there might be value in using a model you know to be wrong because it’s not too wrong and you can still draw useful inferences from it.

For example, suppose that instead of having that the difference between the two possible lines is because of different interns, it happens because a certain effect affects two populations differently. That would mean that, in order to estimate the effect in either population, you would have access to much less data than if you bundled up both populations and pretended they were under the exact same effect. And while this might sound dishonest, the difference between the two models might be small enough that the utility of having on average twice as many points of data is larger than the disutility of using a slightly incorrect model. But of course, there is no hard-and-fast rule about how to choose, or none that’s a consensus anyway, and this is sort of an intuitive choice to balance the tradeoff between fit and amount of data.

Another problem is that, even if you’re willing to actually just pick a model and run with it, maybe there is no preferred model. Maybe the posterior odds of these two models is one, or close enough. What do you do, then? Toss a coin?

No. Bayesian statistics has a thing called a mixed model. We want to make inference, right? So that basically means getting a distribution for future data based on past data: $p(d_f|D_p, X)$. We can, once more, use the law of total probability:

\begin{aligned}p(d_f|D_p, X) &= p(d_f|M_1, D_p, X)P(M_1|D_p, X) \\ &+ p(d_f|M_2, D_p, X)P(M_2| D_p, X) \\ &+p(d_f|\bar{M_1},\bar{M_2}, D_p, X)P(\bar{M_1}, \bar{M_2}|D_p, X) \end{aligned}

If we’re fairly confident that either model 1 or 2 are the appropriate explanations for this data, i.e. $P(\bar{M_1}, \bar{M_2}|D_p,X)\approx 0$,, then we can use only the two first lines. Even if we don’t, we can still have a good approximation. So we predict future data by taking a weighted average between the predictions of each model where the weight is the posterior probability of those models.