## Learning Bayes [part 3]

I was talking to a friend (the same friend who inspired the two previous posts), who was talking to a friend of ours about a thing, and there’s a context but it doesn’t matter to what I want to write here.

Suppose there is some quantity, I’ll call it $\nu$, that I don’t know. Now, some people have estimated it, and given me point estimates $\hat\nu_1$, $\hat\nu_2$, etc, so I have a vector $\boldsymbol {\hat\nu} = (\hat\nu_1, ..., \hat\nu_N)^T$ of estimates.

One possible way to get a posterior estimate for what the true value is, in a sort of Bayesian Model Averaging way, is by having a vector $\boldsymbol p$ of confidences in each of those estimates and then having that $\hat\nu = \boldsymbol p ^ T \boldsymbol {\hat\nu}$, which is a weighted average of the estimates. However, in the context of the question, the standard practice seems to be using the median instead of the average, because then we get rid of outliers.

This seems, at first glance, unjustified. Surely there’s some way to use the estimates themselves to determine that an estimate is problematic? Well, suppose $\boldsymbol{\hat\nu} = (0.9, 0.65, 0.8, 38)^T$ are the four different estimates. When you look at that, it seems obvious that the fourth person screwed up. Yet, how can you really tell, if all the information about $\nu$ you have to go on are those estimates? What if the first three are the wrong ones? The median may be theoretically inadequate, but how do you reconcile that with the fact that it doesn’t let weird estimates screw your expected value up too much? Even if, a priori, you trusted each of those four banks equally, after seeing that last value it seems very likely that it’s horribly wrong.

My friend suggested a couple of ways of dealing with that, measures that’d be between the average and the median. One suggestion was taking an average between the average and the median of the estimates. The other was a bit more complicated, and it went thusly:

Let $Z(\hat\nu_i|\boldsymbol{\hat\nu})$ equal the number of standard deviations $\hat\nu_i$ is above or below the average $\bar\nu$. Then he suggests that the components of the vector $\boldsymbol p$ should be given by:

$p_i(\boldsymbol{\hat\nu}) = \frac{e^{-|Z(\hat\nu_i | \boldsymbol {\hat \nu})|}}{\sum_ke^{-|Z(\hat\nu_k | \boldsymbol {\hat \nu})|}}$

This actually produces interesting results, corresponding to intuition! And I want to thwack him upside his frequentist head for even thinking of creating an operational tool like the ones he did before trying to derive stuff from first principles, but he ended the email he sent me with the questions, “What do you think? Are there established methods to update your beliefs in each of the models of a set, conditional on the predictions of all of them?” so at least his second instinct of trying to figure out if there exists another way is good (when you read this, I still love you ♥).

Let’s first suppose each estimate was generated by a normal distribution: $\hat\nu_i \sim \mathcal N(\nu, \tau_i^{-1})$ (where the parametre $\tau_i = 1/\sigma_i^2$ is known as the precision of the distribution). We’ll call this model $G(\boldsymbol\tau)$ (Gaussian model of precision vector $\boldsymbol \tau$) and condition on it in our distributions. The likelihood function of the vector of observations $\boldsymbol{\hat\nu}$ is:

$p(\boldsymbol{\hat\nu}|\nu, G(\boldsymbol\tau)) = \prod_i\mathcal N(\hat\nu_i | \nu, \tau_i^{-1})$

(I’m not conditioning on the $X$ symbol for our prior information for ease of notation, but let’s not forget it’s always there.)

The Maximum Likelihood Estimate for $\nu$, i.e. the value of $\nu$ that makes the above function maximal, is given by $\bar\nu' = \sum_i\tau_i\hat\nu_i / \sum_i\tau_i$ (where the prime is to indicate that this is a weighted average and not the regular average). Now, a Bayesian needs to always take the prior probability for stuff into account. What we really want is:

$p(\nu|\boldsymbol{\hat\nu}, G(\boldsymbol\tau)) = \frac{p(\boldsymbol{\hat\nu} | \nu, G(\boldsymbol\tau)) p(\nu)}{\int_{-\infty}^{+\infty} p(\boldsymbol{\hat\nu} | \nu, G(\boldsymbol\tau)) p(\nu) d\nu}$

Where we used the fact that $G(\boldsymbol\tau)$ is a hypothesis about the form of the likelihood of the vector $\boldsymbol{\hat\nu}$ and is therefore independent of $\nu$.

A Bayesian, however, reduces to a frequentist once their prior knowledge becomes effectively zero. The conjugate prior for the mean of a normal distribution is itself normal, $\nu \sim \mathcal N(\mu, \tau_0^{-1})$, but in the limit of zero precision, $\lim\limits_{\tau_0 \rightarrow 0} \mathcal N(\mu, \tau_0^{-1}) = const.$ is an improper constant prior over the reals (I’m coming around to Jaynes’ view that improper-but-result-of-well-defined-limit priors are ok). If we use this ignorance prior, our Maximum A Posteriori value for $\nu$ becomes $\bar\nu'$ as well.

So, in this case, our hypothesis $G(\boldsymbol\tau)$ gives us our confidence vector $\boldsymbol p = \boldsymbol{\tau} / \text{Tr}(\textbf I \boldsymbol { \tau})$ and we have $\hat\nu = \boldsymbol p^T \boldsymbol{\hat\nu}$.

Alright, so, $G(\boldsymbol\tau)$ generates exactly the behaviour we were trying to escape from, that just averaging based on our confidence in each result would screw us over with outliers, which is a known problem of Gaussians because of their light tails. But what if, instead of having point confidences in each estimate, we had a distribution for our confidence, that we updated upon seeing the estimates?

The conjugate prior for the precision of a normal distribution is something called a Gamma distribution:

$Gam(x|\alpha, \beta) = \frac{\beta^\alpha}{\Gamma(\alpha)} x^{\alpha - 1} e^{-\beta x}$

where $\Gamma(\cdot)$ is the Gamma Function, and thus it’s normalisable for $\alpha > 0$ and is finite everywhere for $\alpha \geq 1$. If the precision is distributed according to that, this is equivalent to having observed $\varphi = 2\alpha$ “virtual” or “effective” prior data points with sample precision $\lambda = \alpha / \beta$.

We can encode our prior confidence in each estimate with this function, using the hypothesis $G(\boldsymbol\varphi, \boldsymbol\lambda)$, and then the likelihood function for each estimate is:

$p(\hat\nu_i | \nu, G(\boldsymbol\varphi, \boldsymbol\lambda)) = \int_0^{+\infty} p(\hat\nu_i | \nu, \tau_i, G(\boldsymbol\varphi, \boldsymbol\lambda)) p(\tau_i | G(\boldsymbol\varphi, \boldsymbol\lambda)) d\tau_i$

$p(\hat\nu_i | \nu, \tau_i, G(\boldsymbol\varphi, \boldsymbol\lambda)) = \mathcal N(\nu | \hat\nu_i, \tau_i^{-1})$

$p(\tau_i | G(\boldsymbol\varphi, \boldsymbol\lambda)) = Gam(\tau_i |\frac{ \varphi_i } 2, \frac{ \varphi_i }{ 2\lambda_i})$

$p(\hat\nu_i | \nu, G(\boldsymbol\varphi, \boldsymbol\lambda)) = \int_0^{+\infty} \mathcal N(\nu | \hat\nu_i, \tau_i^{-1}) Gam(\tau_i | \frac{\varphi_i } 2, \frac{\varphi_i }{ 2\lambda_i}) d\tau_i$

That thing inside the integral is known as the Normal-Gamma distribution, frequently used in Bayesian statistics when both the mean and the precision of a Gaussian are unknown. Evaluating the integral gives us what’s known as the Student’s t-distribution:

$St(x|\mu, \lambda, \varphi) = \frac {\Gamma((\varphi +1) /2)} {\Gamma(\varphi /2)} \sqrt{\frac {\lambda} {\pi \varphi}} [1 + \frac {\lambda (x - \mu)^2} {\varphi}] ^{-(\varphi + 1) /2}$

(Normally the t-distribution is presented with $\mu=0$ and $\lambda = 1$, but this alternative parametrisation is more general and the result of our derivation.)

In the case of our Normal-Gamma distribution, $\mu = \hat\nu_i$, so:

$p(\hat\nu_i | \nu, G(\boldsymbol\varphi, \boldsymbol\lambda)) = St(\nu | \hat\nu_i, \lambda_i, \varphi_i)$

The Student’s t-distribution, then, can be seen as a sum of an infinity of Gaussians with each possible precision, where the precisions are weighted by a Gamma distribution. Taking the limit of $\varphi_i \rightarrow \infty$ while keeping $\lambda_i$ constant turns the t-distribution into a Gaussian with precision $\lambda_i$, reducing us to the case we just discussed, which intuitively makes sense: if we see an infinity of samples with a given precision, we will believe with infinite confidence that the precision of our Gaussian is exactly that.

How does sequential learning happen with a t-distribution, then?

$p(\boldsymbol{\hat\nu} | \nu, G(\boldsymbol\varphi, \boldsymbol \lambda)) = \prod_i St(\nu | \hat\nu_i, \lambda_i, \varphi_i)$

The Student’s t is the inverse of a polynomial (where the powers aren’t necessarily integers), so the likelihood function might have several local maxima. This puts us in a bit of a pickle. However, it does appear clear that we’re no longer in a spot of just multiplying a confidence vector by the estimates and getting the weighted average directly. Stuff’s more… complicated, now.

There exist, however, methods of finding the MLE using EM algorithms. Google gives me a few interesting results and hopefully when I get to Gelman v2b I’ll have more tools to deal with it. The algorithm itself seems straightforward when $\forall i, j. \varphi_i = \varphi_j \land \lambda_i = \lambda_j$, the case where our prior confidence in each estimate is the same. And the literature seems to agree that using the Student’s t is much more robust to outliers than using straight-up Gaussians, so yay!

(Maybe we shouldn’t use the MLE but rather the expected value $\mathbb E(\nu|\boldsymbol{\hat\nu}, G(\boldsymbol\varphi, \boldsymbol \lambda))$ instead? But in the end that just makes the numerator of the thing a linear function of $\nu$, so it’s not like the technique will change much.)

It’s good to remember what we’re talking about, here. The estimates $\hat\nu_i$ aren’t really samples of random processes in the frequentist sense, or not necessarily. In the context of the conversation that sparked this post, they definitely aren’t. We’re using the distributions here just to describe our states-of-knowledge, and they encode assumptions about the data.

One such assumption, for example, is unimodality. Neither normal nor Student’s t-distributions can capture multimodal distributions. It’s why I’m vehement that, even if we don’t include the $X$ in our notation to indicate our global prior knowledge, we do include $G(\boldsymbol \varphi, \boldsymbol\lambda)$ to indicate that we’re assuming the Student’s t model with the above considerations.

My friend wanted a way to operationalise that. I hope this theoretical discussion helps him, I’m not yet good enough with statistical tools and languages on a practical level to do that for him. If the above is not enough for him, I’m sure he’ll tell me so.

(Though when we’re maximising a function like the Student’s t, the median doesn’t sound too awful. And if we use a Laplace distribution instead, the median is in fact theoretically sound. I just haven’t yet tried to think about what qualitative assumptions it hides.)

(Continues here.)