## Learning Bayes [part 3.5]

In part 3, I discussed the problem of deriving a posterior point estimate of a number $\nu$ from a series of point estimates $\boldsymbol{\hat\nu}$ in a way that’s more “theoretically valid” than taking the median, which is the standard in the domain that inspired the post in the first place. I arrived at a likelihood function like:

$p(\boldsymbol{\hat\nu} | \nu, G(\boldsymbol\varphi, \boldsymbol \lambda)) = \prod_i St(\nu | \hat\nu_i, \lambda_i, \varphi_i)$

So I decided to look at what that looks like with the vector $\boldsymbol{\hat\nu} = (0.9, 0.65, 0.8, 38)^T$ and various values of $\lambda$ and $\varphi$ (for now, the hyperparameters will be the same for every individual estimate).
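To play along at home, here's a minimal sketch of that likelihood in plain Python. It assumes the Bishop-style parametrisation of the Student's t (location, precision $\lambda$, degrees of freedom $\varphi$); the function names, the grid and its range are my own choices for illustration, not anything canonical.

```python
import math

def student_t_logpdf(x, mu, lam, phi):
    """Log density of St(x | mu, lam, phi): location mu, precision lam,
    phi degrees of freedom (assumed Bishop-style parametrisation)."""
    return (math.lgamma((phi + 1) / 2) - math.lgamma(phi / 2)
            + 0.5 * math.log(lam / (math.pi * phi))
            - (phi + 1) / 2 * math.log1p(lam * (x - mu) ** 2 / phi))

def log_likelihood(nu, estimates, lam, phi):
    """Log of the product over estimates, with shared lam and phi."""
    return sum(student_t_logpdf(nu, est, lam, phi) for est in estimates)

estimates = [0.9, 0.65, 0.8, 38.0]
# Hypothetical evaluation grid: nu from -10 to 50 in steps of 0.01.
grid = [i / 100 for i in range(-1000, 5001)]

def grid_mode(lam, phi):
    """Grid point where the likelihood is largest."""
    return max(grid, key=lambda nu: log_likelihood(nu, estimates, lam, phi))

# The four hyperparameter settings tried in this post.
for lam, phi in [(0.001, 0.001), (0.5, 0.01), (0.01, 2.0), (1.0, 2.0)]:
    print(f"lambda={lam}, phi={phi}: grid mode at nu={grid_mode(lam, phi):.2f}")
```

Exponentiating `log_likelihood` over `grid` and plotting it gives the curve for each setting, up to normalisation.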

$\lambda = \varphi = 0.001$ for something like an “ignorance” prior, with $0.001$ “effective” prior observations of precision $0.001$, or variance $1000$:

$\lambda = 0.5$ and $\varphi = 0.01$, for again almost no prior observations but with a higher precision:

Pretty much the same thing.

However, for $\lambda = 0.01$ and $\varphi = 2$, two effective prior observations with sample precision $0.01$ (whatever the hell that means with only two observations):

Which is pretty, well, pretty. It’s not even multimodal, and the prior confidence in all four estimates is exactly the same, with a fairly low precision. If I take the precision to $\lambda = 1$:

The Student’s t-distribution is what you get when you take a Gaussian and marginalise over its precision $\tau$, weighting each value of $\tau$ according to a Gamma distribution:

$Gam(x|\alpha, \beta) = \frac{\beta^\alpha}{\Gamma(\alpha)} x^{\alpha - 1} e^{-\beta x}$
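That claim is easy to sanity-check numerically. The sketch below assumes the usual mapping $\varphi = 2\alpha$, $\lambda = \alpha / \beta$ between the Gamma and Student's t parameters, and uses a crude midpoint rule; the helper names, the cutoff at $50/\beta$ and the step count are mine, picked only to make the check cheap.

```python
import math

def gamma_pdf(x, alpha, beta):
    """Gam(x | alpha, beta) with shape alpha and rate beta."""
    return math.exp(alpha * math.log(beta) - math.lgamma(alpha)
                    + (alpha - 1) * math.log(x) - beta * x)

def normal_pdf(x, mu, tau):
    """Gaussian density parametrised by its precision tau."""
    return math.sqrt(tau / (2 * math.pi)) * math.exp(-0.5 * tau * (x - mu) ** 2)

def student_t_pdf(x, mu, lam, phi):
    """St(x | mu, lam, phi), assumed Bishop-style parametrisation."""
    log_norm = (math.lgamma((phi + 1) / 2) - math.lgamma(phi / 2)
                + 0.5 * math.log(lam / (math.pi * phi)))
    return math.exp(log_norm
                    - (phi + 1) / 2 * math.log1p(lam * (x - mu) ** 2 / phi))

def marginal_by_quadrature(x, mu, alpha, beta, steps=100_000):
    """Midpoint-rule estimate of the integral over tau of
    N(x | mu, 1/tau) * Gam(tau | alpha, beta)."""
    upper = 50.0 / beta  # assumed cutoff; the Gamma tail beyond it is negligible here
    h = upper / steps
    return h * sum(normal_pdf(x, mu, (k + 0.5) * h)
                   * gamma_pdf((k + 0.5) * h, alpha, beta)
                   for k in range(steps))

for lam, phi in [(0.01, 2.0), (1.0, 3.0)]:
    alpha, beta = phi / 2, (phi / 2) / lam
    print(f"lambda={lam}, phi={phi}:",
          f"quadrature={marginal_by_quadrature(2.0, 0.8, alpha, beta):.6g}",
          f"closed form={student_t_pdf(2.0, 0.8, lam, phi):.6g}")
```

The two columns should agree to several digits, which is all this is meant to show; the Gamma density here is playing the role of the prior on $\tau$.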

That prior distribution is, like I said, equivalent to having observed $\varphi = 2\alpha$ effective prior points with sample precision $\lambda = \alpha / \beta$. For $\alpha=0$, however, that’s not normalisable; in fact, in the way it’s given, the normalising constant $\beta^\alpha / \Gamma(\alpha)$ isn’t even defined, because $\Gamma(\alpha)$ diverges as $\alpha \rightarrow 0$. But we can rewrite it:

$Gam(x|\alpha, \beta) = \frac {x^{\alpha - 1} e^{-\beta x}} {\int_0 ^{+\infty} x^{\alpha - 1} e^{-\beta x} dx}$

Taking the limit for $\alpha$ and $\beta$ approaching $0$ gives us $Gam(x|\rightarrow 0, \rightarrow 0) \propto 1/x$, which is the ignorance prior for a scale parameter of a distribution in the same way the constant distribution is the ignorance prior for a location parameter (like the mean). We’ve been using the ignorance prior for $\nu$; what if we used the ignorance prior for the precisions $\tau_i$ as well?

$\begin{aligned} p(\hat\nu_i | \nu, Ign) &= \int_0^{+\infty} \mathcal N(\nu | \hat \nu_i, \tau_i^{-1}) \tau_i^{-1} d\tau_i \\ &= \int_0^{+\infty} \frac {e^{\frac {-\tau_i (\nu - \hat\nu_i)^2} 2}} {\sqrt{2 \pi \tau_i}} d\tau_i \end{aligned}$

And the above does have an analytic form:

$p(\hat\nu_i | \nu, Ign) = \frac 1 {|\nu - \hat\nu_i|}$
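For completeness, here is one way to get there (a sketch; the shorthand $d = \nu - \hat\nu_i$ and the substitution variable $u$ are mine). Substituting $u = \tau_i d^2 / 2$ turns the integral into a Gamma function:

$\begin{aligned} \int_0^{+\infty} \frac {e^{-\tau_i d^2 / 2}} {\sqrt{2 \pi \tau_i}} d\tau_i &= \frac 1 {\sqrt{2\pi}} \cdot \frac {\sqrt 2} {|d|} \int_0^{+\infty} u^{-1/2} e^{-u} du \\ &= \frac {\Gamma(1/2)} {\sqrt{\pi} \, |d|} = \frac 1 {|d|} \end{aligned}$

The last step uses $\Gamma(1/2) = \sqrt\pi$.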

So the likelihood function for all estimates is:

$p(\boldsymbol{\hat\nu} | \nu, Ign) = \frac 1 {\prod_i |\nu - \hat \nu_i|}$

Of course, there’s the problem that the above can’t be normalised at all: near each $\hat\nu_i$ it blows up like $1/|\nu - \hat\nu_i|$, whose integral diverges logarithmically. It’s so improper it makes me cry. And well, Jaynes used to say that when the posterior is improper (and the prior was derived from a well-defined limiting process, which in this case it wasn’t, exactly) it’s because we don’t have enough information for inference. I’m not sure what information would be sufficient for inference in this case, but well, such is life, I guess.
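To see how badly it misbehaves, here's a small numerical sketch. It integrates the unnormalised likelihood over $[0.9 + \epsilon, 1.9]$ for shrinking $\epsilon$, creeping up on the singularity at $\hat\nu_1 = 0.9$. The interval, the log-spaced substitution and the function names are my own choices for illustration.

```python
import math

estimates = [0.9, 0.65, 0.8, 38.0]

def ignorance_likelihood(nu):
    """Unnormalised likelihood 1 / prod_i |nu - est_i|."""
    return 1.0 / math.prod(abs(nu - est) for est in estimates)

def mass_right_of(centre, eps, width=1.0, steps=4000):
    """Midpoint-rule integral of the likelihood over [centre + eps, centre + width].

    Substitutes nu = centre + exp(s), so the grid is log-spaced
    towards the singularity at `centre`.
    """
    s_lo, s_hi = math.log(eps), math.log(width)
    h = (s_hi - s_lo) / steps
    total = 0.0
    for k in range(steps):
        offset = math.exp(s_lo + (k + 0.5) * h)
        # Jacobian of the substitution: d(nu)/ds = exp(s) = offset
        total += ignorance_likelihood(centre + offset) * offset
    return total * h

cutoffs = (1e-2, 1e-4, 1e-6, 1e-8)
masses = [mass_right_of(0.9, eps) for eps in cutoffs]
for eps, mass in zip(cutoffs, masses):
    print(f"eps={eps:g}: mass={mass:.3f}")
```

Each hundredfold shrink of $\epsilon$ adds roughly the same amount to the integral, so the mass grows like $\log(1/\epsilon)$ and never settles down. The same thing happens on either side of every one of the four estimates.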

I’ll probably talk about the Laplace distribution in part 4, when I get to it, but for now, I think the Student’s t is pretty good.