Learning Bayes [part 1]

I have a confession to make.

I don’t actually know Bayesian statistics.

Or, any statistics at all, really.

Shocking, I know. But hear me out.

What I know is… Bayesian theory. I can derive Bayes’ Theorem, and I can probably derive most results that follow from it. I’m a good mathematician. But I haven’t actually spent any time doing practical Bayesian statistics. Very often a friend, like raginrayguns, tells me about a thing, a super cool thing, or asks me a thing, a super confusing thing, and it sort of goes over my head, and I have to work everything out from scratch to find my way. I don’t have the heuristics; I don’t have all the problem-solving techniques readily available.

For example, earlier today another friend of mine came up to me and asked me a question about the difference between Bayesian and frequentist statistics. Basically, he has data about a lot of bridges, with four pieces of information about each of them: cost, material, length, and the astrological sign of the bridge’s designer. He wanted – I had to ask a lot of questions to figure this out because, as I said, I don’t do statistics yet, I don’t know the jargon – to find the posterior distribution for the cost of the bridge he’s planning, given its material, its length, and his astrological sign. Or rather, he wanted the Bayesian answer, because he knew the frequentist one already.

Let me pause this a bit, and talk about another problem.

Suppose someone wants to estimate my height. They have some prior distribution of heights p(\text{“My height equals } h\text{”}|X) (which we’ll call p(h|X) for short), given by, say, the distribution of heights in my country of birth.

Then one of them comes and says, “We could get the height of his father!” And that’s a good idea, so they find out that my father’s height is F, and compute the posterior p(h|F, X). So far, so good.

Then another comes and says, “His name is Pedro! We should get data from Pedros!” So they round up 50 Pedros and…

And you don’t feel like this will be very useful at all, do you? You would in fact be discarding about 198 million people’s worth of information to focus on a group of 50 Pedros that has nothing to do with anything.

But if I am one of those 50 Pedros, well, then it becomes pretty cogent information! They no longer care about those other 198 million people in Brazil, since the actual me is in the actual group of Pedros.

Bayes’ Theorem goes:

p(h|\text{Pedro}, F,X) = p(h|F,X)\frac{p(\text{Pedro}|h, F, X)}{P(\text{Pedro}|F,X)}

(You’ll notice I used a lower-case p thrice and an upper-case P once. This is a convention: upper-case means an actual number that is a probability, and lower-case means a probability density function.)

  • p(h|F,X) is the prior distribution of heights, a function of h.
  • p(\text{Pedro}|h, F, X) is the likelihood function of Pedro as a function of h.
  • P(\text{Pedro}|F,X) is just a normalising constant, the probability that I am called Pedro given that my father has height F.
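
To make that concrete, here’s a minimal numerical sketch of the update (Python, with entirely made-up numbers: a Normal standing in for the prior p(h|F,X), and a hypothetical flat name-given-height likelihood; none of these figures come from real data):

```python
import numpy as np
from scipy.stats import norm

# Grid of candidate heights, in cm.
h = np.linspace(140, 210, 701)
dh = h[1] - h[0]

# Prior p(h|F,X): a made-up Normal, mean 173 cm, sd 7 cm
# (pretend it has already been updated on the father's height F).
prior = norm.pdf(h, loc=173, scale=7)

# Likelihood p(Pedro|h,F,X): the probability of being named Pedro
# as a function of height -- hypothetically flat, since names
# shouldn't depend on height.
likelihood = np.full_like(h, 0.01)

# Bayes' Theorem: posterior is proportional to prior times likelihood.
posterior = prior * likelihood
posterior /= posterior.sum() * dh  # normalise on the grid

# With a flat likelihood, the posterior is just the prior again.
print(np.allclose(posterior, prior / (prior.sum() * dh)))  # True
```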

Presumably p(\text{Pedro}|h, F, X) will in fact be a constant or nearly so, and equal to P(\text{Pedro}|F,X), because we suspect that names have only spurious correlations with height, if any at all. If that’s the case, then the posterior and the prior are exactly equal, because the likelihood “function” cancels the normalising constant out.
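
To spell the cancellation out (it needs nothing beyond the law of total probability): if p(\text{Pedro}|h,F,X) = c for every h, then

P(\text{Pedro}|F,X) = \int p(\text{Pedro}|h,F,X)\,p(h|F,X)\,dh = c\int p(h|F,X)\,dh = c

so the ratio in Bayes’ Theorem is exactly 1, and the posterior is the prior.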

But you know it won’t cancel perfectly, especially with a sample as small as 50 people, and by using it you will in fact be throwing information away.

Unless I am in the measured group of Pedros, in which case yes we’re throwing information out, but only useless information, and we want to use Bayes’ Theorem just like that.

So in the case where I am in the group, we want the posterior to look just like what Bayes says; but in the case where I’m not, we want the posterior to look exactly the same as the prior.

And of course, the consistency desideratum says that any two ways of calculating a quantity ought to give you the same result. So, suppose we invent a proposition N : \text{The name is relevant to the distribution} (and let’s rename the proposition “His name is Pedro” to just P). We can rewrite our posterior as:

\begin{aligned}p(h|P, F, X) &= p(h|N, P, F, X)P(N|P, F, X) \\ &+ p(h|\bar N,P,F,X)P(\bar N|P,F,X)\end{aligned}

Now, p(h|N,P,F,X) is the posterior distribution you get if you take the sample from the group of Pedros, while p(h|\bar N,P,F,X) is just the prior distribution. Then, intuitively, P(N|P,F,X) is the probability that I am in the group of Pedros.
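
Here’s the same mixture as a sketch in code (again with invented numbers; w stands in for P(N|P,F,X), and the 50-Pedro posterior is just a made-up Normal):

```python
import numpy as np
from scipy.stats import norm

h = np.linspace(140, 210, 701)
dh = h[1] - h[0]

# p(h|N-bar,P,F,X): if the name is irrelevant, the original prior (made up).
prior = norm.pdf(h, loc=173, scale=7)

# p(h|N,P,F,X): the posterior from the 50-Pedro sample
# (made up: this sample happens to run a little short).
from_pedros = norm.pdf(h, loc=170, scale=3)

# P(N|P,F,X): the probability that I'm in the measured group of Pedros.
w = 0.2  # hypothetical

# Law of total probability: the posterior is the weighted mixture.
posterior = w * from_pedros + (1 - w) * prior
print(np.isclose(posterior.sum() * dh, 1.0))  # still a proper density
```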

This looks like what I discussed above, intuitively: if I am in that group of 50 Pedros, then I want to take that sample; if I’m not, then I don’t. And by consistency, this result ought to be equal to the result when applying Bayes. Why isn’t it?

I in fact made a mistake in the analysis of the result of Bayes’ Theorem.

p(\text{Pedro}|h,F,X) is in fact the relevant likelihood function (and it’s most likely a constant). And it’s not given by my sample of 50 Pedros. Rather, p(\text{Pedro}|h,F,X) has to include information from every single Pedro in Brazil. The distribution given by that sample of 50 Pedros is rather p(\text{“He is in this specific group of 50 Pedros”}|h,F,X), which is zero if I’m not in that group! I was applying conditionalisation wrong, because I was conditionalising on the wrong thing!
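
To write the correct update down, let G abbreviate the proposition “He is in this specific group of 50 Pedros” (G is just my shorthand here, nothing more). Then Bayes’ Theorem reads

p(h|G,F,X) = p(h|F,X)\frac{p(G|h,F,X)}{P(G|F,X)}

and it’s p(G|h,F,X) that the 50-Pedro sample actually bears on, a quantity that is zero, no matter what h is, if I’m not in that group.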

What’s the moral of this story?

First, pay attention to what you’re conditionalising on. Second, make sure your target variable is actually affected by the others.

Back to the bridge example. Suppose his bridge is made of wood. Should he try to find p(cost|wood) by taking the cost distribution only over the wooden bridges in his database?

How about p(cost|Sagittarius)? Should he find it by taking only the data from bridges designed by Sagittarians?

If in the former case you felt unsure, or tempted to do so, I’m certain that in the latter case you’d accuse me of throwing perfectly valid and useful information down the drain, with no gain whatsoever.

Unless, of course, Omega had inserted the future data from my friend’s own bridge into his database. Then, yeah, using only information from people with the same astrological sign as his would actually improve his estimate.

But his bridge is not in that database. He cannot actually just conditionalise on the material he’s going to use, or on his astrological sign, because that information does not translate directly into the cost of a bridge that hasn’t been built yet. His not-yet-built bridge is not in the database, so what he’s really doing is making an educated guess based on past projects.

His posterior distribution, when conditionalised on (say) the fact that his bridge is made of wood, is a weighted average of the prior distribution over all bridges and the posterior distribution over all wooden bridges, where the weight is given by the probability that his project will be “similar enough” to other wooden projects.
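
In code, that weighted average looks just like the height example (a sketch with invented numbers; w plays the role of the probability that his project is “similar enough” to past wooden projects):

```python
import numpy as np
from scipy.stats import lognorm

# Grid of candidate costs, in arbitrary currency units.
cost = np.linspace(0.1, 1000, 5000)
dc = cost[1] - cost[0]

# Prior over all bridges, and posterior over wooden bridges only.
# Both are invented log-normals; real ones would come from his database.
prior_all = lognorm.pdf(cost, s=0.8, scale=100)
posterior_wood = lognorm.pdf(cost, s=0.5, scale=60)

# Probability that his project is "similar enough" to the wooden ones.
w = 0.7  # hypothetical

# His posterior: the weighted average of the two distributions.
posterior = w * posterior_wood + (1 - w) * prior_all

# Expected cost under the mixed posterior.
print((cost * posterior).sum() * dc)
```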

At least, that’s what it looks like to me.


4 Responses to Learning Bayes [part 1]

  1. Another Friend says:

    Everything you wrote seems fine, but I don’t think this solves my problem. Actually, the big issue is more… profound. Imagine you have a scatter plot with a sample of size n = 100, showing an apparently linear relation between Y and X, with a certain noise. In other words, the scatter plot of this: Y = α + β * X + ε, α and β being parameters and ε being a noise with null mean and a certain variance. However, when you take a more careful look at it, you happen to see not one line, but two! Each of two interns (say 1 and 2) had collected a sub-sample of size n1 = n2 = 50. If (α1, β1) were actually different from (α2, β2), you would certainly like to estimate them separately. On the other hand, if it turned out this peculiar phenomenon was nothing but an illusion, you would certainly like to estimate your model using the whole sample, for it would provide you with higher precision. How should one make this modeling choice? Frequentist statistics gives convenient answers. Personally, I would add dummy variables to my linear model (two dummies, to detect influence on α and β), indicating whether each point comes from sub-sample 1 or 2, and then estimate the coefficient of these dummies. If they were not significantly different from zero (*sigh*, I am well aware of how flawed this concept is), I’d just drop the dummies and estimate my model with the whole sample. This is the frequentist equivalent of saying: “If I haven’t got a very good reason to think there’s a difference, I’d rather take the more precise option, even if it’s possibly biased”. If I were a Bayesian (aka strictly correct) thinker, what should I do?

    • Another Friend says:

      By the way, when I wrote “I’d rather take the more precise option”, what I really meant was “I’d rather take the potentially more precise option”, or “I’d rather take the option which will be more precise if the coefficients are identical”.

  2. Pingback: Learning Bayes [part 2] | An Aspiring Rationalist's Ramble

  3. Pingback: Learning Bayes [part 3] | An Aspiring Rationalist's Ramble
