A few months ago, someone who used to be called perversesheaf came to tumblr to bash LessWrong there. Now, while there is a very large number of criticisms that can be aimed at it, both as a website and community, this person decided to bash Bayesianism, which, as readers of this blog might have noticed, is something I personally believe is probably Correct. That person has since deleted their blog, but Scott’s tumblr still has the original post, which I’m going to reproduce here, in a fashion. But first, the basics.
There is a principle in statistics called the Likelihood Principle whose basic content is that, “given a statistical model, all of the evidence in a sample relevant to model parameters is contained in the likelihood function.” You will remember that if we’re trying to estimate parametres $\theta$ after observing some data $x$ then the likelihood function is the following quantity:
$$\mathcal{L}(\theta \mid x) = P(x \mid \theta)$$
when seen as a function of $\theta$. And in particular, any function that has the form $f(x)\,P(x \mid \theta)$, where $f$ is any function of the data alone and not of the parametres, can be seen as a likelihood function.
More intuitively, we say that a function is a probability in a case where we’re wondering about the outcome when the parametres are held fixed: “If a coin was tossed ten times (experiment) and it is fair (parametre), what’s the probability that it lands heads every time (outcome)?”; and it’s a likelihood in a case where we’re wondering about the parametres when the outcome is held fixed: “If a coin was tossed ten times (experiment) and it landed heads every time (outcome), what’s the likelihood that it is fair (parametre)?”
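The probability/likelihood distinction can be made concrete with a few lines of Python. This is only a sketch: the function name `binomial_pmf` and the particular values of the parametre tried in the likelihood curve are my own choices for illustration, not anything from the original discussion.

```python
from math import comb

def binomial_pmf(heads: int, tosses: int, theta: float) -> float:
    """P(heads | tosses, theta) for independent tosses with heads-probability theta."""
    return comb(tosses, heads) * theta**heads * (1 - theta) ** (tosses - heads)

# Probability: the parametre is held fixed (fair coin) and we ask about an outcome.
prob_all_heads = binomial_pmf(10, 10, 0.5)

# Likelihood: the outcome is held fixed (10 heads in 10 tosses) and the
# parametre varies. The values of theta below are arbitrary illustration points.
likelihood_curve = {theta: binomial_pmf(10, 10, theta) for theta in (0.1, 0.5, 0.9, 1.0)}
```

The same function is used both times; only which argument we treat as the variable changes. Note that the likelihood values need not sum to one over the parametre.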
So the likelihood principle, then, says that all the information a sample can give about a parametre is contained in the function $\mathcal{L}(\theta \mid x)$. And if one believes that Bayes’ Theorem is the correct way to deal with uncertainty, then one must, necessarily, believe that the likelihood principle is true, since:
$$P(\theta \mid x) = \frac{P(x \mid \theta)\,P(\theta)}{P(x)} \propto \mathcal{L}(\theta \mid x)\,P(\theta)$$
That is, our posterior beliefs equal our prior beliefs times a likelihood function times a normalisation constant.
Let’s look at the W’s practical example. Suppose I run two experiments:
- A (possibly biased) coin was tossed $n$ times and came out heads $k$ times;
- Another (possibly biased) coin was tossed until it came out heads $k$ times, and was tossed a total of $n$ times.
I want to know how much information each of those experiments gives me about the probability $\theta$ that each of those coins will come out heads. In both cases, writing $x$ for the observed data, by Bayes’ Theorem:
$$P(\theta \mid x) \propto P(x \mid \theta)\,P(\theta)$$
The two experiments are not the same, though, so even if their prior distributions are equal, their posteriors may not be. Let’s see what the likelihood function for either experiment is.
The first experiment is just a series of independent Bernoulli trials, so our likelihood function is:
$$P(k \mid n, \theta) = \binom{n}{k}\,\theta^{k}(1-\theta)^{n-k}$$
In the second experiment, we toss the second coin until we have observed $k$ heads. Therefore, our likelihood function is:
$$P(n \mid k, \theta) = \binom{n-1}{k-1}\,\theta^{k}(1-\theta)^{n-k}$$
Now, suppose I ran experiment 1 with $12$ tosses and observed $3$ heads. Then I ran experiment 2, waiting for $3$ heads, and had to toss the coin $12$ times. The likelihood functions are respectively:
$$\mathcal{L}_1(\theta) = \binom{12}{3}\,\theta^{3}(1-\theta)^{9} = 220\,\theta^{3}(1-\theta)^{9}$$
$$\mathcal{L}_2(\theta) = \binom{11}{2}\,\theta^{3}(1-\theta)^{9} = 55\,\theta^{3}(1-\theta)^{9}$$
The two likelihoods are proportional, so the likelihood principle says we should draw the same conclusions from them. Now, if the priors for the two coins differ, our posteriors for both cases won’t necessarily be the same, but the point is that the impact of both experiments on our beliefs about their respective $\theta$ is exactly the same in both cases.
It intuitively makes sense, though. The difference between those two experiments is in when we decided to stop, in advance. The actual results of both experiments were the same: $12$ tosses, $3$ heads. The fact that, in my mind, I had decided in advance I’d stop when this or that happened, has no bearing on our actual observed results, and thus should have no impact on our actual beliefs about the actual parametre $\theta$.
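Here is a small numerical check of the proportionality claim. It is a sketch under assumptions: the totals of 3 heads in 12 tosses are taken from the Wikipedia likelihood-principle example rather than derived here, the flat prior on a grid is my own simplification, and the function names are hypothetical.

```python
from math import comb

def binomial_likelihood(theta, n=12, k=3):
    """Experiment 1: the number of tosses n was fixed in advance."""
    return comb(n, k) * theta**k * (1 - theta) ** (n - k)

def negative_binomial_likelihood(theta, n=12, k=3):
    """Experiment 2: toss until k heads; the last toss must be a head."""
    return comb(n - 1, k - 1) * theta**k * (1 - theta) ** (n - k)

def grid_posterior(likelihood, grid):
    """Posterior on a grid under a flat prior (an assumption made for illustration)."""
    weights = [likelihood(theta) for theta in grid]
    total = sum(weights)
    return [w / total for w in weights]

grid = [i / 100 for i in range(1, 100)]
ratios = [binomial_likelihood(t) / negative_binomial_likelihood(t) for t in grid]
posterior_1 = grid_posterior(binomial_likelihood, grid)
posterior_2 = grid_posterior(negative_binomial_likelihood, grid)
largest_gap = max(abs(a - b) for a, b in zip(posterior_1, posterior_2))
```

The ratio of the two likelihoods is the same constant at every grid point, so it cancels in the normalisation and the two posteriors coincide up to floating-point error.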
Why did I write that last paragraph?
Because in frequentist p-value analyses, those two outcomes are different. Consider the case where we’re trying to decide whether the null hypothesis that the coin is fair, $H_0: \theta = 0.5$, is true, as opposed to $H_1: \theta < 0.5$. A p-value analysis goes:
- Set a significance level in advance of the experiment. Typical values are 0.05 and 0.01;
- Calculate the probability that an outcome at least as extreme as the observed one would have happened if the null hypothesis was true;
- If that probability is less than the significance level, reject the hypothesis; otherwise, nothing can be said.
What are the p-values of both experiments? Inserting our values in the likelihood functions, we have:
$$p_1 = P(k \le 3 \mid n = 12, \theta = 0.5) = \sum_{i=0}^{3} \binom{12}{i}\,0.5^{12} = \frac{299}{4096} \approx 0.073$$
$$p_2 = P(n \ge 12 \mid k = 3, \theta = 0.5) = \sum_{m=12}^{\infty} \binom{m-1}{2}\,0.5^{m} = \frac{134}{4096} \approx 0.033$$
We calculated these probabilities because “at least as extreme” varies depending on which experiment we’re running, and which hypotheses we’re testing against each other. In the first experiment, observing $3$ or fewer heads after tossing the coin $12$ times is an extreme result for $H_0$ compared to $H_1$; in the second, having to toss the coin $12$ or more times when we’re just waiting for $3$ heads is the extreme result instead.
Based on that, the first experiment doesn’t reject the null hypothesis but the second one does, at the 5% significance level. Even though they had the exact same outcome. Using this test, the prior, completely mental choice we made about which experiment to run, completely unrelated to our beliefs about $\theta$, would somehow influence our beliefs about whether $H_0$ or $H_1$ is true.
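The two tail probabilities are easy to compute directly. The sketch below assumes the totals from the Wikipedia likelihood-principle example (3 heads, 12 tosses) and a one-sided alternative in which the coin is biased towards tails; the function names are mine.

```python
from math import comb

def p_value_fixed_tosses(k, n, theta0=0.5):
    """P(at most k heads in n tosses | theta0): the tail when n is fixed in advance."""
    return sum(comb(n, i) * theta0**i * (1 - theta0) ** (n - i) for i in range(k + 1))

def p_value_fixed_heads(k, n, theta0=0.5):
    """P(needing at least n tosses to see k heads | theta0).

    Needing n or more tosses means the first n - 1 tosses held at most k - 1 heads.
    """
    return sum(
        comb(n - 1, i) * theta0**i * (1 - theta0) ** (n - 1 - i) for i in range(k)
    )

p_fixed_tosses = p_value_fixed_tosses(3, 12)  # 299/4096
p_fixed_heads = p_value_fixed_heads(3, 12)    # 67/2048
```

With these assumed totals the first value lands above 0.05 and the second below it, which is the disagreement described above: same data, different verdicts.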
I’ve spoken before about how Bayesianism can only compare hypotheses to each other. They don’t exist in a vacuum, and a Bayesian procedure would never outright tell you to “reject a hypothesis.” It can tell you which hypotheses are more or less likely, when you give it a set of possibilities, a prior distribution over these possibilities, and a likelihood. That’s all it can do.
But frequentist statistics aims to do that, and for that it takes into account stuff that to a Bayesian has nothing to do with the experiment. It takes your mind states, prior to running the experiment, as part of the experiment. It takes possible distributions of data that haven’t happened. And then they criticise Bayesianism for being too subjective.
Okay, I guess.
We have to remember that frequentism and Bayesianism are different things, that answer different questions, whose basic object of study just happens to have the same name – probability – but is not the same thing at all. To a frequentist, it’s a limiting frequency; to a Bayesian, it’s a measure of uncertainty. They agree a lot, but sometimes they don’t.
Now let’s see perversesheaf’s post, entitled “A problem with the likelihood principle”:
As someone interested in both philosophy and mathematics, I am disturbed by the rise of what I call “radical Bayesian,” or the belief that the standard probability axioms and Bayes’ rule together give us a complete description of how we ought to reason about the world (at least in principle). I obviously blame Yudkowsky and LessWrong for the bulk of this, though similar lines of thought have been advocated by certain philosophers and seem to have trickled down to undergraduate philosophy majors who’ve never read LW.
(I’d just like to pause this to snicker a bit at the fact that perversesheaf thinks LW is as big as that. Like, forreal, radical Bayesianism has been around for much longer, and the LW community is wayyyy too tiny to be responsible for even a percent of that. And in fact, as far as I know, Yudkowsky himself was influenced in no small part by other AI researchers, since Bayesianism seems to be the rule in the AI and Machine Learning community [EDIT: This might not be true].)
In my view, the basic error here is not new and is also made by, for example, consequentialists and some of the more radical libertarians. First, take a complicated human endeavor (epistemology, morality, politics) and create a seductive, simple model that works on a variety of easy test cases. Second, wildly overgeneralize and claim this simple model constitutes a complete theory of the endeavor in question. Third, when presented with counterexamples that evoke strong intuitions that disagree with the results your model gives, don’t admit the model is incomplete or explain away those intuitions. Instead, dogmatically stick to your guns and claim those intuitions are normatively wrong in virtue of the fact that they disagree with your model (e.g., consequentialism and the dust speck thought experiment).
(I’d like to point a few things out. First, the dust speck thought experiment is about utilitarianism, not consequentialism. The former is a (tiny, tiny) subset of the latter. Second, obviously for any consistent normative moral theory, there are going to be counterintuitive thought experiments, because intuitive moral theory is inconsistent. Third, this is true of every other attempt to create a consistent (even if not necessarily simple) model of normative anything, because human intuition is inconsistent in general, and morality and inference are just two parts of it where this is glaring.)
It’s not that I don’t think Bayesian thinking (or consequentialist or libertarian thinking) has [anything] to offer. It’s incredibly useful, both practically and theoretically. Rather, I don’t think it gives a complete picture of how we ought to reason. There are good reasons to believe it doesn’t. But everyone likes simple, clear-cut answers. We desperately want to resolve the deep epistemological questions that weigh on our minds, and I think this need for resolution blinds us and allows us to be duped by purported grand unified theories of rationality (or morality, or politics, etc).
So instead of what I regard as the sensible, moderate position regarding Bayesian methods — they’re often very useful but sometimes inappropriate — many people end up being radical Bayesians and believe that Bayes’ rule presents The Final Answer to the question of how we should reason about uncertainty. It is the latter position I disagree with.
I think su3su2u1 (and nostalgebraist and a few others) have done an admirable job of pushing back against the extreme view. In particular, su3su2u1 often brings up difficulties with prior choice and how these are problematic for Bayesianism as a complete theory of reasoning. In this post, I want to raise a different problem that I haven’t seen on tumblr (at least among the blogs I read).
(The discussion with su3su2u1 is part of what inspired me to write this post. Basically, my response to this is that Bayesianism is a philosophical position, but in full generality it is literally uncomputable, actually not even approximable, and to people who think that the correct way to Bayes is using it in all practical cases (both critics and adherents) I say: ha.)
The problem has two parts. First, Bayesians are committed to the (strong) likelihood principle. Second, the (strong) likelihood principle is inadequate for reasoning about hypothesis testing situations where a stopping rule is involved. Deborah Mayo presents this argument in her book Error and the Growth of Experimental Knowledge (though it is not original to her — see the book for complete bibliographic information). I will present my own take on it, along with some mathematical justification she omitted.
By the likelihood principle, I mean the claim that when performing inference about an unknown parameter, only the observed data matters, and all the relevant information is contained in the likelihood function. Regarding my first claim, any Bayesian is, by definition, committed to this principle. After all, their rule for inference is “posterior = prior x likelihood.” The data affect the inference only through the likelihood function.
It’s worth comparing the likelihood principle to frequentist hypothesis testing to draw out the difference. Suppose I am working for a pharmaceutical company and asked to examine data from a drug trial to determine if the drug is effective. I am handed a notebook with 97 data points. If I wanted to do Bayesian hypothesis testing, my task is relatively straightforward. I draw up a model, find a prior, compute the likelihood, compute the posterior, then find (for example) a 95% percent highest posterior density interval. If this interval is entirely positive and doesn’t contain zero, then I conclude I should my belief that the drug is effective at (at least) 95% confidence. Bam, done.
Computing a frequentist p-value is a little trickier because, in addition to setting up a model, I need to concern myself with not just the data, but various counterfactual possibilities: ways the data could have been. A p-value is, by definition, the probability of the observed data given that the null hypothesis is true. The null hypothesis in this case includes the testing procedure. So I need to know, for example, whether the trial was originally planned to have 97 samples, or if it was stopped after 97 samples because the company thought the data observed so far was sufficient to show the drug was effective. Each of these cases constitutes a different null hypothesis and gives rise to a different p-value. A concrete example of how these stopping rules matter, worked out in detail and contrasted to the Bayesian approach, is given on this Wikipedia page: [link].
For a Bayesian, how the decision was made to stop data collection is irrelevant. All that matters is the observed data. My goal is to show that this ignorance of the experimental design can lead to ridiculous, obviously unwarranted inferences.
Suppose the certain drug company is unethical and wants to trick the FDA into thinking a placebo pill is actually an effective way to make people happier. (Suppose the placebo is given to subjects without any indication of what the drug is for, so that we may ignore the placebo effect.) To do this, the company hires a radical Bayesian statistician (who, by definition, will not concern themselves with stopping rules) and orders a trial of the drug to proceed as follows: the drug will be tested on subjects one by one, in order, and the experiment will be stopped only when the statistician has a 95% Bayesian credence that the drug is effective. (This number can be made arbitrarily close to 100.)
(I wonder if it can really be made arbitrarily close to 100? I think so, but not totally sure.)
You might wonder if this is always possible. After all, doesn’t the law of large numbers (or the central limit theorem, etc) show that, over time, the data should reflect the fact that the placebo has no effect? Yes, but under certain weak assumptions, it turns out there is enough variance in the experiment to make this unscrupulous stopping rule work every time, given an unlimited number of subjects.
Let’s suppose that, because of natural variation in happiness, some subjects become happier after the drug is administered, and an equal number become less happy. In fact, let’s assume that this natural variation is normally distributed with mean 0 and variance 1 on some suitable quantitative scale of happiness. (This set-up obviously relies on some dubious assumptions about how happiness works, but you can ignore the storytelling if you want and just concentrate on the fact we’re doing inference on i.i.d. normally distributed random variables with unknown mean.) Let’s also assume that the Bayesian statistician models the drug’s effectiveness as a normal distribution with known variance 1 and unknown mean and starts with a non-informative flat (improper) prior on the mean.
I want to pause here to emphasize that these assumptions on the model and prior are not necessary to make the argument work and only serve to make the math easier. The model could be unknown mean and unknown variance, for instance. And, if you don’t like improper priors, we could use a proper prior with really large variance to approximate an improper prior. Or we could use any normal prior, since various asymptotic results show the prior will “wash out” and become irrelevant when the sample size is large. We could even use certain more sophisticated models. But the essential point is the same.
Returning to the simplified set up, let’s suppose we are midway through the experiment and have taken $n$ samples with mean happiness change $\bar{y}$. Writing $\mu$ for the unknown mean, the statistician’s posterior distribution is then (cf. the third edition of Bayesian Data Analysis by Gelman et al., page 52, though the computation is straightforward and I’d be happy to provide the details if asked):
$$\mu \mid y_1, \dots, y_n \sim N\!\left(\bar{y}, \tfrac{1}{n}\right)$$
A 95% Bayesian credence interval is then given by $\bar{y} \pm 1.96/\sqrt{n}$.
(I recently explained the difference between confidence and credible intervals, but one of the things mentioned there is that frequentist confidence and Bayesian credible intervals agree exactly in the above case.)
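For readers who want to see the numbers, here is a minimal sketch of that posterior and its central 95% interval. It relies on the standard result that, with known variance 1 and a flat prior, the posterior for the mean is normal, centred on the sample mean, with variance 1/n. The data are made up and the function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, fmean

def flat_prior_posterior(samples, sigma=1.0):
    """Posterior for the mean under a flat prior and known standard deviation sigma."""
    n = len(samples)
    return NormalDist(mu=fmean(samples), sigma=sigma / sqrt(n))

def central_interval(posterior, mass=0.95):
    """Central credible interval holding the given posterior mass."""
    tail = (1 - mass) / 2
    return posterior.inv_cdf(tail), posterior.inv_cdf(1 - tail)

# Made-up happiness changes, for illustration only.
samples = [0.3, -1.1, 0.8, 1.9, 0.4, -0.2, 1.2, 0.6, 0.1]
posterior = flat_prior_posterior(samples)
low, high = central_interval(posterior)
```

The half-width of the interval depends only on the sample size, not on the data, which is why it coincides with the textbook frequentist interval in this particular model.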
It remains to show that there is always an $n$ such that
$$\bar{y} - \frac{1.96}{\sqrt{n}} > 0$$
If we define $S_n$ as the sum of the first $n$ random variables in an infinite sequence of independent, identically distributed random variables with distribution $N(0, 1)$, this is equivalent to asking if there is always $n$ such that
$$S_n > 1.96\sqrt{n}$$
This in turn is equivalent to asking if the event
$$\left\{ S_n > 1.96\sqrt{n} \text{ for some } n \right\}$$
has probability one. An argument using Kolmogorov’s zero-one law shows that it does. See here for details: [link].
So the evil drug company can always arrange the trial to trick the Bayesian statistician into inferring that the placebo has a positive effect. The drug company could even tell the statistician they will stop the trial only when they get favorable results, as this won’t affect the Bayesian’s inference. (Of course, if they let on that the drug is known to be a placebo and hence affect the statistician’s priors, they’re screwed.)
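A quick simulation gives a feel for how this stopping rule behaves when the number of subjects is capped. This is a sketch under assumptions of my own: the usual 1.96 cutoff for the 95% interval, the caps, the number of trials and the seeding scheme are all illustration choices, and the function names are hypothetical.

```python
import random
from math import sqrt

def first_stopping_time(rng, max_subjects, z=1.96, true_effect=0.0):
    """Return the first n at which mean - z/sqrt(n) > 0, or None if the cap is hit."""
    running_sum = 0.0
    for n in range(1, max_subjects + 1):
        running_sum += rng.gauss(true_effect, 1.0)
        # mean - z/sqrt(n) > 0 is the same condition as sum > z*sqrt(n)
        if running_sum > z * sqrt(n):
            return n
    return None

def stopped_fraction(max_subjects, trials=500, seed=0):
    """Share of placebo trials that reach the stopping condition within the cap."""
    stopped = 0
    for t in range(trials):
        # One reproducible random stream per trial, so each trial sees the
        # same sequence of subjects whatever the cap is.
        rng = random.Random(seed * 1_000_003 + t)
        if first_stopping_time(rng, max_subjects) is not None:
            stopped += 1
    return stopped / trials

fractions = {cap: stopped_fraction(cap) for cap in (1, 10, 100, 500)}
```

The share of trials that stop can only grow as the cap is raised. The probability-one result is about the limit of unlimited subjects; with any finite budget some trials never get there.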
(I will come back to this parenthesis later.)
The upshot is that Bayesian reasoning can result in horrifically poor inferences when stopping rules are involved.
I want to be explicit that my position is not “Bayesianism is flawed, therefore we should all be frequentists.” While it does accommodate stopping rules well, frequentism has problems of its own. Rather, my suggestion is that we should recognize that Bayesian reasoning, while useful, has certain shortcomings. We should not treat it as The Final Answer to statistical inference. That is, we should reject radical Bayesianism. Epistemology is hard. There aren’t simple answers.
I think perversesheaf is wrong. We had a ~2 month discussion about it, and in the end he disappeared. I hope he’s okay. Anyway, let’s see what can be done here.
His derivation is sound. That event does in fact happen with probability $1$. However, we need to flesh this out a bit more fully before we can deal with it properly. What does it mean to say that “the Bayesian was fooled”?
I can think of two answers. One, “the Bayesian believes the drug has a positive effect when it does not with 95% confidence.” Two, “the Bayesian believes the drug’s effect is higher than it actually is with 95% confidence.” The former case can only be used if the drug really really has no effect at all, or has a negative effect, whereas the latter can be used at any point and it encompasses the former. In perversesheaf’s thought experiment, the drug’s effect was zero, but for now I’ll pretend I don’t actually know what the effect is.
Then we calculate that sum on the right-hand side at the bottom for various possible values of the true effect , and its plot is given by:
So the upper bound drops below somewhere between and , and for the effect becomes negligible.
Now I’m going to derive an upper bound to the probability that the Bayesian will be fooled in the second sense for positive true effect . Let be the sum of the first samples from a distribution. Perversesheaf’s proof is equivalent to:
Reggie’s proof is equivalent to:
I want to derive:
In other words, I want to derive the probability that the Bayesian is fooled in the second sense conditional on this stopping rule having actually worked, as a function of the true value . In specific, I’ll focus on the positive case, since Reggie took care of the negative one for me. For ease of notation, let’s call our propositions:
Where is “the Bayesian is fooled when the true effect is ” and is “the Bayesian is fooled when the true effect is . Using Bayes’ Theorem, we want to calculate:
Since I’m only interested in the case where , and so . Furthermore, it can be shown that . Therefore:
The numerator is independent of , but the denominator is not. This can be fixed by noting that, since , , whence:
And the right-hand side of the above is plotted as a function of below:
At first, the upper bound is not terribly tight, but eventually it approaches a number we can calculate. Intuitively, for large enough , a single sample is enough for perversesheaf’s stopping rule to apply, so a single sample will be collected and the Bayesian’s 95% credible interval will be greater than . But in that case, the only way the Bayesian can be fooled (in the second sense) is if that single sample is greater than , and we know the probability for that: .
Thus, for a large enough true effect, the Bayesian will be fooled only 2.3% of the time. With my upper bound, “large enough” is or greater; however, like I said, my bound is not very tight, so it’s probably quite a bit less than that.
Okay, that’s all fine and dandy, I’ve proven that for high enough values of the true effect the Bayesian won’t be fooled, and Reggie’s proven that for low enough values of the true effect the Bayesian won’t be fooled. But that’s just dodging the problem! From its description, we know that the true effect is zero, so the Bayesian is fooled anyway!
Not quite. See, the point of those derivations wasn’t to just “neener neener,” there was a purpose there.
I commented, above, that I’d come back to a certain parenthesis perversesheaf made. You can go back and remind yourself of what it was, I’ll wait. Done? Okay.
As perversesheaf said, if the evil drug company lets on to the Bayesian that they know the drug to be a placebo, that’ll affect the Bayesian’s priors, and then they’re screwed. Yet he won’t let the knowledge of the stopping rule itself affect the Bayesian’s priors!
The likelihood principle works because, normally, . In other words, knowledge of a stopping rule does not typically affect the likelihood of an experiment, as I’ve shown in the coin tosses example. And indeed, it does not directly affect it, but there’s a way it can affect exactly the thing perversesheaf was afraid of affecting: the priors.
In the little Bayesian Network with labelled arrows I drew above, we can see a way in which knowledge of the stopping rule can affect knowledge of the true parametre: if the choice of the stopping rule was influenced by knowledge of the true parametre, then knowledge of the stopping rule affects our priors for the parametre! And while in a case such as the coin tossing one it might be reasonable to suppose that no such knowledge exists, in this case we might very well suppose that it does!
Why? Because of the whole analysis Reggie and I did up there. This stopping rule only has a reasonable chance of fooling the Bayesian in the case where the true effect is zero or near zero. And even more than that, just a prior belief that this stopping rule is more likely to be chosen when someone wants to fool us than otherwise is enough to change our prior and screw things up.
Sanity check: Suppose we purposefully sever the “Knowledge” arrow in the above network; that is, suppose we change the story a bit so that it was the Bayesian themself who chose that stopping rule, arbitrarily. So there’s no evil drug company, just you, a bunch of data, and you decide to use this stopping rule just for the lulz, without any prior knowledge of the true effect. Then the Bayesian’s posterior belief is clearly no longer “foolish”! They don’t know what the true effect is, they don’t have any prior beliefs that would justify that true effect being this or that value, so their posterior really does reflect the appropriate conclusions they should draw. Actually, the fact that the stopping rule worked at all basically rules out any value for below , so by renormalisation alone that would be enough for your point estimate to be greater than and maybe your 95% credible interval as well.
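One way to probe this sanity check numerically is to let the true effect really be drawn from the Bayesian’s own prior, independently of the stopping rule, and see how often a stopped trial ends with a wrong conclusion. The sketch below makes several assumptions that are not in the thought experiment: a proper $N(0, 1)$ prior instead of the flat one (so that it can be sampled), a cap on the number of subjects, and a stopping rule phrased as “posterior probability of a positive effect above about 97.5%”. All names are hypothetical.

```python
import random
from math import sqrt

def run_trial(rng, prior_sd=1.0, max_subjects=200, z=1.96):
    """Draw a true effect from the prior, then sample until the posterior
    probability of a positive effect exceeds about 97.5%, or the cap is hit."""
    true_effect = rng.gauss(0.0, prior_sd)
    running_sum = 0.0
    for n in range(1, max_subjects + 1):
        running_sum += rng.gauss(true_effect, 1.0)
        precision = 1.0 / prior_sd**2 + n      # posterior precision (variance-1 data)
        posterior_mean = running_sum / precision
        posterior_sd = 1.0 / sqrt(precision)
        if posterior_mean / posterior_sd > z:  # P(effect > 0 | data) > Phi(z)
            return true_effect, True
    return true_effect, False

rng = random.Random(0)
results = [run_trial(rng) for _ in range(2000)]
stopped_effects = [effect for effect, stopped in results if stopped]
wrong_fraction = sum(effect <= 0 for effect in stopped_effects) / len(stopped_effects)
```

Under these assumptions the share of stopped trials whose true effect is actually zero or negative should come out at or below roughly 2.5%, up to simulation noise, even though every trial used the “stop when it looks good” rule. That matches the claim that the posterior is not foolish when the stopping rule carries no information about the true effect.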
And this also reflects our realistic intuitions very well. If I looked at the stopping rule someone used and did not suspect it was meant to tamper with my confidence, I would carry on as usual; if I did suspect it was meant to tamper with my confidence, that’d be extra information about the estimated parametre that I can use as part of my prior.
Another interesting thing is that, “natively,” frequentist methods don’t know how to deal with stopping rules either; you need to change them appropriately, let them take the stopping rule into account, so it’s only fair that you let the Bayesian do the same.
Final conjecture: This is always the case. Whenever we intuitively think that a stopping rule will fool the Bayesian, either it will affect the Bayesian’s prior and they won’t be fooled, or what we’re calling “being fooled” is not actually being fooled at all. If you allow your Bayesian statistician to be at least as smart as yourself, they can predict everything you can.