Recently, while reading Jaynes’ Probability Theory: the Logic of Science, I was overcome by an urge to rant about some of the stuff I read there, and one of the things I said was that, basically, orthodox statistics sucks.
Well, first off, what is this orthodox statistics I’m talking about, and what would non-orthodox statistics be?
That… is a very long story. I’ll try to keep it reasonably short. Basically, over the past few hundred years the field of “probability” showed up to handle a whole buncha things, all of them more or less about giving a formal description of how to draw valid conclusions from experiments, what you should believe, and so on. It’s supposed to be, in Jaynes’ words, the logic of science.
Enter Sir R. A. Fisher and some others who are very opposed to the idea that probability theory as logic makes any sense, and then they completely destroy the field with nonsense about potentially infinitely repeated experiments, ad hoc tools, and objective probabilities. I’m not going to go much further into this story, but rather show an example of the failures of orthodox statistics, and how Bayesian Probability Theory does it better.
I’ll use Pearson’s $\chi^2$ (read: chi-squared) test. If you have more than a passing acquaintance with orthodox statistics, you probably know the test. Even if you don’t, it’s not in fact too unlikely that you’ve heard of it at some point or another. It’s something called a “test statistic” and it’s what orthodox statisticians use for hypothesis testing.
It’s a purported measure of “goodness of fit” of a hypothesis to the observed data. Basically, given $m$ possible outcomes of an experiment repeated $N$ times (the data), the “null hypothesis” says that the expected number of times the $k$th outcome is observed in the experiments is $Np_k$. The $n_k$ are the actual observed numbers. Then:

$$\chi^2 = \sum_{k=1}^{m}\frac{(n_k - Np_k)^2}{Np_k}$$

The closer the above number is to zero, then, the closer the observed distribution is to the expected one. And in fact, the $n_k$ and $Np_k$ don’t necessarily have to be numbers of trials; they can be any kind of random variable, as long as they’re all greater than or equal to 5, or they’re all greater than or equal to 1 and at least 20% of them are greater than or equal to 5 (the fact that this caveat exists should start making you suspicious).
Now, orthodox statisticians have developed quite a theory around this. Here’s how it works. There’s a probability density function called the chi-squared distribution:

$$f_k(x) = \frac{x^{k/2-1}e^{-x/2}}{2^{k/2}\,\Gamma(k/2)}, \qquad x > 0$$

and if $k$ random variables are “independent and standard normal,” the sum of their squares is distributed according to that pdf. Then, one compares the calculated $\chi^2$ value with that distribution (or, rather, with the cumulative distribution) and figures out what the “p-value” of that test statistic is.
In the above, the number of degrees of freedom $k$ is, intuitively, the number of things that “can vary” in the distribution. For example, when throwing a six-sided die, there are n = 6 outcomes but k = 5 degrees of freedom, because if we didn’t observe any of the first five outcomes, we necessarily observed the sixth. The number k isn’t always equal to n – 1, however. It is rather n – p where p = s + 1 and s is the number of “co-variates” in the distribution. This isn’t really important to the understanding of the concept, and different fittings have different numbers of co-variates.
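To make the statistic itself concrete, here’s a minimal sketch in Python; the die counts are made up for illustration:

```python
def chi2_stat(observed, expected):
    """Pearson's chi-squared statistic: sum over outcomes of (n_k - N*p_k)^2 / (N*p_k)."""
    return sum((n - e) ** 2 / e for n, e in zip(observed, expected))

# Hypothetical data: a die thrown 60 times under the "fair die" null
# hypothesis, so each face is expected 60/6 = 10 times (k = 5 degrees of freedom).
observed = [8, 12, 9, 11, 10, 10]
expected = [60 / 6] * 6
print(chi2_stat(observed, expected))  # → 1.0
```

A value of 1.0 is well below the 5% critical value for five degrees of freedom, so the orthodox verdict here would be “do not reject.”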
But that only moves the question to: what is a p-value?
You might have heard of or seen papers and studies that say a conclusion was reached with p < 0.05. The intuition behind the p-value is that you take some test statistic (such as the $\chi^2$) and figure out the probability that, on the null hypothesis, the test statistic would come out at least as extreme (i.e. as different from expected) as the value actually observed. Therefore, in an experiment with one degree of freedom (such as the tossing of a coin), values of $\chi^2$ greater than approximately 4 (more precisely, 3.84) have a probability of less than 5% of being observed.
And what’s the point of this? Orthodox statistics holds that if the p-value for your chosen statistic is less than some arbitrary number, then you should reject whatever your “null hypothesis” is (one would suppose that in the case of a coin the null hypothesis says the coin is fair). And… that’s it. Orthodox statistics tells you that if, “on chance alone,” there’s a less than (say) 5% probability of observing what you did, you should mark the null hypothesis as false.
Note that, for orthodox statistics, probability is the same thing as frequency. So what they really mean is that, if the results obtained would be observed in less than p (which is usually, as mentioned, 5%) of similar experiments if the null hypothesis were true, then you should reject that hypothesis.
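To make that frequentist reading concrete, here’s a sketch that estimates a p-value by literally running the “repeated experiments”: simulate many runs under the null and count how often the statistic comes out at least as extreme as observed. The 60-heads-out-of-100 figures are made up for illustration:

```python
import random

def chi2_stat(observed, expected):
    return sum((n - e) ** 2 / e for n, e in zip(observed, expected))

def mc_p_value(n_heads, n_tosses, trials=20_000, seed=0):
    """Estimate P(chi^2 >= observed value | fair coin) by simulating
    the 'infinitely repeated experiment' a finite number of times."""
    expected = [n_tosses / 2, n_tosses / 2]
    observed_stat = chi2_stat([n_heads, n_tosses - n_heads], expected)
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        heads = sum(rng.random() < 0.5 for _ in range(n_tosses))
        if chi2_stat([heads, n_tosses - heads], expected) >= observed_stat:
            hits += 1
    return hits / trials

# 60 heads in 100 tosses gives chi^2 = 4, just past the ~3.84 threshold,
# so the estimate should land in the vicinity of 0.05.
print(mc_p_value(60, 100))
```

Note that the simulation is against the null hypothesis only; no alternative hypothesis appears anywhere, which is exactly the complaint developed below.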
Now, there is a huge number of things wrong with this picture. Let’s see what they all are.
First, the p-value chosen is completely arbitrary. And the usual one, 5%, is, to say the least, huge: even when the null hypothesis is true, one experiment in twenty will reject it. If you’re not absolutely outraged by this, you should be. If I recall correctly, physics uses a value like 0.01, which is better, but only 5 times so, and still completely arbitrary.
Second, using different test statistics may give you different p-values. If you use the $\chi^2$ you may get p = 0.07 but using Student’s t-test may get you p = 0.04 (I’m guessing with these two particular numbers; for all I know these two tests may be correlated in a way that makes this impossible, but the general principle holds that two test statistics need not say the same thing). So even the choice of which test statistic to use is arbitrary (somewhat, because some test statistics can’t be used for some kinds of experiments).
Third, the list of possible test statistics to pick from is also arbitrary! Why would you choose the $\chi^2$ as a measure of fit? Why compare your $\sum_k\frac{(n_k-Np_k)^2}{Np_k}$ to some imaginary distribution as opposed to comparing, say, $\sum_k\frac{\lvert n_k-Np_k\rvert}{Np_k}$, or $\sum_k\frac{(n_k-Np_k)^4}{Np_k}$, or $\max_k\lvert n_k-Np_k\rvert$? Why use a t-distribution? Why those tests and not others? Since these test statistics are really only based upon certain statisticians’ intuitions, one shouldn’t expect a consistent answer to this.
Fourth, as I illustrated with the conditions of applicability of the $\chi^2$ test, these tools aren’t universally valid. Since they’re ad hoc, they’re only applicable to some specific domains where they’re supposed to more-or-less work somewhat reliably.
And fifth… a p-value makes absolutely no mention of alternatives or prior knowledge. It says you should “reject” a hypothesis if its arbitrary p-value is below some arbitrary threshold, and “keep” it otherwise, but it doesn’t tell you what to use in its place. To the orthodox statistician, prior knowledge is metaphysical nonsense, and so is talking about the probability of a hypothesis, $P(H)$! But if you can’t talk about that, then you can’t really say whether you should trust your hypothesis; what they offer instead are these test statistics that only tell you how well your hypothesis predicted the data, and they don’t do a very good job at that either.
See, this is what I mean when I say sometimes I look at orthodox statistics and screech in frustration.
One could argue that if some hypothesis $H_0$ gives a low p-value, then you could just use the hypothesis $\bar H_0$ instead, which is just the “ensemble of all hypotheses that are not $H_0$.” That too makes no sense because… well, $\bar H_0$ is an average of every hypothesis that’s not $H_0$, and it’s quite likely that the vast majority of them are worse fits to the data than $H_0$ itself, which would make $\bar H_0$ have an even lower p-value than $H_0$.
In Bayesian terms, given some observed data $D$ and some statistic $T$, the p-value is defined as $P(T(D') \ge T(D)\mid H_0)$ (the probability that hypothetical data $D'$ generated under the null hypothesis would score at least as extreme as the data actually observed), which we’ll call just $p(D\mid H_0)$. Now, $p(D\mid H_0)$ doesn’t actually make any reference to $\bar H_0$! That is, the p-value of some data on some hypothesis is not directly coupled to the p-value of that data on its negation, and they can both be arbitrarily low!
So if you don’t specify an alternative hypothesis, orthodox statistics is silent. It just tells you that its arbitrary, ad hoc tool says your null hypothesis is bad and you should feel bad.
And as I mentioned, we’re not in fact interested in $P(D\mid H_0X)$; we’re interested in $P(H_0\mid DX)$, which is:

$$P(H_0\mid DX) = P(H_0\mid X)\,\frac{P(D\mid H_0X)}{P(D\mid X)} = \frac{P(H_0\mid X)\,P(D\mid H_0X)}{P(H_0\mid X)\,P(D\mid H_0X) + P(\bar H_0\mid X)\,P(D\mid \bar H_0X)}$$
So even if we choose to approximate $P(D\mid H_0X)$ by $p(D\mid H_0)$, the above still mentions $P(D\mid \bar H_0X)$, which can be less than $P(D\mid H_0X)$, and it also mentions the prior knowledge $X$, which is completely ignored by the test statistic itself.
Furthermore, the fact that the statistic is arbitrary means that it will almost surely give you nonsensical results when pushed to a domain it wasn’t specifically designed to deal with, unless by luck you happen to get a test statistic that is actually derivable from Bayes’ Theorem.
I’ll show you both things now with a little story.
Suppose Mr. E is on Earth with a British pound coin, and his assigned probabilities are that, after it’s tossed, the probability of heads is 0.499, the probability of tails is 0.499, and the probability of landing on its exact edge is 0.002. However, he tells Ms. M, on Mars, only that he’s running an experiment that has three possible outcomes. By the principle of indifference, Ms. M’s null hypothesis is that each outcome has a 1/3 probability of being observed.
Then Mr. E tosses the coin 29 times, and it turns up heads 14 times, tails 14 times, and edge once. He tells her that, and they calculate the $\chi^2$ of that data based on their respective null hypotheses.
Mr. E’s expected results are 29 × 0.499 = 14.471 heads, 14.471 tails, and 29 × 0.002 = 0.058 edges. His $\chi^2$ is, then:

$$\chi^2_E = 2\cdot\frac{(14-14.471)^2}{14.471} + \frac{(1-0.058)^2}{0.058} \approx 15.33$$
Now, this is a bit disconcerting. If you look at a chi-squared distribution table, the 5% critical value for two degrees of freedom is 5.99, and even the 0.1% value is only 13.82; so after observing this data, Mr. E should reject his hypothesis as false!
But wait, it gets worse. Now Ms. M calculates her own $\chi^2$, with expected counts of 29/3 ≈ 9.667 for each outcome:

$$\chi^2_M = 2\cdot\frac{(14-9.667)^2}{9.667} + \frac{(1-9.667)^2}{9.667} \approx 11.66$$
The table also says that her hypothesis should be rejected, true, but the value of the $\chi^2$ says that it’s a better fit than Mr. E’s! She will look at her $\chi^2$, look at Mr. E’s, and say, “Well, our hypotheses are both fairly flawed, but yours is certainly worse than mine!”
That’s clearly preposterous. Most people trained to use the $\chi^2$ test will be very surprised by this and will try to check the calculation, but it’s correct, and it’s just what one would expect from using completely arbitrary tools not derived from basic principles. Note that this isn’t even outside the stated scope of the test, given that 2/3 of the observed counts are greater than 5 and the remaining one is equal to 1.
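If you’d rather rerun the arithmetic than trust the table, the story’s numbers are easy to reproduce:

```python
def chi2_stat(observed, expected):
    return sum((n - e) ** 2 / e for n, e in zip(observed, expected))

N = 29
observed = [14, 14, 1]  # heads, tails, edge
chi2_E = chi2_stat(observed, [N * p for p in (0.499, 0.499, 0.002)])  # Mr. E
chi2_M = chi2_stat(observed, [N / 3] * 3)                             # Ms. M
# Both exceed 5.99, the 5% critical value for 2 degrees of freedom,
# and Mr. E's (correct!) hypothesis scores *worse* than Ms. M's.
print(round(chi2_E, 2), round(chi2_M, 2))  # → 15.33 11.66
```

Almost all of Mr. E’s total comes from the single edge term, whose expected count is a minuscule 0.058.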
How would a Bayesian do it, then?
Well, I have mentioned that the strength of a hypothesis can only be measured against other hypotheses. So Bayes will never tell you to outright reject some hypothesis unless it can clearly indicate an alternative.
And with that philosophy… it turns out that we can, in fact, devise a test that’s more or less like a test statistic, but with two advantages: it’s derivable from first principles (which means it is universally valid), and it doesn’t actually say “reject this hypothesis” or anything like that, but rather says “this data cannot support competing hypotheses by more than some given amount.”
I’ll explain. I’ve talked about the evidence $e(H\mid X) = 10\log_{10}\frac{P(H\mid X)}{P(\bar H\mid X)}$, a form which is exactly equivalent to Bayes’ Theorem. Suppose I have some background information $X$, some data $D$, and a hypothesis $H$. Then, if I’m comparing it to any single alternative hypothesis $H'$ (that is, in my restricted universe of discourse the negation of $H$ is $H'$), I have that:

$$e(H\mid DX) = e(H\mid X) + 10\log_{10}\frac{P(D\mid HX)}{P(D\mid H'X)}$$
Now I’m going to rewrite the above as

$$e(H\mid DX) = e(H\mid X) - 10\log_{10}P(D\mid H'X) + 10\log_{10}P(D\mid HX)$$
Since $P(D\mid H'X)$ and $P(D\mid HX)$ are both $\le 1$, their logarithms are negative or zero, so the first of the two log-terms above is positive and the second is negative. Now, hold $H$ fixed, and let $H'$ range over every possible hypothesis (we can compare hypotheses pairwise because they really only compete with each other pairwise). The support the data gives $H'$ over $H$ attains its maximum value when $H'$ is the “sure thing” hypothesis $S$ that says that everything that was observed couldn’t have been otherwise, $P(D\mid SX) = 1$. In that case, we have $-10\log_{10}P(D\mid SX) = 0$ and can define the following quantity:

$$\psi \equiv -10\log_{10}P(D\mid HX)$$
$\psi \ge 0$ necessarily. This means that there is no possible alternative hypothesis which data $D$ could support, relative to $H$, by more than $\psi$ decibels. This is to say that Probability Theory cannot answer “How well does data $D$ support hypothesis $H$?” because that question makes no sense; rather, it answers “Are there any alternatives which data $D$ would support relative to $H$? How much support is possible?”
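In code the quantity is almost trivial. Here’s a sketch for the common case where the hypothesis assigns an independent probability to each item of an observed sequence; the example coin sequence is made up:

```python
import math

def psi(sequence_probs):
    """psi = -10 * log10 P(D|HX): the most decibels by which any
    conceivable alternative could be supported by the data over H."""
    return -10 * sum(math.log10(p) for p in sequence_probs)

# A coin believed fair, and the observed sequence H, H, T, H:
# each toss has probability 0.5 under the hypothesis.
print(psi([0.5, 0.5, 0.5, 0.5]))  # ≈ 12.04 dB
```

A hypothesis that assigned probability 1 to everything that happened would get $\psi = 0$: no alternative could gain any ground on it at all.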
And this is what I’d call a “Bayesian test statistic.” It’s a number that depends exclusively on the hypothesis under consideration and the data, and the closer it is to zero, the better hypothesis $H$ fits data $D$. And unlike the $\chi^2$ test, this was derived directly from Bayes’ Theorem, which means that it will not suffer from the weaknesses of the ad hoc orthodox tools. So, even though this statistic doesn’t directly mention an alternative hypothesis, it’s still Bayesian in spirit because it’s based on an implicit class of alternative hypotheses.
And this number is also, unlike the $\chi^2$, universal. You can use it for any class of hypotheses. But if we do restrict our attention to the domain where the $\chi^2$ is applicable, we can actually do even better than the $\chi^2$.
Suppose the hypothesis is in the “Bernoulli class” (that is, $N$ consecutive trials with $m$ possible outcomes, each outcome with a probability that’s independent of the other trials, like the toss of a coin). If we call the outcome of the $i$th trial $x_i$, then the data $D$ is just all of those outcomes one after the other, and we have that

$$P(D\mid HX) = \prod_{k=1}^{m} p_k^{n_k}$$
where the $n_k$ are the number of times the $k$th outcome was observed in those $N$ trials, the $p_k$ are the probabilities the hypothesis assigns to each outcome, with $\sum_{k=1}^m n_k = N$ and $N$ being the total number of trials.
Now, given any observed sequence of outcomes $D$, the hypothesis that fits it best is the one that predicts the exact observed frequencies. To show this, let’s call the observed frequencies $f_k = n_k/N$. Using the fact that, on the positive real line, $\ln x \le x - 1$ with equality iff x = 1, and choosing $x = p_k/f_k$, we have that

$$\sum_{k=1}^{m} n_k \ln\frac{p_k}{f_k} \le \sum_{k=1}^{m} n_k\left(\frac{p_k}{f_k} - 1\right) = \sum_{k=1}^{m}\left(Np_k - n_k\right) = N - N = 0$$
with equality iff $p_k = f_k$ for every $k$, and with m being the number of possible outcomes. Why is this relevant? If we call $\hat H$ the hypothesis of perfect fit, the one that predicts outcome probabilities equal to the observed frequencies:

$$10\log_{10}\frac{P(D\mid \hat HX)}{P(D\mid HX)} = 10\sum_{k=1}^{m} n_k\log_{10}\frac{f_k}{p_k} \ge 0$$
So for any hypothesis H in the class, the above is 0 only when $p_k = f_k$, which means the hypothesis with the best possible fit in the Bernoulli class is exactly $\hat H$.
Now, let’s bring back:

$$\psi = -10\log_{10}P(D\mid HX)$$
But if you take the universe of alternatives to be the Bernoulli class, then the best alternative is no longer the “sure thing” hypothesis but $\hat H$, whose probability for the data is just the thing we derived. So we have a refinement:

$$\psi = 10\log_{10}\frac{P(D\mid \hat HX)}{P(D\mid HX)} = 10\sum_{k=1}^{m} n_k\log_{10}\frac{n_k}{Np_k}$$
The closer this number is to zero, the better your hypothesis fits the data. And it’s stronger than the general $\psi$, because it restricts its attention to the Bernoulli class, and perfect fit now means that the observed frequencies were exactly the predicted ones.
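Here’s a sketch of this refined statistic, using only the observed counts and the hypothesis’s outcome probabilities; outcomes with $n_k = 0$ contribute nothing to the sum, since $n_k \log n_k \to 0$:

```python
import math

def psi_bernoulli(observed, probs):
    """psi = 10 * sum_k n_k * log10(n_k / (N * p_k)), in decibels,
    relative to the best-fitting hypothesis in the Bernoulli class."""
    N = sum(observed)
    return 10 * sum(n * math.log10(n / (N * p))
                    for n, p in zip(observed, probs) if n > 0)

# Perfect fit: observed frequencies exactly match the predicted
# probabilities, so no alternative in the class can gain any ground.
print(psi_bernoulli([5, 5], [0.5, 0.5]))  # → 0.0
```

Any deviation of the observed frequencies from the predicted probabilities pushes the number above zero.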
Now let’s see an interesting thing. Define a quantity:

$$\delta_k = \frac{n_k - Np_k}{Np_k}$$
With this we can find the following three identities:

$$\sum_{k=1}^{m} Np_k\,\delta_k = \sum_{k=1}^{m}\left(n_k - Np_k\right) = N - N = 0$$

$$n_k = Np_k(1+\delta_k)$$

$$\frac{n_k}{Np_k} = 1+\delta_k$$
And let’s take a look at our $\psi$ again. We can reexpress it using the last two identities:

$$\psi = 10\sum_{k=1}^{m} Np_k(1+\delta_k)\log_{10}(1+\delta_k)$$
Using the Taylor expansion of the logarithm around 1, we have that $\ln(1+\delta_k) \approx \delta_k - \frac{\delta_k^2}{2}$. Note that this expansion is only valid when $\delta_k$ is relatively small (it has to be between -1 and 1). We can use that:

$$\begin{aligned}\psi &= 10\sum_{k=1}^{m} Np_k(1+\delta_k)\log_{10}(1+\delta_k)\\ &= \frac{10}{\ln 10}\sum_{k=1}^{m} Np_k(1+\delta_k)\ln(1+\delta_k)\\ &\approx \frac{10}{\ln 10}\sum_{k=1}^{m} Np_k(1+\delta_k)\left(\delta_k - \frac{\delta_k^2}{2}\right)\\ &\approx \frac{10}{\ln 10}\sum_{k=1}^{m} Np_k\left(\delta_k + \frac{\delta_k^2}{2}\right)\\ &= \frac{5}{\ln 10}\sum_{k=1}^{m} Np_k\,\delta_k^2\end{aligned}$$
The second line is because we’re taking the logarithm to base 10 instead of the natural logarithm, so we have to divide everything by $\ln 10$; the fourth drops the third-order term in $\delta_k$; and the last equality is true because of our first identity, $\sum_k Np_k\delta_k = 0$. Now if you expand that last sum a bit:

$$\psi \approx \frac{5}{\ln 10}\sum_{k=1}^{m} \frac{(n_k - Np_k)^2}{Np_k}$$
But wait. The $n_k$ are just the observed counts and the $Np_k$ are the expected counts of the $\chi^2$ test! And this means that:

$$\psi \approx \frac{5}{\ln 10}\,\chi^2 \approx 2.17\,\chi^2$$
Or, in other words, the surprising result is that $\psi \approx 2.17\,\chi^2$!
Well. Not really. See, what’s really the case is that $\chi^2$ is a second-order approximation to $\psi$ when the condition that all the $\delta_k$ have modulus less than one is met. And predictably, when that condition is not met, we will find the $\chi^2$ test to be lacking.
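You can watch the approximation hold and then break numerically. With small deviations the two statistics nearly coincide; throw in an outcome that was expected almost never and they drift far apart. Both datasets below are for illustration (the second is this post’s coin story):

```python
import math

def chi2_stat(observed, expected):
    return sum((n - e) ** 2 / e for n, e in zip(observed, expected))

def psi_bernoulli(observed, probs):
    N = sum(observed)
    return 10 * sum(n * math.log10(n / (N * p))
                    for n, p in zip(observed, probs) if n > 0)

def compare(observed, probs):
    """Return (psi, its second-order approximation (5/ln 10) * chi^2)."""
    N = sum(observed)
    c = chi2_stat(observed, [N * p for p in probs])
    return psi_bernoulli(observed, probs), (5 / math.log(10)) * c

# Small deltas (52/48 against a fair coin): psi and ~2.17*chi^2 nearly agree.
print(compare([52, 48], [0.5, 0.5]))
# A huge delta on a rare outcome (the edge): the approximation falls apart.
print(compare([14, 14, 1], [0.499, 0.499, 0.002]))
```

In the first case the two numbers agree to about three decimal places; in the second they differ by some 25 decibels, because $\delta_{\text{edge}} = (1 - 0.058)/0.058 \approx 16$ is nowhere near the small-$\delta_k$ regime.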
Let’s get back to our tossed coin example. What are the associated $\psi$ values?

$$\psi_E = 10\left[2\cdot 14\log_{10}\frac{14}{14.471} + \log_{10}\frac{1}{0.058}\right] \approx 8.34\text{ dB}$$

$$\psi_M = 10\left[2\cdot 14\log_{10}\frac{14}{9.667} + \log_{10}\frac{1}{9.667}\right] \approx 35.2\text{ dB}$$
The results agree much more nicely with our intuitions. What Ms. M finds out is that there is another hypothesis about the coin in the Bernoulli class that’s 35.2dB better than hers (that’s odds of over 3300:1), and that Mr. E’s hypothesis is better than hers by some 26.8dB and is only 8.34dB away from the best hypothesis in the Bernoulli class.
Why does this happen? Why is the $\chi^2$ test saying such outrageously different things?
There are two reasons. The first is that the squaring in the sum severely overpenalises outliers. The second is that you then divide that squared difference by the expected value, which amplifies outcomes with tiny expected counts even further. To see this, let’s suppose that instead of observing an edge, we observed tails, so that there were 14 heads and 15 tails. In this case:

$$\chi^2_E \approx 0.093, \qquad \chi^2_M \approx 14.55$$

$$\psi_E \approx 0.33\text{ dB}, \qquad \psi_M \approx 51.1\text{ dB}$$
Now they agree. You see, under Mr. E’s hypothesis, he should’ve observed 14.471 heads, 14.471 tails, and 0.058 edges, and the $\chi^2$ hugely amplified that unexpected outcome of 1 edge.
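For the skeptical reader, all four $\psi$ values from the two scenarios in one place:

```python
import math

def psi_bernoulli(observed, probs):
    N = sum(observed)
    return 10 * sum(n * math.log10(n / (N * p))
                    for n, p in zip(observed, probs) if n > 0)

probs_E = (0.499, 0.499, 0.002)   # Mr. E: heads, tails, edge
probs_M = (1 / 3, 1 / 3, 1 / 3)   # Ms. M: indifference

# With the edge observed: Mr. E is close to the best in the class, Ms. M is not.
print(psi_bernoulli([14, 14, 1], probs_E))  # ≈ 8.34 dB
print(psi_bernoulli([14, 14, 1], probs_M))  # ≈ 35.2 dB
# With 14 heads / 15 tails instead: chi-squared and psi agree on the ranking.
print(psi_bernoulli([14, 15, 0], probs_E))  # ≈ 0.33 dB
print(psi_bernoulli([14, 15, 0], probs_M))  # ≈ 51.1 dB
```

Unlike the $\chi^2$, the rare-but-observed edge costs Mr. E only the decibels it deserves, rather than dominating the whole statistic.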
A Bayesian test statistic talks only about comparisons between hypotheses, and Bayes would never tell you to reject a hypothesis unless it had a better alternative to offer. Furthermore, even then it wouldn’t say “reject it,” it would rather say “this hypothesis is a better fit for the data than that one,” and then you’d need to integrate your prior knowledge into it to know exactly what your posterior probabilities for the hypotheses ought to be.