## Orthodox test statistics and the absence of alternatives

Recently, while reading Jaynes’ Probability Theory: the Logic of Science, I was overcome by an urge to rant about some of the stuff I read there, and one of the things I said was that, basically, orthodox statistics sucks.

Well, first off, what is this orthodox statistics I’m talking about, and what would non-orthodox statistics be?

That… is a very long story. I’ll try to make it more reasonably short. Basically, in the past few hundred years the field of “probability” showed up to measure a whole buncha things, all of them more or less about giving a formal description of how to draw valid conclusions from experiments, what you should believe, etc. It’s supposed to be, in Jaynes’ words, the logic of science.

Enter Sir R. A. Fisher and some others who are very opposed to the idea that probability theory as logic makes any sense, and then they completely destroy the field with nonsense about potentially infinitely repeated experiments, ad hoc tools, and objective probabilities. I’m not going to go much further into this story, but rather show an example of the failures of orthodox statistics, and how Bayesian Probability Theory does it better.

I’ll use Pearson’s $\chi^2$ (read: chi-squared) test. If you have more than a passing acquaintance with orthodox statistics, you probably know the $\chi^2$ test. Even if you don’t, it’s not in fact too unlikely that you may have heard of it at some point or another. It’s something called a “test statistic” and it’s what orthodox statisticians use for hypothesis testing.

$\chi ^2 \equiv \sum\limits_{i=1}^m\frac{(O_i-E_i)^2}{E_i}$

It’s a purported measure of “goodness of fit” of a hypothesis to the observed data. Basically, given $m$ possible outcomes of an experiment repeated a bunch of times (the data), the “null hypothesis” says that the expected number of times the $i^{th}$ outcome is observed in the experiments is $E_i$. $O_i$ are the actual observed numbers. The closest the above number is to zero, then, the closest the observed distribution is to the expected one. And in fact, $E_i$ and $O_i$ don’t have to necessarily be numbers of trials; they can be any kind of random variable, as long as they’re all greater than or equal to 5, or they’re all greater than or equal to 1 and at least 20% of them are greater than or equal to 5 (the fact that this caveat exists should start making you suspicious).

Now, orthodox statisticians have developped quite a theory around this. Here’s how it works. There’s a probability density function called the chi-squared distribution, and if the random variables are “independent and standard normal,” the sum of their squares is distributed according to that pdf. Then, one compares the calculated value with that distribution (or, rather, with the cumulative distribution) and figures out what the “p-value” of that test statistic is.

In the above, the degrees of freedom equals intuitively the number of things that “can vary” in the distribution. For example, when throwing a six-sided die, there are n = 6 outcomes but k = 5 degrees of freedom, because if we didn’t observe the five first outcomes, we necessarily observed the sixth. The number k isn’t necessarily always equal to n – 1, however. It is rather n – p where p = s + 1 and s is the number of “co-variates” in the distribution. This isn’t really important to the understanding of the concept, and different fittings have different numbers of co-variates.

But that only moves the question to: what is a p-value?

You might have heard of or seen papers and studies that say that a conclusion was reached with p < 0.05. The intuition behind the p-value is that you take some test statistic (such as the $\chi^2$) and you figure out what the probability that, on the null hypothesis, that test statistic would have a result at least as extreme (i.e. different from expected) as the one observed is. Therefore, in an experiment with for instance one degree of freedom (such as the tossing of a coin), values of $\chi^2$ greater than approximately 4 have a probability less than 5% of being observed.

And what’s the point of this? Orthodox statistics holds that if the p-value for your chosen statistic is less than some arbitrary number, then you should reject whatever your “null hypothesis” is (one would suppose that in the case of a coin the null hypothesis says the coin is fair). And… that’s it. Orthodox statistics tells you that if, “on chance alone,” there’s a less than (say) 5% probability of observing what you did, you should mark the null hypothesis as false.

Note that, for orthodox statistics, probability is the same thing as frequency. So what they really mean is that, if the results obtained would be observed in less than p (which is usually, as mentioned, 5%) of similar experiments if the null hypothesis were true, then you should reject that hypothesis.

Now, there is a huge number of things wrong with this picture. Let’s see what they all are.

First, the p-value chosen is completely arbitrary. And the usual one, 5%, is, to say the least, huge. One twentieth of correct results would be rejected by that. If you’re not absolutely outraged by this, you should be. If I recall correctly, physics uses a value like 0.01, which is better, but only 5 times so, and still completely arbitrary.

Second, using different test statistics may give you different p-values. If you use the $\chi^2$ you may get p = 0.07 but using the Student’s t-test will get you p = 0.04 (I’m guessing here when using these two particular examples, for all I know these two may be correlated in a way that makes this impossible, but the general principle holds that two test statistics may not say the same thing). So even the choice of which test statistic to use is arbitrary (somewhat, because some test statistics can’t be used for some kinds of experiments).

Third, the list of possible test statistics to pick is also arbitrary! Why would you choose the $\chi^2$ as a measure of fit? Why compare your $\chi^2$ to some imaginary $\chi^2$ distribution as opposed to comparing, say, $\chi$, or $\chi^3$, or $\chi^4$? Why use a t-distribution? Why those tests and not others? Since these test statistics are really only based upon certain statisticians’ intuitions, one shouldn’t expect to have a consistent answer to this.

Fourth, as I illustrated with the conditions of applicability of the $\chi^2$ test, they’re not universally valid. Since they’re ad hoc tools, they’re only applicable to some specific domains where they’re supposed to more-or-less work somewhat reliably.

And fifth… a p-value makes absolutely no mention of alternatives or prior knowledge. It says you should “reject” a hypothesis if its arbitrary p-value is below some arbitrary threshold, and “keep” it otherwise, but it doesn’t tell you what to use in its place. To the orthodox statistician, prior knowledge is metaphysical nonsense, and so is to talk about the probability of a hypothesis $P(H|X)$! But if you can’t talk about that, then you can’t really say whether you should trust your hypothesis, and what they offer instead are these test statistics that only tell you how well your hypothesis predicted the data, and they don’t do a very good job at that either.

See, this is what I mean when I say sometimes I look at orthodox statistics and screech in frustration.

One could argue that if some hypothesis $H$ gives a low p-value, then you could just use the hypothesis $\bar H$ instead, which is just the “ensemble of all hypotheses that are not $H$.” That too makes no sense because… well, $\bar H$ is an average of every hypothesis that’s not $H$, and it’s quite likely that the vast majority of them are worse fits to the data than $H$ itself, which would make $\bar H$ have an even lower p-value than $H$.

In Bayesian terms, given some observed data $D$ and some statistic $s(D)$, the p-value is defined as $P(s(x) \text{ at least as extreme as } s(D) | HX)$, which we’ll call just $P(s|HX)$. Now, $P(s|\bar HX) = \frac{P(\bar H|sX)P(s|X)}{P(\bar H|X)}$ which… doesn’t actually make any reference to $P(s|HX)$! That is, the p-value of some data on some hypothesis $H$ is not directly coupled to the p-value of that data on its negation, and they can be both arbitrarily low!

So if you don’t specify an alternative hypothesis, orthodox statistics is silent. It just tells you that its arbitrary, ad hoc tool says your null hypothesis is bad and you should feel bad.

And as I mentioned, we’re not in fact interested in $P(s|HX)$, we’re interested in $P(H|DX)$ which is:

$P(H|DX) = \frac{P(D|HX)P(H|X)}{P(D|HX)P(H|X)+P(D|\bar HX)P(\bar H|X)}$

So even if we choose to approximate $P(D|HX)$ by $P(s|HX)$, this still does mention $P(s|\bar HX)$, which can be less than $P(s|HX)$, and it also mentions the prior knowledge $P(H|X)$, which is completely ignored by the test statistic itself.

Furthermore, the fact that the statistic is arbitrary means that it will almost surely give you nonsensical results when pushed to a domain it wasn’t specifically designed to deal with, unless by luck you happen to get a test statistic that is actually derivable from Bayes’ Theorem.

I’ll show you both things now with a little story.

Suppose Mr. E is on Earth with a British pound coin and his assigned probabilities are that, after it’s tossed, the probability of heads is 0.499, the probability of tails is 0.499, and the probability of landing on the exact edge is 0.002. However, he tells Ms. M, in Mars, only that he’s running an experiment that has three possible outcomes. By the principle of indifference, Ms. M’s null hypothesis is that each outcome has a 1/3 probability of being observed.

Then Mr. E tosses the coin 29 times, and it turns up heads 14 times, tails 14 times, and edge once. He tells her that, and they calculate the $\chi^2$ of that data based on their null hypotheses.

Mr. E’s expected results are 29*0.499 heads, 29*0.499 tails, and 29*0.002 edges. His $\chi^2$ is, then:

$\chi^2_E = 2\frac{(14-29*0.499)^2}{29*0.499}+\frac{(1-29*0.002)^2}{29*0.002} = 15.33$

Now, this is a bit disconcerting. If you look at a chi-squared distribution table, it tells us that after observing this data, Mr. E should reject his hypothesis as false!

But wait, it gets worse. Now Ms. M calculates her own $\chi^2$:

$\chi^2_E = 2\frac{\left(14-29*\frac 1 3\right)^2}{29*\frac 1 3}+\frac{\left(1-29*\frac 1 3\right)^2}{29*\frac 1 3} = 11.66$

The table also says that her hypothesis should be rejected, true, but the value of the $\chi^2$ says that it’s a better fit than Mr. E’s! She will look at her $\chi^2$, look at Mr. E’s, and say, “Well, our hypotheses are both fairly flawed, but yours is certainly worse than mine!”

That’s clearly preposterous. Most people trained to use the $\chi^2$ test will be very surprised by this, and will try to check this calculation, but it’s correct, and it’s just what one would expect from using completely arbitrary tools not derived from basic principles. Note that this isn’t even outside of the scope of the test, given that 2/3 of the quantities are greater than 5 and 1/3 is equal to 1.

How would a Bayesian do it, then?

Well, I have mentioned that the strength of a hypothesis can only be measured against other hypotheses. So Bayes will never tell you to outright reject some hypothesis unless it can clearly indicate you an alternative.

And with that philosophy… it turns out that we can, in fact, devise a test that’s more or less like a test statistic, but has the advantages of being derivable from first principles (which means it is universally valid) and it doesn’t actually say “reject this hypothesis” or anything like that, but rather what says “this data can not support competing hypotheses by more than some given amount.”

I’ll explain. I’ve talked about the evidence which is exactly equivalent to Bayes’ Theorem. Suppose I have some background information $X$, some data $D$, and a hypothesis $H$. Then, if I’m comparing it to any single alternative hypothesis $H'$ (that is, in my restricted universe of discourse the negation of $H$ is $H'$), I have that:

$e(H|DX)=e(H|X)+10\log_{10}P(D|HX)-10\log_{10}P(D|H'X)$

Now I’m going to rewrite the above as

$e(H|DX) = e(H|X)-\psi$

where

$\psi\equiv-10\log_{10}P(D|HX)+10\log_{10}P(D|H'X)$

Since $P(D|HX)$ and $P(D|H'X)$ are both $\leq 1$, their logarithms are negative or zero, so the first term of the above is positive and the second is negative. Now, hold $H$ fixed, and let $H'$ range over every possible hypothesis (we can compare hypotheses pairwise because they really only compete with each other pairwise). $\psi$ attains its maximum value when $H'$ is the “sure thing” hypothesis that says that everything that was observed couldn’t have been otherwise, $P(D|H'X) = 1$. In that case, we have $\log_{10}P(D|H'X)=0$ and can define the following quantity:

$\psi_{\infty}\equiv-10\log_{10}P(D|HX)>0$

And thus

$e(H|DX)\geq e(H|X)-\psi_{\infty}$

necessarily. This means that there is no possible alternative hypothesis which data $D$ could support, relative to $H$, by more than $\psi_{\infty}$ decibels. This is to say that Probability Theory cannot answer “How well does data $D$ support hypothesis $H$?” because that question makes no sense; rather, it answers “Are there any alternatives $H'$ which data $D$ would support relative to $H$? How much support is possible?”

And this is what I’d call a “Bayesian test statistic.” It’s a number that depends exclusively on the hypothesis under consideration and the data, and the closer it is to zero, the better hypothesis $H$ fits data $D$. And unlike the $\chi^2$ test, this was derived directly from Bayes’ Theorem, which means that it will not suffer from the weaknesses of the ad hoc orthodox tools. For any given hypothesis, the closest that number is to zero, the best fit it is. So, even though this statistic doesn’t directly mention an alternative hypothesis, it’s still Bayesian in spirit because it’s based on an implicit class of alternative hypotheses.

And this number is also, unlike the $\chi^2$, universal. You can use it for any class of hypotheses. But if we do restrict our attention to the domain where the $\chi^2$ is applicable, we can actually do even better than the $\psi_{\infty}$.

Suppose the hypothesis $H$ is in the “Bernoulli class” (that is, $n$ consecutive trials with $m$ possible outcomes, each outcome with a probability that’s independent of other outcomes, like the toss of a coin). If we call the $k^{th}$ trial $x_k$, then the data $D$ is just all of those outcomes one after the other, and we have that

$P(D|HX)=P(x_1...x_n|HX)=p_1^{n_1}...p_m^{n_m}$

where the $n_k$ are the number of times the $k^{th}$ outcome was observed in those $m$ trials, and $p_k$ are the probabilities that a given trial had that outcome, with $\sum_kp_k=1$ and $\sum_kn_k=n$ being the total number of trials.

Now, given any observed sequence of outcomes $D$, the hypothesis that fits it best is the one that predicts the exact observed frequencies. To show this, let’s call the observed frequencies $f_k=\frac{n_k} n$. Using the fact that, on the positive real line, $\log(x)\geq(1-\frac 1 x)$ with equality iff x = 1, and choosing $x = \frac{n_k}{np_k}$, we have that

$\sum\limits_{k=1}^mn_k\log\left(\frac{n_k}{np_k}\right)\geq\sum\limits_{k=1}^mn_k\left(1-\frac{np_k}{n_k}\right)\geq 0$

with equality iff $p_k=f_k$, and with m being the number of possible outcomes. Why is this relevant? If we call $H^*$ the hypothesis of perfect fit:

\begin{aligned} \sum\limits_{k=1}^mn_k\log\left(\frac{n_k}{np_k}\right) &= \sum\limits_{k=1}^m(\log f_k^{n_k}-\log p_k^{n_k}) \\ &= \log\left(\prod\limits_{k=1}^mf_k^{n_k}\right)-\log\left(\prod\limits_{k=1}^mp_k^{n_k}\right) \\ & = \log P(D|H^*X) - \log P(D|HX) \end{aligned}

So for any hypothesis H, the above is 0 only when $H=H^*$, which means the hypothesis with the best possible fit in the Bernoulli class is exactly $H^*$.

Now, let’s bring $\psi$ back:

$\psi\equiv-10\log_{10}P(D|HX)+10\log_{10}P(D|H'X)$

But if you take $H'$ to be $H^*$, then that’s just the thing we just derived. So we have a refinement:

$\psi_B\equiv 10\sum\limits_{k=1}^mn_k\log_{10}\left(\frac{n_k}{np_k}\right)$

The closest this number is to zero, the better your hypothesis fits the data. And it’s stronger than the $\psi_{\infty}$, because it restricts its attention to the Bernoulli class, and perfect fit means that the observed frequencies were exactly the predicted ones.

Now let’s see an interesting thing. Define a quantity:

$\Delta_k=\frac{n_k-np_k}{np_k}$

With this we can find the following three identities:

$\sum\limits_{k=1}^mnp_k\Delta_k=\sum\limits_{k=1}^m\left(n_k-np_k\right)=\sum\limits_{k=1}^mn_k-n\sum\limits_{k=1}^mp_k = n-n = 0$

$1+\Delta_k=1+\frac{n_k-np_k}{np_k}=\frac{n_k}{np_k}$

$np_k\left(1+\Delta_k\right)=n_k$

And let’s take a look at our $\psi_B$ again. We can reexpress it using the last two identities:

$\psi_B = 10\sum\limits_{k=1}^mnp_k\left(1+\Delta_k\right)\log_{10}\left(1+\Delta_k\right)$

Using the Taylor expansion of the logarithm around 1, we have that $\ln(1+\Delta_k)=\Delta_k-\frac 1 2\Delta_k^2+\frac 1 3\Delta_k^3 \pm ...$. Note that this expansion is only valid when $\Delta_k$ is relatively small (it has to be between -1 and 1). We can use that:

\begin{aligned} \psi_B &= 10\sum\limits_{k=1}^mnp_k(1+\Delta_k)\log_{10}(1+\Delta_k) \\ &\approx 4.34\sum\limits_{k=1}^mnp_k(1+\Delta_k)\left(\Delta_k-\frac 1 2\Delta_k^2+\frac 1 3\Delta_k^3 \pm ...\right) \\ &= 4.34\sum\limits_{k=1}^mnp_k\left[\left(\Delta_k-\frac 1 2\Delta_k^2 + \frac 1 3 \Delta_k^3 \pm ...\right)+\left(\Delta_k^2 - \frac 1 2\Delta_k^3+\frac 1 3\Delta_k^4 \pm ... \right)\right] \\ &= 4.34\sum\limits_{k=1}^mnp_k\left(\Delta_k+\frac 1 2\Delta_k^2 - \frac 1 6\Delta_k^3 \pm ...\right) \\ & \approx 4.34\sum\limits_{k=1}^mnp_k\left(\Delta_k+\frac 1 2\Delta_k^2\right) \\ & = 4.34\sum\limits_{k=1}^mnp_k\Delta_k +2.17\sum\limits_{k=1}^mnp_k\Delta_k^2 \\ & = 2.17\sum\limits_{k=1}^mnp_k\Delta_k^2 \end{aligned}

The second line is because we’re taking the logarithm to base 10 instead of the natural logarithm so we have to divide everything by $\ln 10$, and the last equality is true because of our first identity. Now if you expand that last sum a bit:

$\sum\limits_{k=1}^mnp_k\Delta_k^2 = \sum\limits_{k=1}^mnp_k\left( \frac{n_k - np_k}{np_k} \right)^2 = \sum\limits_{k=1}^m\frac{(n_k-np_k)^2}{np_k}$

But wait. $n_k$ are just the observed outcomes $O_k$ and $np_k$ are the expected outcomes $E_k$ of the $\chi^2$ test! And this means that:

$\psi_B\approx 2.17\sum\limits_{k=1}^m\frac{(n_k-np_k)^2}{np_k} = 2.17 \sum\limits_{i=1}^m\frac{(O_i-E_i)^2}{E_i} \propto \chi^2$

Or, in other words, the surprising result is that $\psi_B \propto \chi^2$!

Well. Not really. See, what’s really the case is that $\chi^2$ is a second-order approximation to $\psi_B$ when the condition that all the $\Delta_k$ have modulus less than one is met. And predictably, when that condition is not met, we will find the $\chi^2$ test to be lacking.

Let’s get back to our tossed coin example. What are the associated $\psi_B$?

\begin{aligned}\psi_{B_E} &= 10\left[28\log_{10}\left(\frac{14}{29*0.499}\right) + \log_{10}\left(\frac{1}{28*0.002}\right)\right] \\ &= 8.34\text{dB} \\ \psi_{B_M} &= 10\left[28\log_{10}\left(\frac{14}{29*\frac 1 3}\right)+\log_{10}\left(\frac 1 {29*\frac 1 3}\right)\right] \\ &= 35.19\text{dB} \end{aligned}

The results agree much more nicely with our intuitions. What Ms. M finds out is that there is another hypothesis about the coin in the Bernoulli class that’s 35.2dB better than hers (that’s odds of over 3300:1), and that Mr. E’s hypothesis is better than hers by some 26.8dB and is only 8.34dB away from the best hypothesis in the Bernoulli class.

Why does this happen? Why is the $\chi^2$ test saying such outrageously different things?

There are two reasons. The first is that the squared part in the sum severely overpenalises outliers. The second is that then you divide that difference by the expected value, which also contributes to that. To see this, let’s suppose instead of observing an edge, we observed tails, so that there were 14 heads and 15 tails. In this case:

\begin{aligned} \psi_{B_E} = 0.30\text{dB} \ \ \ \ \ \ \ \ \chi_E^2 = 0.0925 \\ \psi_{B_M} = 51.2\text{dB} \ \ \ \ \ \ \ \ \chi_M^2 = 14.55 \end{aligned}

Now they agree. You see, under Mr. E’s hypothesis, he should’ve observed 14.471 heads, 14.471 tails, and 0.058 edges, and the $\chi^2$ amplified hugely that unexpected outcome of 1 edge.

A Bayesian test statistic talks only about comparisons between hypotheses, and Bayes would never tell you to reject a hypothesis unless it had a better alternative to offer. Furthermore, even then it wouldn’t say “reject it,” it would rather say “this hypothesis is a better fit for the data than that one,” and then you’d need to integrate your prior knowledge into it to know exactly what your posterior probabilities for the hypotheses ought to be.

This entry was posted in Mathematics, Probability Theory and tagged , , , , . Bookmark the permalink.

### 15 Responses to Orthodox test statistics and the absence of alternatives

1. The psi statistic is used in the orthodox likelihood-ratio test.

More precisely, in the jargon of orthodox hypothesis testing: When you test a simple null hypothesis against a composite alternative, and the null is a special case of the alternative, the test statistic used by the likelihood ratio test is psi.

See here for a definition of the likelihood ratio test: http://en.wikipedia.org/wiki/Likelihood_ratio_test#Definition_.28likelihood_ratio_test_for_composite_hypotheses.29

This is trivia, I guess. The point still stands: in the orthodox theory, the statistics don’t really come from anywhere, and the choice is arbitrary. Whereas, within Jaynes’s probability-as-logic theory, there is a strong reason to expect psi to represent fit, because it’s a measure of evidence.

The orthodox way to compare them, I guess, would be to compare the power of the tests. I found a power comparison, in a paper by Koehler & Larntz called “An Empirical Investigation of Goodness-of-Fit Statistics for Sparse Multinomials”.

Here, “Xk2” is the chi-squared statistic for k categories, and “Pearson test” refers to the same thing. And “Gk2” is the likelihood ratio test statistic. Tk is the vector of true probabilities.
“Monte Carlo power comparisons showed that, for the null hypothesis of symmetry, Xk2 is slightly more powerful for near alternatives. The Pearson test is decidedly dominant as the alternative moves toward a boundary of Tk that contains a high proportion of zeros and a few relatively large probabilities. The likelihood ratio test is dominant at alternatives that lie near boundaries of Tk that contain a small proportion of near-zero probabilities and have nearly equal probabilities in the remaining cells. This pattern agrees with observations made by West and Kempthorne (1971) from exact computations for two-, three-, and four-cell examples. The boundaries near which Gk2 is dominant become close to the symmetrical null hypothesis in the Euclidean sense as k becomes large, but the areas where Xk2 is more powerful may not. This indicates that Xk2 is more powerful than Gk2 for a very large portion of the simplex when k is moderately large.”

I was kind of surprised at first that psi aka Gk2 isn’t always more powerful, since psi makes more sense. But the weird thing about power analysis is that the power is a function of the true state of reality, in this case Tk. That’s weird because a criterion for judging a test should be a function of your information, right? “Closeness to ideal Bayesian inference” for example is a function of your information. So, a Bayesian test will be less powerful than another test in some possible worlds. Because, like, that other test could be a Bayesian test with extra information added in, and it’ll be more powerful in all the possible worlds where that information turns out to be true.

But, interestingly, they do find something that resembles what’s shown in this article. This article shows that, in situations where a cell probability is low, chi-squared is an obviously incorrect measure of fit. Whereas Koehler & Larntz noted that in exactly these situations, the chi-squared test is less powerful.

2. Will says:

I don’t think you actually understand what a sampling statistic is/what a chi^2 test is doing. If I have N data points I plant to draw independently from a probability distribution, then the joint probability distribution will get very large and unwieldy as N grows.

The idea of a sample statistic is to say ” can I create something that’s simpler to work with than these large joint probability distributions?” For instance, I could say “I’m going to look at the mean of my N data points,” and then any statistician would say “well, the mean is really just a sum of N random variables, so if N is very large, then I can formulate an equivalent problem using the central limit theorem. If you only care about the mean, you can say the mean is drawn from a one dimensional normal distribution, as long as N is very large.” In this case, we’d say the mean is the sample statistic, and the sampling distribution is the normal distribution.

You can construct all sorts of statistics from the data, and then you have to work to figure out what the sampling distribution is. Usually in doing so, you make some approximations. The chi^2 sampling distribution is built like this- and sometimes those approximations are bad. This has nothing to do with “frequentism” or “bayesianism” its just manipulating probability distributions.

BOTH BAYESIANS AND FREQUENTISTS DO THIS. Its very often easier for a Bayesian to replace conditional distributions of unknown parameters given the data by conditional distributions of unknown parameters given some sample statistics/using the sampling distributions.

Your problem is a case where the approximations requiring the chi^2 statistic is not well approximated as being distributed via a chi^2 distribution, because of the low expected cell count. If instead you build the sampling distribution via simulation, you can get ok results in this case. (there is another issue that creeps up if you have lots of cells with low counts).

You should be careful to separate the pure math of probability theory from the philosophy (frequentist vs. bayesian) that you seem to want to talk about.

• pedromvilar says:

Oh, I know, and you’re absolutely right! My problem with frequentism, though, and that I was trying to exemplify with this post, isn’t the use of chi^2 or other test statistics per se – I’ve shown at the end that chi^2 is a second-order approximation to the log-likelihood. Rather there are two large problems with frequentist methods: one is that they don’t always use methods that are actually perfectly derivable from first principles, and use a lot of ad hoc tools and intuitions; the other is that even when they do, they frequently fail to acknowledge exactly why the method they’re using works and what it’s approximating, etc (also they scorn prior knowledge and pretend it’s not there while it’s very clearly there in the choice of model).

What I was trying to show here was exactly the philosophy difference, which is one of tools vs. laws: frequentism sees its methods as tools to measure something, whereas Bayesianism sees its methods as optimal laws of inference whose violation necessarily implies a misapplication or misuse of cogent information. I was also trying to explain why the concept of a p-value is misguided in judging hypotheses, which is another failure of frequentism.

• Will says:

Yes, you’ve shown chi^2 can approximate the log likelihood. You haven’t shown that the chi^2 statistic’s sampling distribution is the chi^2 distribution (unfortunately, both use the same name in this case). In this case, the core problem is that the chi^2 statistic does not follow a chi^2 sampling distribution, which is why the test gives the wrong answer.

But these methods aren’t ad-hoc in the way you are suggesting- there are always derivations starting from the probability distribution that lead to the sampling distribution, and usually the convergence properties are very well studied.

Sure, they can be used in an ad-hoc way in real-world problems, the same way priors are used in ad-hoc ways in real problems. Real problems are hard.

The frequentist is interested in “If I take this boring model, how likely is it to produce data at least this extreme?” Everything else about the test statistics, etc is just manipulation of probability distributions to make that particular problem tractable. The goal here is to reject the boring model, so that you can say “I’m seeing something thats not boring. I don’t know what yet.” Picking a specific p-value that means “worth paying attention to” is an arbitrary way to do that, sure, thats a valid criticism.

The Bayesian is interested in “if I take this class of models, which is the most likely to produce the data I’m seeing?” Knowing that, they can update their prior, whatever it may be. They ALSO use a lot of tricks to make the problem tractable, but usually the tricks are pretty much the same as frequentist tricks because they tend to look at the same classes of probability distributions.

Which question to ask, the frequentist one or the Bayesian one, is a philosophical question. The frequentist question is usually easier to answer, and sometimes its a very useful question (think particle physics – what is usually most important is to reject the standard model, not to figure out right away exactly which alternative you are looking at).

• pedromvilar says:

I don’t think I was suggesting that they come from literally nothing, and I apologise if it sounded that way. I may have been hyperbolic in the original post xP And yes, I know that priors are used in ad hoc ways just like frequentist tools in the real world. What I’m trying to point out here is a difference between, shall we say, ideal-world reasoning and frequentist reasoning. And the fact is that statistics books I’ve read and classes I’ve taken sorta don’t assume that there’s any unifying principle behind probability, and basically just throw a bunch of possible models at me and tell me that my intuition will take care of choosing the right one; the Bayesian ones are at least ashamed of it 😛

And about particle physics: I disagree! I think it’s much more important to figure out which alternative I’m looking at than rejecting the standard model. Or, well… I guess I mostly mean that saying “reject the standard model” doesn’t actually tell me anything. I’m not any better than I was before, and I’m still going to use the standard model until I find something better. So, basically, I’m trying to defend the Bayesian question as the right one.

3. Will says:

How are you defining “ideal world” reasoning? I feel like you originally came came to the conclusion that frequentism wasn’t ideal because you erroneously believed that frequentist tests weren’t derivable from first principles (they are!), and now you are retreating to this idea that frequentism is not ideal because it asks a different question than Bayes. What makes the Bayes question better, or “right”?

It seems silly to me to say “this question is better.” Why shouldn’t I ask all the questions that I can formulate? Nothing about probability theory itself tells me what questions is “best”, and even in pure-math contexts, there are known problems where Bayesian methods are inefficient (in a mean-square-error sense), and known problems where Bayes fails to converge to the right answer(see Diaconis and Freedman’s famous paper which kicked off all the non-parametric Bayes stuff).

There are also messy real-world complication- there are lots of cases where the frequentist question is easy to answer but the Bayesian question is impossible to formulate. This isn’t an approximation to Bayes, its reformulating a new question that can be answered.

Even in real-world problems that seem ideal for Bayes, Bayes sometimes isn’t enough and you need to generalize it and go to Dempster-Shafer to get anything sensible. Again, this isn’t an approximation to Bayes, its generalizing it so it can handle conflicting data in a smoother way.

In regards to particle physics- the problem is that the standard model is very well developed in a way that alternative models are not, so its not all clear what to compare to the standard model (is it fair to compare a fixed parameter model to something with multiple floating parameters? Obviously the maximum likelihood floating parameter models will all appear more likely). So what particle physicists want to do is collect observations that don’t fit the standard model and use them as starting points to develop new models. Before you can test an alternative model, you need the data to formulate it. You are suggesting that its not worth knowing where your current model fails if you don’t have an alternative- but if you aren’t careful about finding out where it fails, how do you develop the alternative?

• pedromvilar says:

Whoa whoa, hang on a second there. They *are* derivable from first principles? That’s new information. I know they’re not derivable from literally nowhere, but I don’t actually know that they’re all derivable from a single unifying set of first principles. What are those?

I say “this question is better” in the sense that it actually tells me something. Frequentism may tell me where to look, but Bayesianism *also* does that. Frequentism doesn’t tell me what to do once I’m looking, at least not as firmly as Bayesianism does (or at any rate, as it does once I’ve arbitrarily set my priors). Also, mean-square-error is also arbitrary x)

As for the real-world complication, yeah, I’m very aware of it. But again, my problem here is the failure of acknowledgement. It may be the case that professional statisticians or philosophers or whatever know that the frequentist question yak yak, but that knowledge is most definitely *not* consistently passed on to the people who go to university courses/read books about statistics. *Those* people just learn mindless tools they’re supposed to throw at problems to see if they’ll go away.

As for particle physics, I’m not in any way suggesting that it’s not worth knowing where my current model fails without an alternative. But this “knowing where it fails” isn’t, as far as I can tell, something exclusive to frequentism. You can very well look at a set of data that’s too extreme for your tastes, conclude that something’s off, and go looking for it; the problem is that when you turn that into a law and create p-values and teach them to poor confused undergrads in universities and textbooks and tell them That’s The Way It Is, you make it look Official and Correct as opposed to A Way To Try To Heed Our Intuition.

Anecdotal evidence: friend of mine working on his Master’s degree in biology asked me for help with the statistics part, we collected the data, made a graph, a report on the results, and then when we showed the Ph.D.s that were overseeing the project, they said they had no idea what the statistics meant except it looked pretty and had a nice p-value.

Teach the controversy! 😛

• Will says:

Once you define what probability distribution you are looking at, everything else is just calculus/manipulation of distributions. So if you have a a gaussian variable X, you can show that the sample variance follows a chi^2 distribution, that the t-statistic follows a t distribution,etc. There is no frequentism vs Bayes there- it’s just “if this variable is gaussian (or whatever), how is this function of the random variable distributed?” It’s all derivable from just calculus + however your underlying variables are distributed.

If those phd students working with data don’t understand statistics, their adviser should be shot.

But the reason that “recipes” are often taught to undergrads is that undergrads often demand recipes- and take courses to teach them those recipes. And we try to teach stats without calculus which is stupid.

• pedromvilar says:

Once you define what probability distribution you are looking at

How do you define what probability distribution you’re looking at?

Anyway, sure the sample variance may follow a chi^2 distribution or whatever; still doesn’t justify the existence of arbitrary p-values and student’s t tables with rejection values and whatnot, which are completely arbitrary, and not explicitly so (in the sense that people use them and reach “conclusions” without modalising them with the confidence (or lack thereof) that should be attached to such arbitrary number).

If those phd students working with data don’t understand statistics, their adviser should be shot.

I dunno about you, but in my experience pretty much no one, adviser or student or researcher, that writes a paper understand any statistics beyond “use this or that tool/recipe.”

• Will says:

You define probability distributions based on the problem. The same way as likelihoods in Bayesian methods.
And sure, which p-value you should use as a ‘reject’ cut off are going to be somewhat arbitrary, but so are priors. No inference method is perfect. Student’s t-distribution tables aren’t arbitrary at all though, they are just tabulations of values of the t distribution.
Have you taken an actual statistics class? A lot of your confusion seems to be over what the usual statistical tests are doing.

• pedromvilar says:

Priors aren’t completely arbitrary, but that’s beyond the point, which I mentioned earlier as being the fact that it’s not acknowledged that there’s nothing special about p-values. That a frequent claim against Bayesianism is that it’s subjective and arbitrary whereas frequentism is objective in some sense, which is false.

And I’m basing my arguments on my actual statistic classes, which have been exactly this – Professor teaches a tool, gives me a problem to use that tool in, I solve the problem, I get a mark – and on papers I have read, which pick-and-choose their methods and use arbitrary p-values to define when they should accept or reject a hypothesis (with differing values even in the same papers).

4. Will says:

Then I apologize for statistics classes that are not serving you well. I’d recommend Feller’s An Introduction to Probability Theory and Its Applications.

• pedromvilar says:

Thank you! I’m also planning on reading Kendall’s vol. 2a and 2b, what do you think of them?

• Will says:

I’ve only perused them as a reference, never tried to learn from them.