Truth, Probability, and Unachievable Consistency

What is truth?

Many an author has written lengthy philosophical treatises that begin with exactly this question, but, however shaky my identification with the group may be, as a rationalist my first and foremost answer to that question – or my first and foremost interpretation of that question – is a practical one. And to help with it, let’s first ask rather how we identify truth.

Whatever metaphysical definition you might be going with, whether you’re a Tegmark-Level-IV mathematical realist or some form of solipsist, if you ask someone the colour of the sky on a clear cloudless sunny day, they’d probably answer it’s “blue” – unless, of course, you started with a big preamble about “what truth is” or made the conversation seem any more than just a question, in which case they might go off on a philosophical tangent. But for the purposes of this post, let’s assume that’s not what’s happening, and they’ll just answer you it’s blue. If that’s a problem, maybe ask them another question, one that isn’t so obviously tied up with philosophical conundrums, such as “Who is the current President of the United States?” (it’s Barack Obama) or “What’s Brazil’s official language?” (it’s Portuguese).

Regardless of anything else, to a broad first approximation and to all intents and purposes that come up in daily life, we can say that the above answers are true. It’s true that, today, Obama is President of the United States, we speak Portuguese in Brazil, and the sky is blue on a clear sunny day. It’s true that if you walk off a cliff you will fall to your death. So far so good; I am fairly certain there is nothing controversial about these claims.

In practice, we recognise truth in a somewhat positive way. Which is not to take the extreme position that only empirical claims are “cognitively meaningful”; I’m a moral non-realist yet I see meaning in sentences such as “it is (ceteris paribus) wrong to murder people” even if there is no clear or direct empirical verification of the “wrongness” predicate (I mostly see predicates like “wrong” as two-subject predicates, one being the action itself and another being a given moral theory).

But in any case, as I was saying, in practice we use the concept of truth in a positive way. We propose truth based on evidence, and we defend truth based on expectation. A hypothesis is true if, of all mutually exclusive hypotheses, it leads us to expect reality, if it gives us predictions that turn out to be true, if it gives the most probability to what actually happened and will actually happen, if believing it’s true causes you to be less surprised about what you see than otherwise.

This may not be an immediately intuitive definition of “truth.” It’s almost certainly not the first thing most people think of, when they think of truth. But I think it sounds like a reasonable description. If, of literally all possible hypotheses, you have a given one that predicts your observations best, then you probably use that.

Except… not quite? Let’s talk probability (of course).

We’ll take a simple toy model: an urn, with black and white balls in it. You can only draw one ball at a time, and put it in a line, which you can always look at. Except it’s a magic urn: its contents aren’t necessarily fully determined a priori, and the urn might give you a different ball depending on what your line of previously drawn balls looks like. Let’s assume without loss of generality that the urn uses a single rule to determine what ball it gives you, always.

The urn and its rule are reality, the possible rules you think of are your hypotheses about it, and the balls are observations, if the metaphor wasn’t clear. And even further, this example is actually isomorphic to reality, though the proof is left as an exercise to the reader.

Before you start drawing from the urn, your line is absolutely empty. You have no idea what the rule may be, you’re maximally ignorant about reality, you have no reason to expect your first ball to be one colour or the other. The hypothesis “all balls are white” is as probable as “all balls are black,” the hypothesis “X\% of balls are white” is in fact as probable as the hypothesis “Y\% of balls are white” for any X and Y. The true rule might be “the first ball is black, all others are white,” but it might be “the first ball is white, all others are black.” A priori, P(\text{B}) = 0.5.

Now suppose you’ve observed the sequence \text{BWBWBWBWBWBWB}. Intuitively, I think it’s reasonable to expect that the next ball will be white. For the hypothesis H_0: “balls alternate between black and white,” the probability that the next ball will be white is exactly 1. In fact, the probability that we would observe exactly this sequence in the first 13 draws is 1, which happens to be the maximum number a probability can be, so H_0 obeys that intuitive condition for truth.

But wait! What about the hypothesis H_1: “balls alternate between black and white, except the 14^{th} ball is black”? It also gives probability 1 to that first sequence, but the probability that the next ball will be white is 0.

Clearly the definition is lacking. In fact, in hindsight it’s quite obvious why: for any finite sequence of observations, there is a hypothesis that says that sequence had to have been observed. Just giving a very high probability to our observations is not a sufficient condition for truth. In fact, if the true rule happens to be H_2: “there is always an exactly 50\% chance that a ball will be white,” that hypothesis won’t even give our observations the highest probability of the bunch – a meager 2^{-13} compared to the confident H_0 and H_1.
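To make those numbers concrete, here is a minimal Python sketch of the three rules so far. The encoding is my own assumption, not anything canonical: each rule is a function that takes the line drawn so far and returns the probability that the next ball is white, and the names `h0`, `h1`, `h2` and `likelihood` are just illustrative.

```python
from fractions import Fraction

def h0(line):
    """H_0: balls alternate, starting with black. Returns P(next ball is white)."""
    return Fraction(len(line) % 2)

def h1(line):
    """H_1: like H_0, except the 14th ball (after 13 draws) is black."""
    if len(line) == 13:
        return Fraction(0)
    return h0(line)

def h2(line):
    """H_2: every ball is white with probability exactly 1/2."""
    return Fraction(1, 2)

def likelihood(rule, line):
    """Probability that `rule` assigns to the whole observed line."""
    p = Fraction(1)
    for i, ball in enumerate(line):
        p_white = rule(line[:i])  # the rule only sees the earlier draws
        p *= p_white if ball == "W" else 1 - p_white
    return p

observed = "BWBWBWBWBWBWB"  # the 13 draws from the text

for name, rule in [("H_0", h0), ("H_1", h1), ("H_2", h2)]:
    print(name, "likelihood:", likelihood(rule, observed),
          "| P(next is white):", rule(observed))
```

H_0 and H_1 both give the observed line likelihood 1 yet disagree completely about the 14th ball, while H_2 gives it only 1/8192.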

“But wait,” you cry, “surely after enough balls have been drawn we will find the true rule?”

That’s closer to the mark. We do, after all, need to reconcile the intuitive notion that the truth is exactly that which gives us the best predictions with the fact that any number of hypotheses can also boast that claim.

It’s still not quite correct, though. Suppose we’re comparing hypothesis H_2, which I am now telling you from outside this toy problem is the correct one, to hypothesis H_3: “the first 13 draws will deterministically alternate between black and white, then all further draws will have an exactly 50\% chance of black and white.” Not only will this hypothesis fit all observed data forever, the likelihood of any valid line (given we’ve already observed the above list) will be 2^{13} times greater than under the true hypothesis H_2. What right do I have to call H_2 true, then, as opposed to H_3?

To be quite honest, very little. The practical difference is, of course, zero: once I’ve already observed those 13 first draws, it’s all the same to me. But let’s suppose I have a philosophical bone to pick with this indifference. I want to know what’s really real goddamnit.

Then you’ll have to excuse my using a philosophical argument. I simply prefer simpler hypotheses, as a matter of consistency and personal taste. The principle of indifference applies to H_3: there are 2^{13} possible rules of the form “the first 13 draws are like thus, then every subsequent draw is random,” with no a priori reason to believe any one of them is more likely than any other. So if they’re all equally likely, then a priori H_3 is at least 2^{13} times less likely than H_2, and it all balances out (it’s actually less likely than even that, because H_3 says the first 13 draws are deterministic, so that’s also extra information).
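To spell out that balancing act as a sketch: write y for the 13 observed draws, and assume (as the indifference argument suggests) that the prior odds of H_3 against H_2 are at most 2^{-13}. Then the posterior odds work out to

\frac{P(H_3|y)}{P(H_2|y)} = \frac{P(y|H_3)}{P(y|H_2)} \cdot \frac{P(H_3)}{P(H_2)} \leq \frac{1}{2^{-13}} \cdot 2^{-13} = 1

so after those 13 draws H_3 is at best on an equal footing with H_2, despite its 2^{13}-fold advantage in likelihood.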

Truth, then, seems to be built out of three things: good predictions, long-term validation, and a philosophical preference for simplicity.

The last of these conditions is about how to build your a priori beliefs, which Bayesians call prior probabilities (or just priors). From the practical standpoint I want to adopt, though, I don’t really care whether H_2 or H_3 is true: on compatible sets of observations, the likelihood ratio between these two hypotheses for every further draw will be 1, and it will be impossible even in principle to differentiate them. Henceforth, I’ll just reduce my hypothesis-space and treat all hypotheses that fit my observed data equally well and make exactly the same predictions from now on as one hypothesis. I’ll call the remaining hypotheses post-reduction hypotheses.

The other two conditions are a bit more interesting to me. Can I guarantee that I will eventually arrive at the true hypothesis? If I collect enough data, am I going to necessarily believe the truth with high confidence? In more precise terms, will a large enough number of observations render my confidence in the true hypothesis independent of what I believed about it a priori?

By our threefold definition of truth, yes. Let’s look at the posterior odds ratio between any two given hypotheses. If we call y our line of balls drawn and X our background knowledge:

\frac{P(H_i|y, X)}{P(H_j|y, X)} = \frac{P(y|H_i, X)}{P(y|H_j, X)} \frac{P(H_i|X)}{P(H_j|X)}

The prior odds, \frac{P(H_i|X)}{P(H_j|X)}, are a constant that represents how likely we believe one hypothesis is in relation to another before we draw any balls from the urn. That’s where the simplicity condition of Occam’s Razor is encoded.
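As a quick illustration of the odds form, here is a small Python sketch that plugs in the likelihoods of the 13 observed draws under H_0 (strict alternation, likelihood 1) and H_2 (fair draws, likelihood 2^{-13}). The function name `posterior_odds` and the two example priors are my own assumptions, picked only to show how the prior scales the answer.

```python
from fractions import Fraction

def posterior_odds(prior_odds, likelihood_i, likelihood_j):
    """Posterior odds of H_i over H_j: likelihood ratio times prior odds.

    Assumes likelihood_j is non-zero, i.e. H_j has not been ruled out.
    """
    return Fraction(likelihood_i) / Fraction(likelihood_j) * Fraction(prior_odds)

# Likelihoods of the 13 observed draws under each hypothesis.
lik_h0 = Fraction(1)          # H_0: strict alternation
lik_h2 = Fraction(1, 2**13)   # H_2: each draw is 50/50

# Hypothetical priors: even odds, and 1000-to-1 against H_0.
even_prior = posterior_odds(1, lik_h0, lik_h2)
sceptical_prior = posterior_odds(Fraction(1, 1000), lik_h0, lik_h2)

print("even prior odds      ->", even_prior)
print("1:1000 against H_0   ->", sceptical_prior, "=", float(sceptical_prior))
```

Even starting at 1000 to 1 against it, H_0 comes out ahead of H_2 after just these 13 draws, since the likelihood ratio of 2^{13} outweighs the prior.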

Then, we know that the likelihood P(y|H,X) is non-increasing in n, the number of balls drawn so far. That’s because each new observation multiplies its previous value by some number between 0 and 1. So, as n gets arbitrarily large, the likelihood ratio \frac{P(y|H_i, X)}{P(y|H_j, X)} can either get arbitrarily large, get arbitrarily small, or stay bounded (maybe after growing and/or shrinking for the first m draws).

If it stays bounded, then we’re looking at a situation such as the one described above between H_2 and H_3, and we can just conflate these hypotheses as a single thing. So let’s only look at genuinely distinct hypotheses, the post-reduction ones.

Given that we’ve defined a true hypothesis as one that asymptotically gives high probability to our actual observations, as we draw more and more balls, the likelihood ratio will get arbitrarily large if the true hypothesis is in the numerator. Therefore, regardless of what our prior for it was, as long as it was not exactly 0, we will end up believing the true rule with very high confidence. And since we all know that 0 and 1 aren’t probabilities anyway (a condition called Cromwell’s rule), any reasonable prior beliefs will be washed out by enough evidence.
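Here is a rough simulation of that washing-out, under assumptions that are entirely mine: the true rule is “each ball is white with probability 0.7, independently,” the rival is the fair 50/50 rule, and the prior odds start at a million to one against the truth. The function name and the fixed seed are also just illustrative choices.

```python
import math
import random

def log_posterior_odds_trace(p_true, p_rival, prior_odds, n_draws, seed=0):
    """Track log posterior odds (true rule over rival) as balls are drawn.

    Both rules are assumed to be 'white with a fixed probability' rules,
    and the balls are sampled from the true one.
    """
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    log_odds = math.log(prior_odds)
    trace = [log_odds]
    for _ in range(n_draws):
        if rng.random() < p_true:   # drew a white ball
            log_odds += math.log(p_true / p_rival)
        else:                       # drew a black ball
            log_odds += math.log((1 - p_true) / (1 - p_rival))
        trace.append(log_odds)
    return trace

trace = log_posterior_odds_trace(p_true=0.7, p_rival=0.5,
                                 prior_odds=1e-6, n_draws=2000)

# First draw count at which the true rule becomes more probable than not.
crossover = next((n for n, lo in enumerate(trace) if lo > 0), None)
print("log odds before any draws:", trace[0])
print("log odds after 2000 draws:", trace[-1])
print("draws needed to reach even odds:", crossover)
```

The prior only shifts the starting point of the trace, while each draw adds an increment that favours the true rule on average, so a more hostile prior delays the crossover but does not prevent it.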

(A corollary is that an unreasonable prior might take arbitrarily long to be washed out. But such is life.)

Bayesianism isn’t universally accepted. This may come as no surprise to you. There exist some theoretical objections to it, from the murkiness of the “common sense” axiom to the notion that we’d need only one number to represent uncertainty, but it seems to me that they’re mistaken. Philosophically, Bayesianism looks pretty sound.

However, in practice

For any prior probability that follows Cromwell’s rule, after enough observations have been made the truth will come out. Nice and dandy. But there are two pesky problems with this: one, there’s absolutely no way to know, even in principle, how many observations you’ll need to make in order to find out the truth; two, in reality, using a prior that doesn’t violate Cromwell’s rule is not really feasible.

Every good Bayesian doing a numerical estimate would say a string of probabilities and then end with “…and a 5% (or 1%, or 0.1%) probability it’s something I haven’t thought of.” This is meant to cover, well, everything they haven’t thought of. Except there’s an infinity of hypotheses they haven’t thought of! And without knowing them, without explicitly working them out, we cannot, in fact, discover they’re true.

Case in point: General Relativity. When Einstein suggested it, the evidence was overwhelmingly in favour of it. It beat every other alternative by such a large margin that the physicist uttered his (in?)famous phrase, “Then I would feel sorry for the good Lord. The theory is correct anyway,” when asked what he would’ve done if Sir Arthur Eddington’s expeditions failed to confirm his predictions. But before anyone had suggested G.R., it was there, inside the “5% probability it’s something I haven’t thought of.” It would’ve remained there until someone suggested it.

Bayesianism is not a good model of how we actually do science. Solomonoff Induction is literally uncomputable, and to the extent it’s the optimal way of performing inference, the problem of induction is unsolvable in practice. Bayes’ Theorem is consistent, aye, but only God and Laplace’s Demon can do it.

In practice…

(The whole recurring theme of this post was practice.)

The strength of a hypothesis can only be measured against other hypotheses. Bayesianism is not even an approximately good model of how we actually do science, even if it’s a good ideal for what a perfect uncomputable God would do. Approximating Bayesian inference is not guaranteed to give us an approximately optimal answer. Adding more hypotheses to the mix does not necessarily make our final answer proportionally more accurate. The goal’s unreachable, and the path to it is broken, full of steps, turns, cliffs, and non-Euclidean loops.

In practice, Bayesianism can only be good to figure out which amongst the hypotheses considered is backed up by the evidence. But it can’t tell you which hypothesis is actually true.


3 Responses to Truth, Probability, and Unachievable Consistency

  1. 1Z says:

    “Case in point: General Relativity. When Einstein suggested it, the evidence was overwhelmingly in favour of it. It beat every other alternative by such a large margin that the physicist uttered his (in?)famous phrase, “Then I would feel sorry for the good Lord. The theory is correct anyway,” when asked what he would’ve done if Sir Arthur Eddington’s expeditions failed to confirm his predictions. But before anyone had suggested G.R., it was there, inside the “5% probability it’s something I haven’t thought of.” It would’ve remained there until someone suggested it.”

    Well, no, there was exactly one piece of evidence GR was able to explain on its introduction.

    “At its introduction in 1915, the general theory of relativity did not have a solid empirical foundation. It was known that it correctly accounted for the “anomalous” precession of the perihelion of Mercury and on philosophical grounds it was considered satisfying that it was able to unify Newton’s law of universal gravitation with special relativity. That light appeared to bend in gravitational fields in line with the predictions of general relativity was found in 1919 but it was not until a program of precision tests was started in 1959 that the various predictions of general relativity were tested to any further degree of accuracy in the weak gravitational field limit, severely limiting possible deviations from the theory”

    The “Good Lord” comment probably relates to the aesthetics of the theory.

    • “…on philosophical grounds it was considered satisfying that it was able to unify Newton’s law of universal gravitation with special relativity.”
      This is a lot of evidence in favor of the theory, though. It explained all the empirical observations that Newtonian gravity explained, plus all the observations that special relativity explained.

      You can call unification a philosophical or aesthetic goal. But you’ve got to recognize that coming up with simple unifying theories works, and the reason it works is that such theories have huge bodies of evidence for them upon their introduction: the sum of all the evidence from the theories they unify.

