## Occam’s Razor

Numquam ponenda est pluralitas sine necessitate.
William of Ockham (c. 1287-1347)

The more famous sentence attributed to William of Ockham, “Entia non sunt multiplicanda praeter necessitatem,” translated as “Entities must not be multiplied beyond necessity,” is absent from his works, and owes more to John Punch (1603-1661). Ockham himself only said the quoted sentence, “Plurality must never be posited without necessity,” and the philosophy of doing away with unnecessary details is a characteristic of his work, not exactly a specific sentence he ever uttered. And then, this is only a sort of philosophical/intuitive stance, without a specific clear prescriptive definition. What is plurality? What does that have to do with anything? What is this blasted “razor” and what is it cutting?

More modern phrasings of the razor say that “[t]he simplest explanation that fits the facts” must be the true one. As the W puts it, “[i]t states that among competing hypotheses, the one with the fewest assumptions should be selected. Other, more complicated solutions may ultimately prove correct, but—in the absence of certainty—the fewer assumptions that are made, the better.” Okay, that’s a little bit better than “entities must not be multiplied” or “plurality must never be posited,” but it still only gives me a more-or-less intuitive and not precise grasp on it. Which is totally fine, by the way! Most of the time, you’re going to use Occam’s Razor in a strictly intuitive way, simply because it’s fairly obvious: if you look at your socks and notice they’re white, the intuitively simpler conclusion is that your socks are in fact white, and not that an alien is messing with your head and making you see your blue socks as white. The obvious answer isn’t always the correct one, but by golly, sometimes it is!

But things aren’t always this clear-cut, and I also have the problem that I want to design an AI that reasons correctly, which means that having “intuitive grasps” on things isn’t going to cut it (pun intended). And in the end, as Robert Heinlein once pointed out, the simplest explanation is always “The lady down the street is a witch; she did it.” And I hope that you won’t start positing witches, now, thinking that Occam’s Razor justifies it!

So, as Yudkowsky put it, “One observes that the length of an English sentence is not a good way to measure ‘complexity’. And ‘fitting’ the facts by merely failing to prohibit them is insufficient.” And as he explained in the linked article, the reason why the length of an English sentence is not a good way to measure “complexity” is because in English we use many labels for complex concepts. “Witch” is a vastly complex entity that requires so much cultural background that the only reason we have a word for it is because… well, many cultures share that background! It’s common to find tales of superpowered humans that cast spells and sorts on people all around the world, so it doesn’t feel that complicated because that’s an easily available concept.

Furthermore, the word “it” itself hides a lot of complexity. Look at the following sequences: 010101010101010101010101010101010101010101 and 00101000101010101011010. The former has 42 characters, the latter has 23. Which one is the “simplest”?

Clearly the first one. It’s just “01 repeated 21 times.” Even that English sentence has only 20 characters (21 if you count the full stop at the end), which makes it shorter than the second sequence. And does the second sequence have any compression like that? Not as far as I know, since I just pressed the numbers 0 and 1 randomly. The first sequence, in spite of being longer, carries less information than the second one. The second, in order to be expressed, cannot be compressed, it has to be reproduced entirely. To transmit that message to anyone you need to send the entire string. The same thing happens with “it” in the sentence “a witch did it”: unless the person you’re communicating with already knows what “it” is, you have to explain it to them, and so your entire message is strictly larger than just describing what happened.

And even intuition isn’t good enough sometimes. Intuitively, a hunter-gatherer who compares Thor the god of thunder with the Maxwell Equations will normally think Thor is much simpler. No complicated mathematical stuff. But that’s because, to a human, the entire complexity of the human brain, human emotions like anger and jealousy and love, that’s all hidden, that’s taken for granted. It’s easy and available to posit an agent feeling human emotions because to us, feeling emotions is simple. This still goes on in many debates where the Creationist will say that God is the simplest hypothesis there is “by definition.” That person lacks any consistent, formal, and precise definition of what “simple” and “complex” are, and is treating their own brain like a black box.

But then we arrive at an important point: what is the consistent, formal, and precise definition of “simple” and “complex”?

There are a few mathematically equivalent formulations of Occam’s Razor. The one preferred by AI researchers and people who reason with Bayes in general is called Solomonoff Induction. In it, you have a list of all possible computer programs, and to each of them you assign a prior probability proportional to the exponential of its length (I’ll explain this in a bit). These computer programs are programs that describe the universe, and with it, your observations. Then, as you make observations, you use Bayes’ Theorem to update and find which program is the most likely to correspond to reality. Because of the prior probability assignment, the shortest program that predicted all your observations will be the most likely one once you make them. Furthermore, the complexity/size of the program doesn’t actually change if you switch computers or languages: the Invariance Theorem guarantees that changing languages/computers changes the length of a program description at most by an additive constant term that depends on the languages and not on the programs.

Without loss of generality, every computer program can be specified as a binary string, a sequence of 0s and 1s – this is true because every computer program can be interpreted by a Universal Turing Machine and written as such a string, and as I mentioned above, every universal computer/language is equivalent to every other up to an additive constant that depends on the languages and not on the programs. And for any fixed length $L$, there are exactly $2^L$ binary strings of that length. For example, there are $2^3 = 8$ binary strings of length $3$: 000, 001, 010, 011, 100, 101, 110, 111. And since we’re trying to find something of a “universal prior distribution” over all possible computable hypotheses, we still haven’t collected any evidence to distinguish between these many different programs. So the principle of indifference applies: all programs of a given length $L$ have probability proportional to $2^{-L}$.

Now, why doesn’t this indifference argument make all possible programs equally likely? Why should shorter programs be more likely than longer ones?

The first and bad reason is a practical one: if you do that, your probabilities won’t converge to 1, so we choose a measure that we know will. However, that just begs the question of why choosing this measure instead of any other arbitrary one that could also be made to converge to 1.

Which brings us to the second reason: every bit you add to the length makes there be twice as many programs with that length, so it follows somewhat intuitively that they should be half as likely. And that’s because, well, while you ought to be indifferent between hypotheses 010 and 001, you shouldn’t be indifferent between hypotheses 010 and 0100: the latter has more information than the former. Every bit you add to a hypothesis makes it contain more information, so even a priori you’re not indifferent between programs of length $L$ and programs of length $L+1$. Between all programs of any given length $L$am in fact indifferent: for any Universal Turing Machine that runs program A when input with 001 and program B when input with 000, I can design a Universal Turing Machine that does the opposite. So, for any given length $L$, the actual string that determines the programs of that length is arbitrary, which makes them all interchangeable a priori, which means they should all have the same probability; on the other hand, it is in general not possible to exchange the binary string describing a program of length $L$ with the one that describes a program of length $M \neq L$ with Universal Turing Machines, so you lose the symmetry that allows the indifference argument to work.

(And you might be thinking: how do you normalise it? If there are $2^L$ programs of size $L$ each with probability $2^{-L}$ then all programs of size $L$ have together probability 1. That would be right, except that Turing Machine encodings are prefix-free, that is, if 011 is a program, 0110 and 0111 cannot be because 011 cannot be used as a prefix for any other programs. I will explain this better in a future post about Universal Turing Machines.)

Alright. So, why is this definition any better than any other? Why should we care about computer programs in some abstract machine?

Because there is a bijection between hypotheses and programs. Every hypothesis specifies a program that runs it, every program runs a hypothesis about the world. And hypotheses also obey the basic mathematical principle that every bit you add to a description makes there be twice as many hypotheses with that description size. Every bit in a description is another yes-or-no question that you answer about a hypothesis, and a 20-bit hypothesis is one that needs 20 yes-or-no questions to be answered in order to be specified. The principle of indifference applies here, because for every yes-or-no question you invent you could just as well have added a no-or-yes inverted question; what the actual string is doesn’t matter, so strings of the same size are equiprobable. But strings of different sizes answer different amounts of yes-or-no questions, so longer strings contain more information about the world, so need more evidence to be singled out.

So, you see, the complexity argument isn’t one that comes from simple intuition: it’s mathematical and precise. And when you have two hypotheses $H_0$ and $H_1$ that explain the observed data equally well (that is, $P(D|H_0X) = P(D|H_1X)$), what determines which hypothesis is favoured is its prior, according to Bayes’ Theorem:

$\frac{P(H_0|DX)}{P(H_1|DX)} = \frac{\frac{P(D|H_0X)P(H_0|X)}{P(D|X)}}{\frac{P(D|H_1X)P(H_1|X)}{P(D|X)}} = \frac{P(H_0|X)}{P(H_1|X)}$

That is, the relative odds between two hypotheses that explain equally well the observations remains unchanged and is determined by the prior. Under an Occamian prior that has been outlined above, the simplest hypothesis wins.