## Other ways of looking at probability

Bayes’ Theorem is all nice and dandy, but it may not necessarily be the best thing to work with. It’s quite simple:

$P(A|BX) = \frac{P(B|AX)}{P(B|X)}P(A|X)$

When laid out this way, what it says is that the probability of some proposition $A$ after you learn that $B$ is true equals its probability before you knew $B$ was true times a factor we could call the impact of $B$ on $A$:

$\text{Impact}(A;B) = \frac{P(B|AX)}{P(B|X)}$

When explaining what the Bayesian meaning of evidence was, I mentioned the obvious point that if $B$ is more likely when $A$ is true than it is overall, so that $\text{Impact}(A;B) > 1$, then $B$ is evidence for $A$; if it is less likely, it’s evidence against $A$.

However, we can look at it another way. If we expand $P(B|X)$ over a large set of mutually exclusive and exhaustive propositions $\{A_0, A_1, A_2, ...\}$, then we can re-express the theorem as

$P(A_0|BX) = \frac{P(B|A_0X)P(A_0|X)}{\sum_{i}P(B|A_iX)P(A_i|X)}$
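As a minimal sketch of that normalisation, assuming a finite list of hypotheses, the snippet below computes every posterior at once. The function name `posterior` and the priors and likelihoods are made up for illustration; they don't come from the text or from any library.

```python
def posterior(priors, likelihoods):
    """Return P(A_i | B X) for each i, given P(A_i | X) and P(B | A_i X)."""
    # Numerator of Bayes' Theorem for each hypothesis.
    joint = [p * l for p, l in zip(priors, likelihoods)]
    # Denominator: the sum over all hypotheses, i.e. P(B | X).
    normaliser = sum(joint)
    return [j / normaliser for j in joint]


# Illustrative numbers only: three mutually exclusive, exhaustive hypotheses.
priors = [0.5, 0.3, 0.2]
likelihoods = [0.1, 0.4, 0.7]
post = posterior(priors, likelihoods)
print(post)
print(sum(post))  # the normaliser guarantees this is 1 (up to rounding)
```

The denominator is the same for every hypothesis, so it never changes which hypothesis comes out ahead; only the products of prior and likelihood do.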

When you look at it like that, then the denominator becomes nothing but a normalising factor meant to guarantee that your probabilities sum to 1, and it’s $P(B|A_0X)$ which does all the hard work. And then that term gets two different names depending on which way you look at it.

To explain it, let’s suppose $A$ is some hypothesis we’re studying and $B$ is the data collected. Let’s rename them $H$ and $D$ accordingly. Rewriting the theorem:

$P(H_0|DX) = \frac{P(D|H_0X)P(H_0|X)}{\sum_iP(D|H_iX)P(H_i|X)}$

The term $P(D|H_iX)$, when considered as a function of the data $D$ for a fixed hypothesis $H_i$, is generally called the sampling distribution. For example, suppose the data is a string of results of a binary experiment and the hypothesis is that these results are in fact Bernoulli trials – which means that the probability that each trial will come out one way or another depends only on the specific way it can turn out, and not on previous trials. An example would be the tossing of a coin, and a possible hypothesis is that the coin turns up heads with probability $p_i$ and tails with probability $q_i = 1 - p_i$. Then, if we suppose that data $D$ consists of a series of outcomes (for instance, HTTHH would mean there was one head, followed by two tails, followed by two heads), the probability that that data $D$ would be observed under hypothesis $H_i$ is $P(D|H_iX) = p_i^{n_H}q_i^{n_T}$ where $n_H$ is the number of observed heads and $n_T$ is the number of observed tails. Another way to write it would be $P(D|H_iX) = p_i^{n}(1-p_i)^{N-n}$ where $n$ is the number of heads and $N$ is the total number of tosses. So, for any given hypothesis $H_i$, the sampling distribution is a function $f(D;H_i) = p_i^{n}(1-p_i)^{N-n}$ of $n$ and $N$ while holding $p_i$ fixed.
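To make the sampling distribution concrete, here is a small sketch that evaluates $p_i^{n}(1-p_i)^{N-n}$ for a few strings of tosses under one fixed hypothesis. The value $p_i = 0.6$ and the helper name `sequence_probability` are arbitrary choices for illustration.

```python
from itertools import product


def sequence_probability(outcomes, p):
    """P(D | H_i X) for one specific ordered string of 'H'/'T' tosses."""
    n_heads = outcomes.count("H")
    n_tails = outcomes.count("T")
    return p ** n_heads * (1 - p) ** n_tails


p_fixed = 0.6  # the hypothesis is held fixed; only the data varies
for data in ["HTTHH", "HHHHH", "TTTTT"]:
    print(data, sequence_probability(data, p_fixed))

# Summed over every possible string of 5 tosses, the values add up to 1,
# which is what makes this a probability distribution over the data.
total = sum(sequence_probability("".join(s), p_fixed)
            for s in product("HT", repeat=5))
print(total)
```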

However, if you look at $P(D|H_iX)$ as a function of the hypothesis for some fixed dataset $D$, then it’s called the likelihood $\mathscr L(H_i;D)$, and although it’s numerically equal to the sampling distribution, it’s a function on the parameter space $\{p_0, p_1, p_2, ...\}$ while holding $n$ and $N$ fixed. Unlike the sampling distribution, it’s not seen as a probability, but rather a numerical function that, when multiplied by some prior and a normalisation factor, becomes a probability. Because of that, factors that are the same for every hypothesis are irrelevant, and any function $\mathscr L(H_i;D) = y(D)P(D|H_iX)$ is equally deserving of being the likelihood, where $y(D)$ is a function exclusively of the data and independent of the hypotheses under consideration.
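The sketch below illustrates both points under assumed numbers: the data is held fixed at 3 heads in 5 tosses while $p_i$ varies over a hypothetical grid of three values with a uniform prior. It uses the binomial coefficient as one possible choice of $y(D)$ and checks that the posterior doesn't change.

```python
from math import comb


def likelihood(p, n, N):
    """Likelihood of the hypothesis 'heads with probability p' given n heads in N tosses."""
    return p ** n * (1 - p) ** (N - n)


def normalise(weights):
    total = sum(weights)
    return [w / total for w in weights]


n, N = 3, 5                   # fixed data, e.g. HTTHH
ps = [0.25, 0.5, 0.75]        # hypothetical grid of hypotheses
prior = [1 / 3, 1 / 3, 1 / 3]

# The likelihood values over the hypotheses need not sum to 1.
values = [likelihood(p, n, N) for p in ps]
print(values, sum(values))

# Scaling by y(D) = C(N, n), which depends on the data only, leaves the posterior unchanged.
y = comb(N, n)
post = normalise([pr * v for pr, v in zip(prior, values)])
post_scaled = normalise([pr * y * v for pr, v in zip(prior, values)])
print(post)
print(post_scaled)
```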

Now, if you take the ratio of $P(H|DX)$ and $P(\bar H|DX)$ you get

$\frac{P(H|DX)}{P(\bar H|DX)}=\frac{P(H|X)}{P(\bar H|X)}\frac{P(D|HX)}{P( D|\bar HX)}$

and the prior for $D$ drops out. We call that ratio the odds on the proposition $H$, $O(H|DX)\equiv\frac{P(H|DX)}{P(\bar H|DX)}$, and combining both equations we have:

$O(H|DX) = O(H|X)\frac{P(D|HX)}{P(D|\bar HX)}$

This form has a very nice intuitive meaning, and it’s better for calculating Bayesian updates. For instance, if something has probability 0.5, then it has odds 0.5:0.5 or 1:1 (read one-to-one). If something has probability 0.9, then it has odds 9:1 (nine-to-one) and we know immediately that it’s nine times more likely to be true than to be false. Now, since $P(A) = 1-P(\bar A)$, the odds transformation is just $O = \frac{P}{1-P}$, and if I have the odds, I can transform them back to probabilities by using $P = \frac{O}{1+O}$.
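The two transformations are one-liners. A minimal sketch, with function names chosen here for illustration:

```python
def prob_to_odds(p):
    """O = P / (1 - P), for a probability strictly below 1."""
    return p / (1 - p)


def odds_to_prob(o):
    """P = O / (1 + O), the inverse transformation."""
    return o / (1 + o)


for p in [0.5, 0.9, 0.3]:
    o = prob_to_odds(p)
    # Converting back recovers the original probability (up to rounding).
    print(p, o, odds_to_prob(o))
```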

The last factor in the odds form of the theorem, $\frac{P(D|HX)}{P(D|\bar HX)}$, is called the likelihood ratio. Using Yudkowsky’s example:

Let’s say that I roll a six-sided die: If any face except 1 comes up, there’s a 10% chance of hearing a bell, but if the face 1 comes up, there’s a 20% chance of hearing the bell. Now I roll the die, and hear a bell. What are the odds that the face showing is 1? Well, the prior odds are 1:5 (corresponding to the real number 1/5 = 0.20) and the likelihood ratio is 0.2:0.1 (corresponding to the real number 2) and I can just multiply these two together to get the posterior odds 2:5 (corresponding to the real number 2/5 or 0.40). Then I convert back into a probability, if I like, and get (0.4 / 1.4) = 2/7 = ~29%.
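The arithmetic in that example can be checked with exact fractions. This is only a sketch of the calculation quoted above, with variable names chosen for readability:

```python
from fractions import Fraction

prior_odds = Fraction(1, 5)                                # face 1 versus any other face
likelihood_ratio = Fraction(20, 100) / Fraction(10, 100)   # P(bell | 1) / P(bell | not 1)

posterior_odds = prior_odds * likelihood_ratio
posterior_prob = posterior_odds / (1 + posterior_odds)

print(posterior_odds)          # 2/5
print(posterior_prob)          # 2/7
print(float(posterior_prob))   # roughly 0.29
```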

Furthermore, if you have more than two hypotheses at stake (suppose if face 1 comes up there’s a 20% chance of hearing the bell, if face 2 comes up there’s a 5% chance of hearing the bell, and if any other face does there’s a 10% chance of hearing the bell), then you can use extended odds to calculate your posteriors. In this case, before you throw the die, the prior odds are 1:1:4, and the extended likelihood ratio for hearing a bell is 0.2:0.05:0.1 or 4:1:2. If you throw the die and hear a bell, your posterior odds will be 4:1:8, and your posterior probabilities will be 4/13 = 30.77% for face 1, 1/13 = 7.69% for face 2, and 8/13 = 61.54% for any other face. This is much easier than using Bayes’ Theorem directly.

And a final way to look at probabilities is by using the evidence function, $e(H|DX) \equiv 10\log_{10}O(H|DX)$, which is measured in decibels and cashes out to:

$e(H|DX) = e(H|X) + 10\log_{10}\left[\frac{P(D|HX)}{P(D|\bar HX)}\right]$

Now, while this doesn’t keep the niceness of odds when dealing with more than two hypotheses, it has a few other advantages of perspective. The first is that it’s additive: if a given hypothesis has prior probability 0.01, or prior evidence of about -20dB, and you observe evidence that’s 1,000 times more likely when that hypothesis is true than when it’s false, the evidence shift is $10\log_{10}1000 = 30\text{dB}$ and the posterior evidence is $-20\text{dB}+30\text{dB}=10\text{dB}$, which is a posterior probability of about 0.91. As new pieces of evidence come in, we just add or subtract their decibels from the evidence collected so far to arrive at our final conclusions.
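The same update can be sketched in code, where it becomes a plain addition. The helper names `prob_to_db` and `db_to_prob` are illustrative; the numbers are the ones from the example above, without rounding the prior to exactly -20dB.

```python
import math


def prob_to_db(p):
    """Evidence in decibels: 10 * log10 of the odds."""
    return 10 * math.log10(p / (1 - p))


def db_to_prob(e):
    """Invert the evidence function: decibels -> odds -> probability."""
    odds = 10 ** (e / 10)
    return odds / (1 + odds)


prior_db = prob_to_db(0.01)          # close to -20 dB
shift_db = 10 * math.log10(1000)     # 30 dB for a likelihood ratio of 1,000
posterior_db = prior_db + shift_db   # updating is just addition
posterior = db_to_prob(posterior_db)

print(prior_db, shift_db, posterior_db, posterior)
```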

The second niceness is the one that shows that 0 and 1 are not probabilities. How much evidence would it take to raise a hypothesis to certainty?

$e(H|X) = 10\log_{10}\frac{P(H|X)}{P(\bar H|X)} = 10\log_{10}\frac 1 0 = +\infty$

And the symmetric argument shows that the evidence needed for negative certainty is also infinite. Just as positive and negative infinity aren’t real numbers but representations of the boundlessness of the real numbers, 0 and 1 are just representations of the extreme limits of perfect Platonic certainty.

And the final interesting perspective is best seen by looking at a graph of evidence (in dB) against probability.

To get from 10% probability to 90% probability, you just need about 19dB of evidence. Well, “just” 19dB means evidence that’s about 81 times more likely when your hypothesis is true than otherwise. But look at the extremes of that graph. There are two singularities. The one close to 1 shows that, once you are very sure of your hypothesis, evidence can change your mind only very little. The distance between 0.5 and 0.6 in terms of evidence is much, much less than the distance between 0.999 and 0.9999.
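For readers who want the numbers behind that curve, the following sketch tabulates the evidence function at a handful of probabilities and compares the two gaps mentioned above. The chosen probabilities and the name `evidence_db` are illustrative.

```python
import math


def evidence_db(p):
    """Evidence in decibels for a probability p strictly between 0 and 1."""
    return 10 * math.log10(p / (1 - p))


probabilities = [0.0001, 0.001, 0.1, 0.5, 0.6, 0.9, 0.999, 0.9999]
table = [(p, evidence_db(p)) for p in probabilities]
for p, e in table:
    print(f"P = {p:<7} e = {e:+6.1f} dB")

# Evidence needed to move between pairs of probabilities.
gap_10_to_90 = evidence_db(0.9) - evidence_db(0.1)
gap_middle = evidence_db(0.6) - evidence_db(0.5)
gap_extreme = evidence_db(0.9999) - evidence_db(0.999)
print(gap_10_to_90, gap_middle, gap_extreme)
```

The curve is symmetric around 0.5, and each extra "nine" near 1 (or extra zero near 0) costs roughly another 10dB.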

The other singularity, however, shows one of the most important things in probability theory: the vast majority of the work needed when proving your hypothesis is in just figuring out which hypothesis is the right one. Think about a phenomenon you want to explain, like gravity. Think about all the possible hypotheses that could explain it. Little angels pushing massive stuff. A witch did it. Invisible ropes tied around every particle. A force. The curvature of spacetime. Take that last hypothesis alone. Saying “spacetime curves” is still a very general statement. How does it curve? What’s the degree of curvature? There’s a huge number of possible equations that could describe the geometry of spacetime under the influence of mass.

Hypothesis-space is gigantic; it’s so big you can’t even imagine it. The amount of evidence you need just to point somewhere in it, just to say “the right hypothesis looks like this” or “the right hypothesis is in this area,” is astoundingly huge. Do you see that? Once you’ve actually found the correct hypothesis, the work necessary to become reasonably sure of it is negligible by comparison.

When asked what he would’ve said if, in 1919, Sir Arthur Eddington had failed to confirm General Relativity, Einstein famously replied, “Then I would feel sorry for the good Lord. The theory is correct.” But that’s not just arrogance (27 bits ≈ 80dB of evidence, to translate Yudkowsky’s notation to ours). Just to pinpoint those particular equations to describe Nature, Einstein needed a lot of work. And he probably had much more evidence than strictly required, too, because humans often underestimate the probability shift they should undergo under new evidence and are very inefficient.

Just to find the right answer, you already need to work your butt off. Otherwise, you’re jumping to an 80dB conclusion with much less than 80dB of evidence. The vast majority of the work is there. Incidentally, that’s why it’s so hard to find the elusive Theory of Everything. Just to pinpoint the correct equations we need a lot of work.

After you’ve found the correct hypothesis, confirming it is comparatively a piece of cake.