Propositional Logic

…is the basic form of logic whose atoms (or the basic structures upon which it acts) are propositions. A proposition is nothing more than a sentence in some language that states something. Within propositional logic, each sentence can have one of two different truth-values: True or False.

Propositional Logic comes from a more basic kind of logic called Aristotelian Logic. In Aristotelian Logic, you had only a few kinds of sentences and the conclusions that followed from them. You had:

  • Universal and affirmative sentences: All men are mortal;
  • Particular and affirmative sentences: Some men are mortal;
  • Universal and negative sentences: No gods are mortal;
  • Particular and negative sentences: Some gods are not mortal.

And then you also had singular sentences, such as “Socrates is a man.” And you applied logic in steps:

  1. All men are mortal;
  2. Socrates is a man;
  3. Socrates is mortal.

When following Aristotelian Logic, you can’t believe propositions (1) and (2) above and disbelieve proposition (3). It is a logical consequence of the first two.

Propositional Logic, then, is a structural form of Aristotelian Logic. Each proposition is represented by a symbol, and you operate on many propositions by using a number of operators. Because of the nature of this kind of logic, which will be clear soon, it is frequently called Propositional Calculus.

Propositional Logic is a system of reasoning. As such, it is not designed to, nor can it, tell you what is true. It cannot reliably distinguish truth from falsehood on its own. All it does is take the things you already believe (the axioms) and tell you what their consequences are. It is, of course, impossible to automatically infer all the consequences of everything you believe; there is no process that can compute all of them. However, logic does help you get from A to B, when A is an axiom.

Therefore, we start with axioms that we take to be true (or, conversely, take to be false) and repeatedly apply the rules of the system to arrive at conclusions we should also accept.

Let’s be more concrete. Suppose I have a proposition P = ‘It is raining outside.’ That is a proposition that can be either true or false at any given time; for now, its actual truth or falsehood does not matter. Suppose we also have the proposition Q = ‘The ground outside is wet.’ The propositions P and Q, then, are thus linked:

P\rightarrow Q

This is a proposition itself, and can be either true or false, depending on the particular P and Q we’re talking about. But what does it mean? It means (and is read) ‘if P then Q.’ The conditional is a logical connective which states that, whenever the left-hand proposition (P, in this case) is true, so is the right-hand proposition (Q, in this case).

In the example given, for these specific P and Q, the proposition P\rightarrow Q  is true: ‘If it is raining outside then the ground outside is wet.’

A common representation of truth and falsehood in Propositional Calculus is 1 and 0 respectively. And we sometimes represent the meaning of our propositions by something called truth-tables. But I’ll talk about those later. Let’s develop further our connectives and operators.

  • P\rightarrow Q : Conditional. If P then Q;
  • \bar P : Negation. This unary operation creates a proposition that’s true whenever P is false, and false whenever P is true;
  • P\land Q : Conjunction. This is a binary operation that’s true if and only if both P and Q are true;
  • P\lor Q : Disjunction. This is a binary operation that’s true if and only if at least one of P or Q is true. If both are true this is also true.
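
If you want to see these definitions in action, here is a minimal Python sketch (the function names are mine, nothing standard) that enumerates every combination of truth-values for P and Q and evaluates each of the four connectives just listed:

```python
from itertools import product

def implies(p, q):
    # P -> Q: false only when P is true and Q is false.
    return q if p else True

def negation(p):
    return not p

def conjunction(p, q):
    return p and q

def disjunction(p, q):
    return p or q

# Enumerate every combination of truth-values for P and Q.
for p, q in product([True, False], repeat=2):
    print(p, q, implies(p, q), negation(p), conjunction(p, q), disjunction(p, q))
```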

No other rules are required. In fact, I’ll soon show that all four of these rules can in fact be derived by repeated application of a single rule. And even some of these operations are equivalent amongst themselves.

Let’s get back to logical implication P\rightarrow Q . This proposition can be true regardless of the independent values of P or Q. However, it creates a logical constraint on the mutual values of P and Q. What do I mean by that?

\begin{array}{rl}  1. & P \rightarrow Q \\  2. & P \\  \hline  \therefore & Q  \end{array}

If I believe that P\rightarrow Q  is true and that P is also true, it is impossible for me to believe Q to be false. That is inconsistent. Therefore, we might create a truth-table for the proposition P\rightarrow Q  of which I spoke earlier.

\begin{array}{cc|c} P & Q & P\rightarrow Q \\ \hline 1 & 1 & 1 \\ 1 & 0 & 0 \\ 0 & 1 & 1 \\ 0 & 0 & 1 \end{array}

Can you see at a glance why this table has these values? If P is true and Q is true, then P\rightarrow Q is true, too. If P is true and Q is false, however, P\rightarrow Q  cannot be. I just said that: I cannot believe that P is true (‘It is raining outside.’) and Q is untrue (‘The ground outside is not wet.’) if I believe that P\rightarrow Q  is true (‘If it is raining outside then the ground outside is wet.’)

If P is false and Q is true, our proposition P\rightarrow Q can still be true. Think again about the example: it is trivially true that ‘If it is raining outside then the ground outside is wet.’ However, just because it isn’t raining outside doesn’t mean the ground outside can’t be wet: ‘if it is not raining outside I cannot affirm that the ground outside is or isn’t wet.’ Someone could have just thrown a bucket of water on the ground and then we’d have that Q is true but P is not. And finally, if both are false, P\rightarrow Q  is still true for the same reason.

So the only combination of truth-values for P and Q that makes P\rightarrow Q untrue is P true and Q false. Keep this result in mind, it will come up again soon.

The following simple truth table represents just the operation \bar P . Quite straightforward.

\begin{array}{c|c} P & \bar P \\ \hline 1 & 0 \\ 0 & 1 \end{array}

And then we have P\land Q  and P\lor Q :

\begin{array}{cc|c} P & Q & P\land Q \\ \hline 1 & 1 & 1 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 0 \end{array}

\begin{array}{cc|c} P & Q & P\lor Q \\ \hline 1 & 1 & 1 \\ 1 & 0 & 1 \\ 0 & 1 & 1 \\ 0 & 0 & 0 \end{array}

Hmmm… now hold on just a minute. In the above P\lor Q  there is exactly one combination that yields false: P and Q both being false. In the P\rightarrow Q  case, there is also exactly one combination that yields false: P true and Q false. And if we negate P before taking the disjunction, the single false row lands on exactly that combination. So P\rightarrow Q  has the exact same truth table as \bar P\lor Q :

\begin{array}{cc|cc} P & Q & \bar P\lor Q & P\rightarrow Q \\ \hline 1 & 1 & 1 & 1 \\ 1 & 0 & 0 & 0 \\ 0 & 1 & 1 & 1 \\ 0 & 0 & 1 & 1 \end{array}

Why is that? Because \bar P\lor Q  is true precisely in those cases where P being true forces Q to be true: if P is true and Q is false, \bar P\lor Q  can’t be true. Now what if we make a table for \bar Q\rightarrow \bar P ?

\begin{array}{cc|ccc} P & Q & \bar Q & \bar P & \bar Q\rightarrow \bar P \\ \hline 1 & 1 & 0 & 0 & 1 \\ 1 & 0 & 1 & 0 & 0 \\ 0 & 1 & 0 & 1 & 1 \\ 0 & 0 & 1 & 1 & 1 \end{array}

…again the same. So we have the following equivalences: (P\rightarrow Q)\leftrightarrow (\bar Q\rightarrow \bar P)\leftrightarrow (\bar P\lor Q) .

Now hold on a minute. What does \leftrightarrow  mean? It’s the symmetric version of \rightarrow : P\leftrightarrow Q  means that ‘If P then Q and if Q then P.’ It is a stronger assertion, which has the following truth table:

\begin{array}{cc|c} P & Q & P\leftrightarrow Q \\ \hline 1 & 1 & 1 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{array}

Anyway, I’m digressing here. What I mean is that those three different propositions have the same truth tables, and are therefore equivalent. This equivalence means that whenever one is used, any of the others can be used instead, without loss of meaning. Thus, as promised, that list of four rules can be reduced to a list of three, because \rightarrow  can be obtained from a combination of negation and \lor .
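
If writing the tables out by hand feels tedious, the equivalences can also be checked by brute-force enumeration. A small illustrative Python sketch (the implies helper is my own naming):

```python
from itertools import product

def implies(p, q):
    # False only when p is true and q is false.
    return q if p else True

for p, q in product([True, False], repeat=2):
    a = implies(p, q)          # P -> Q
    b = implies(not q, not p)  # not-Q -> not-P (contrapositive)
    c = (not p) or q           # not-P or Q
    assert a == b == c         # all three agree on every row
print("The three propositions agree on every assignment of truth-values.")
```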

—-

Okay, so far I’ve introduced the main rules and the truth tables. This is basic Propositional Logic. However, I haven’t actually given you any axioms in the system on which we can apply these rules.

Here’s the trick: there aren’t any. Other than the axioms which define the rules and link them with the truth-tables, Propositional Logic doesn’t deal with anything else. Rather, it asks you for your axioms and tells you what can be said about them.

And now I’m going to prove new rules which we can use as shortcuts when finding things out. For that, though, I’ll need a way of creating new truths based on the ones I already have. Hofstadter in GEB calls it a “fantasy.”

It goes like this: first, I assume something is an axiom. Then I apply the rules I already know. When I’m done, whatever conclusion I get will be a logical implication of the axiom I assumed. To use Hofstadter’s notation, all “fantasies” are going to start with a [ and end with a ]. Also, everything that’s true outside of a fantasy is also true inside it.

[

(1) P\land Q  : Axiom 1

(2) P : Definition of logical conjunction

]

(3) (P\land Q) \rightarrow P : Result of fantasy

This is a very simple example. First, I opened a fantasy. Then, I invented an axiom. What does it mean, though, to invent an axiom in this context? It means I take some proposition I like (in this case, the proposition I chose was P\land Q ) and pretend it’s true until the end of the fantasy.

After that, I used the definition of logical conjunction to say that P is true. After all, I assumed that P\land Q  is true, and the definition of logical conjunction says that P\land Q  is true if and only if both P and Q are true at the same time. So while I was inside that fantasy where I just assumed P\land Q , P was necessarily true.

Then I ended the fantasy. After I end the fantasy, I can take the axiom P\land Q  and the conclusion P and say that the axiom necessarily implies the conclusion. So the rule of fantasy is a way of creating new rules inside Propositional Logic. This rule I derived (P\land Q) \rightarrow P is called “Simplification.” It’s pretty obvious, really, but it’s just an example.

I’ll give another example of a nested fantasy.

[

(1) (P \rightarrow Q)\land(Q\rightarrow R) : Axiom 1

(2) (P \rightarrow Q)  : Simplification of Axiom 1

(3) (Q\rightarrow R)  : Simplification of Axiom 1

[

(4) P : Axiom 2

(5) Q : Consequence of (2) and Axiom 2

(6) R : Consequence of (3) and (5)

]

(7) P\rightarrow R  : Result of fantasy’s axiom (4) and conclusion (6)

]

(8) ((P \rightarrow Q)\land(Q\rightarrow R))\rightarrow (P\rightarrow R)  : Result of fantasy’s axiom (1) and conclusion (7)

Anything you prove with a fantasy is a theorem of Propositional Logic.
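
Anything derived this way should also show up as a tautology, i.e. a proposition that comes out true under every assignment of truth-values to its atoms. Here is a small Python sanity check of the two theorems derived above (Simplification and the chained-implication result); the helper names are mine:

```python
from itertools import product

def implies(p, q):
    return q if p else True

# (P and Q) -> P  (Simplification)
assert all(implies(p and q, p) for p, q in product([True, False], repeat=2))

# ((P -> Q) and (Q -> R)) -> (P -> R)  (the nested-fantasy result)
assert all(
    implies(implies(p, q) and implies(q, r), implies(p, r))
    for p, q, r in product([True, False], repeat=3)
)
print("Both derived propositions are tautologies.")
```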

—-

Now we have two methods of proof of theoremhood in Propositional Calculus: truth tables and fantasies. If two propositions have the same truth table, they are equivalent; if you can use a fantasy to get from one proposition to another, the latter is a logical implication of the former. So whenever you want to prove the equivalence of propositions you use truth tables, and whenever you want to prove the implications of a proposition you use a fantasy. Now let’s prove a few nice things.

First, let’s invent a proposition P\land Q . In that case \overline{(P\land Q)}  is its negation. Now let’s find the truth table.

\begin{array}{cc|cc} P & Q & P\land Q & \overline{(P\land Q)} \\ \hline 1 & 1 & 1 & 0 \\ 1 & 0 & 0 & 1 \\ 0 & 1 & 0 & 1 \\ 0 & 0 & 0 & 1 \end{array}

Now if you got the gist of it, you’ll see that the kind of binary truth table that has only one false value and three true ones is the kind of table logical disjunction (\lor ) produces. Therefore, \overline{(P\land Q)}  must be equivalent to some logical disjunction. And indeed, it is: you see, the proposition \overline{(P\land Q)}  is only true when either P or Q is false. It must be a logical disjunction between the negations of P and Q then!

\begin{array}{cc|ccc} P & Q & \bar P & \bar Q & \bar P\lor\bar Q \\ \hline 1 & 1 & 0 & 0 & 0 \\ 1 & 0 & 0 & 1 & 1 \\ 0 & 1 & 1 & 0 & 1 \\ 0 & 0 & 1 & 1 & 1 \end{array}

This is the first De Morgan’s Theorem: \overline{(P\land Q)} \leftrightarrow (\bar P\lor\bar Q) . I’ll leave it to you to prove the second De Morgan’s Theorem: \overline{(P\lor Q)} \leftrightarrow (\bar P\land\bar Q) .
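
Both De Morgan theorems are easy to confirm by enumerating the four rows; a minimal sketch:

```python
from itertools import product

for p, q in product([True, False], repeat=2):
    # First De Morgan: not(P and Q) <-> (not-P or not-Q)
    assert (not (p and q)) == ((not p) or (not q))
    # Second De Morgan: not(P or Q) <-> (not-P and not-Q)
    assert (not (p or q)) == ((not p) and (not q))
print("Both De Morgan equivalences hold on every row.")
```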

Now I’m going to use another truth table to prove a property of PL’s rules. For that, though, I’ll want a better notation than \land  and \lor : whenever I have the logical connective and (\land ), I’ll replace it with a multiplication. So, for instance, P\land Q  = PQ. Whenever I have the logical connective or (\lor ) I’ll replace it with an addition. So, for instance, P \lor Q  = P + Q. If you’ve read my posts on Bayesian Reasoning and Probability Theory you’ll see that that’s my preferred notation.

\begin{array}{ccc|cc} P & Q & R & Q+R & P(Q+R) \\ \hline 1 & 1 & 1 & 1 & 1 \\ 1 & 1 & 0 & 1 & 1 \\ 1 & 0 & 1 & 1 & 1 \\ 1 & 0 & 0 & 0 & 0 \\ 0 & 1 & 1 & 1 & 0 \\ 0 & 1 & 0 & 1 & 0 \\ 0 & 0 & 1 & 1 & 0 \\ 0 & 0 & 0 & 0 & 0 \end{array}

P(Q + R) is a logical conjunction of P and (Q + R). So whenever P is true and (Q + R) is true, P(Q + R) is, too. But when is (Q + R) true? When either Q or R or both are true. And therefore we have the above truth table. Now, you can interpret that under a different light:

\begin{array}{ccc|ccc} P & Q & R & PQ & PR & PQ+PR \\ \hline 1 & 1 & 1 & 1 & 1 & 1 \\ 1 & 1 & 0 & 1 & 0 & 1 \\ 1 & 0 & 1 & 0 & 1 & 1 \\ 1 & 0 & 0 & 0 & 0 & 0 \\ 0 & 1 & 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 \end{array}

Now we have a new proposition: PQ + PR. When is that true? When either PQ or PR or both are true. And again, PQ is true only when both P and Q are true, and PR is true only when both P and R are true. So in the table I inserted all parts needed to find out whether PQ + PR is true. And what did we find?

Well, the truth table for PQ + PR is exactly like the one for P(Q + R). This is called the distributive property. It’s reminiscent of the corresponding property in arithmetic: if you have a logical conjunction over a logical disjunction, you can distribute the conjunction over the disjunction.

Now, a property that maths doesn’t have is that in PL, the “addition” also distributes over the “multiplication”: P + (QR) = (P + Q)(P + R). I’ll leave that one for you to prove, too. Now you have enough tools to prove whatever’s provable within PL.
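
Both distributive laws can likewise be verified by brute force over the eight rows; an illustrative sketch:

```python
from itertools import product

for p, q, r in product([True, False], repeat=3):
    # Conjunction distributes over disjunction: P(Q + R) = PQ + PR
    assert (p and (q or r)) == ((p and q) or (p and r))
    # Disjunction distributes over conjunction: P + QR = (P + Q)(P + R)
    assert (p or (q and r)) == ((p or q) and (p or r))
print("Both distributive laws hold for all truth-value assignments.")
```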

—-

I mentioned that it’s possible to have only one operator instead of the aforementioned four (conjunction, disjunction, negation, implication). I’ll show you that now.

There are, in fact, two possible operators that, on their own, can reconstruct all of the above ones. They are the NAND and the NOR operators:

  • P\uparrow Q is the NAND operator. It is logically equivalent to \overline{(P\land Q)} .
  • P\downarrow Q is the NOR operator. It is logically equivalent to \overline{(P\lor Q)} .

So, suppose you forgot about conjunction, negation, etc. You only know… NAND. Can you recover them with only NAND? Certainly:

  • P\uparrow P\equiv \bar P gets negation.
  • (P\uparrow P)\uparrow (Q\uparrow Q) \equiv (P\lor Q) gets disjunction.
  • (P\uparrow Q)\uparrow (P\uparrow Q) \equiv (P\land Q)  gets conjunction.

And of course, since we know implication is just a combination of negation and disjunction, we’re done. We can recover all of the regular propositional operators from NAND alone. I’ll leave it to you to prove that we can do the same with NOR.
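
And here is an enumeration check that the three NAND constructions above really do reproduce negation, disjunction and conjunction (a sketch):

```python
from itertools import product

def nand(p, q):
    # P NAND Q: false only when both are true.
    return not (p and q)

for p, q in product([True, False], repeat=2):
    assert nand(p, p) == (not p)                      # negation
    assert nand(nand(p, p), nand(q, q)) == (p or q)   # disjunction
    assert nand(nand(p, q), nand(p, q)) == (p and q)  # conjunction
print("NAND alone recovers negation, disjunction and conjunction.")
```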

—-

Remember when I said Propositional Logic is a natural extension/formalisation of Aristotelian Logic? Well, I lied. AL is still a little bit more… flexible than PL. That’s because PL can only reason about whole sentences, taken as indivisible units. I have the sentences P = ‘Socrates is a man’ and Q = ‘Socrates is mortal’ but I can’t really talk about properties of “all men” and stuff like that. That’s actually covered by First-Order Logic, of which I’m going to speak at a later time. Seeya!

What is a mathematical proof?

The answer is… much simpler than one might think. A mathematical proof is just a set of steps that only produce true statements given the premises.

“Huh?”

Well, when you want to prove something mathematically, you start with a few premises A, B, C, etc. And then you apply operations upon the premises that will always yield true statements given those premises, and you get to where you wanted to.

“Um… what exactly… does that mean?”

Well, I’ll show you a way not to do it. I’ll start with the premise A: x = y plus the usual rules of arithmetic. And then I’ll produce a number of operations upon this premise and prove that 1 = 2.

“What?!”

Don’t worry, the proof will be wrong. And I’ll show where. But let’s go, then.

\begin{aligned}  x &= y\\  x^2 &= xy\\  x^2 + x^2 &= x^2 + xy\\  2x^2 - 2xy &= x^2 - xy\\  2(x^2 - xy) &= 1(x^2 - xy)\\  2&=1  \end{aligned}

Let’s see what I did wrong.

  1. Premise.
  2. Multiplied both sides by x.
  3. Added x^2 to both sides.
  4. Subtracted 2xy from both sides.
  5. Rearranged the numbers.
  6. Divided both sides by (x^2-xy).

As I said, we need all the steps to produce only true results. If our premise is an equality, then which of the above operations on an equality doesn’t always produce true statements?

  • Multiplying both sides of an equality by the same number always produces true statements.
  • Adding the same number to both sides of an equality always produces true statements.
  • Subtracting the same number from both sides of an equality always produces true statements.
  • Rearranging the symbols we’re using, without changing their truth-value, is a null-operation.
  • Dividing both sides of an equality by the same number always produces true statements, unless that number happens to be zero.

Unless that same number happens to be zero. We had defined that x = y, right? So what is (x^2-xy)?

x^2 - xy = x*x - x*y = x*x - x*x = 0

“Oh! So since one of the operations you performed didn’t always produce a true statement, you picked exactly the case where it wouldn’t, and with that you proved something impossible!”

Yes, exactly. Just look at it: 0*3 = 0*528. That’s obviously true. In fact, you can replace 3 and 528 by any numbers at all, and that will still be a true statement. But you can’t divide both sides by 0, because that is not an operation that produces only true statements from true statements. And that’s exactly what we did, by cleverly hiding the invalid step behind murky language.
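
If you try to mechanise the bogus proof, the invalid step announces itself: with x = y, the factor you end up dividing by is exactly zero. A small illustrative sketch:

```python
x = y = 3  # any value works, as long as x == y (the premise)

# Steps 2-5 all preserve the equality:
assert x**2 == x * y                     # multiplied both sides by x
assert x**2 + x**2 == x**2 + x * y       # added x^2 to both sides
assert 2*x**2 - 2*x*y == x**2 - x*y      # subtracted 2xy from both sides
assert 2*(x**2 - x*y) == 1*(x**2 - x*y)  # rearranged (factored) both sides

# Step 6 divides both sides by (x^2 - xy), which is...
print(x**2 - x*y)  # 0 -- so that division is not a truth-preserving operation
```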

So that’s what a mathematical proof is: a number of operations applied on true statements that always yields true statements. Let me give you an example of a mathematical proof that is valid, and was used in another post of mine.

Premises:

\begin{aligned}  A&: P(X)+P(\bar X) = 1\\  B&:0 < P(X) < 1\\  C&:P(X) + aP(\bar X)<1  \end{aligned}

Since I’m always talking about probabilities, I will use this opportunity to show what a mathematical proof is with one that is useful to the logic of a previous post. Based on these premises, I want to prove the conclusion Z: a < 1. But first I’ll give you the meaning of the premises.

Premise A means that the probability that a proposition is true plus the probability that it is false equals 100%. That is, no matter what the proposition X stands for, we are always sure that it’s either true or it’s not.

Premise B means that we think X is possible, but not certain (in fact, we can never be certain of anything at all, so that’s always valid).

Premise C means that if you multiply the probability that X is false by some number a and add the result to the probability that X is true, you get a number that’s smaller than one.

Conclusion Z means that the only way the above premises are true is if a < 1.

Let’s go step-by-step. First, we have premise C.

1.\ P(X) + aP(\bar X) < 1

Now let’s perform an operation: adding 0 to the left-hand side of the inequality.

2.\ P(X) + aP(\bar X) + 0< 1

That is true because the operation of adding 0 to one side of an inequality doesn’t change its truth-value. Now we’ll remember that, by the definition of subtraction:

3.\ P(\bar X)-P(\bar X) = 0

Since a number is always equal to itself, subtracting a number from itself always equals 0. Now, substituting 3. on the left-hand side of 2.:

4.\ P(X) + aP(\bar X) + P(\bar X)-P(\bar X)< 1

Replacing a number by another that is equal to it is an operation that always maintains the truth-value of the statement. And now we’ll rearrange the above numbers.

5.\ P(X) + P(\bar X) + aP(\bar X)-P(\bar X)< 1

Now if we replace P(X) + P(\bar X) with 1:

6.\ 1+ aP(\bar X)-P(\bar X)< 1

This is a valid step because of premise A, and because replacing a number by another that is equal to it always maintains the truth-value of the statement.

7.\ aP(\bar X)-P(\bar X)< 0

We subtracted 1 from both sides of the inequality. Subtracting a number from both sides of an inequality always produces another true inequality. And so, we will add P(\bar X) to both sides, and get:

8.\ aP(\bar X)<P(\bar X)

And then we can divide both sides of the inequality above by P(\bar X) to reach our conclusion.

9.\ a< 1

We were allowed to do that because P(\bar X) is a positive number, and dividing both sides of an inequality by a positive number maintains its truth-value. I will leave it for you to prove that P(\bar X) is a positive number. You will only need premises A and B.

So we just performed a number of steps which, when working in conjunction with the previous statements, always produced new true statements. Since we know that about each of the steps, then we can trust the truth of the conclusion.
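
As a purely numerical sanity check of the argument (not a substitute for the proof), you can sample random values satisfying the premises and confirm the conclusion never fails; a sketch:

```python
import random

for _ in range(100_000):
    p = random.uniform(0.001, 0.999)   # P(X), premise B: 0 < P(X) < 1
    p_bar = 1 - p                       # P(not-X), from premise A
    a = random.uniform(-5, 5)           # candidate value for a
    if p + a * p_bar < 1:               # keep only cases satisfying premise C
        assert a < 1                    # conclusion Z never fails
print("In every sampled case where the premises held, a < 1.")
```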

Of course, I chose a rather simple example that could be reliably guessed by pure intuition. Still, it’s good to understand what it means to actually prove stuff mathematically. Formally, that is. Mathematics is an art, the art of discovering true things based on other true things. And sometimes you will discover the most incredible and counter-intuitive things by just applying a number of steps that produce only true statements from other true statements.

Talking to laypeople (or the inference gap)

Einstein once famously said that if you can’t explain it to your grandmother then you don’t understand it yourself. I’m too lazy to look it up but I’m pretty sure that when he said that he was talking about how no one really understood Quantum Mechanics because no one could explain it to anyone who didn’t know a lot of maths.

It’s been 100 years and now we understand Quantum Mechanics.

You may have noticed people still fail to teach it even to other scientists.

I think what Einstein missed in his analysis was exactly the sheer amount of time it would take to teach the really advanced topics to anyone. Many fields of knowledge, science most of all, have been built throughout the centuries. Human knowledge has been waxing by the day and it always builds upon previous knowledge.

So you can imagine just how frustrating it is when you discover the true behaviour of subatomic structures and you try explaining it to a journalist and their headline is “Quantum collapse: proof that souls exist!”

There is a reason why scientists, especially the ones working on fringe science, on discovering the new things, are so unwilling to talk to people outside their areas of expertise about their field. They have a hard time explaining to each other what their work is about, they’re creating new words to even be able to communicate those concepts. And then some silly journalist wants a nice soundbite about some recent discovery on QED and expects to be able to actually understand it and pass that knowledge on. And they get angry if you tell them they couldn’t possibly understand that.

But of course they couldn’t, they haven’t studied what they needed to!

Yudkowsky sometimes entertains the idea that maybe if Science was some Secret Conspiracy with an Initiation and Grandmasters and Secret Rituals more people would be interested in it. I can certainly see his point. People seem to think that since all of scientific knowledge is free (ha - says the one who hasn’t needed to buy all those huge textbooks and spend literally months and perhaps even years trying to solve problems in their field) it’s worthless and also should be easy to understand. If it had any value people would charge for it, no? It wouldn’t be available to everyone.

The thing is, it isn’t available to everyone. It takes dedication and discipline and study and hard work and years of practice before you can actually do science. And when that happens, you have a lot of knowledge. You’re starting to grok concepts that are too far removed from our daily intuitions, which were created to deal with food running across the savannah. And this gigantic body of knowledge you built throughout your years of study since you were a freshman is what’s called an inference gap between you and anyone who hasn’t studied the same things you have. It isn’t directly measurable in any strict sense, but it’s intuitively the distance between the knowledge needed to understand a concept and the knowledge possessed by the person who wants to learn it. It’s what they have to understand before they can understand the matter at hand.

To understand General Relativity you need to learn basic maths, basic geometry, calculus, differential geometry and mechanics. Each of these fields can be further expanded and elaborated upon, showing exactly what needs to be learned. It’s complicated enough that the textbooks needed to learn all of that can be piled up and serve as a table for four. Perhaps more than four.

It’s impossible to make a layperson truly understand what you’re working on, if it’s too removed from our daily reality, without first teaching them everything you had to learn before. Yes, you can certainly teach Quantum Mechanics to your grandmother, but only after she has already learned matrix calculus.

Orthodox test statistics and the absence of alternatives

Recently, while reading Jaynes’ Probability Theory: the Logic of Science, I was overcome by an urge to rant about some of the stuff I read there, and one of the things I said was that, basically, orthodox statistics sucks.

Well, first off, what is this orthodox statistics I’m talking about, and what would non-orthodox statistics be?

That… is a very long story. I’ll try to make it reasonably short. Basically, in the past few hundred years the field of “probability” grew up around a whole buncha things, all of them more or less about giving a formal description of how to draw valid conclusions from experiments, what you should believe, etc. It’s supposed to be, in Jaynes’ words, the logic of science.

Enter Sir R. A. Fisher and some others who were very opposed to the idea that probability theory as logic makes any sense, and then they completely destroyed the field with nonsense about potentially infinitely repeated experiments, ad hoc tools, and objective probabilities. I’m not going to go much further into this story, but rather show an example of the failures of orthodox statistics, and how Bayesian Probability Theory does it better.

I’ll use Pearson’s \chi^2  (read: chi-squared) test. If you have more than a passing acquaintance with orthodox statistics, you probably know the \chi^2  test. Even if you don’t, it’s not in fact too unlikely that you may have heard of it at some point or another. It’s something called a “test statistic” and it’s what orthodox statisticians use for hypothesis testing.

\chi ^2 \equiv \sum\limits_{i=1}^m\frac{(O_i-E_i)^2}{E_i}

It’s a purported measure of “goodness of fit” of a hypothesis to the observed data. Basically, given m possible outcomes of an experiment repeated a bunch of times (the data), the “null hypothesis” says that the expected number of times the i^{th} outcome is observed in the experiments is E_i . O_i  are the actual observed numbers. The closer the above number is to zero, then, the closer the observed distribution is to the expected one. And in fact, E_i  and O_i  don’t necessarily have to be numbers of trials; they can be any kind of random variable, as long as they’re all greater than or equal to 5, or they’re all greater than or equal to 1 and at least 20% of them are greater than or equal to 5 (the fact that this caveat exists should start making you suspicious).
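
For concreteness, here is what the statistic looks like as code; a minimal sketch, with hypothetical observed and expected counts for a die under a “fair die” null hypothesis:

```python
def chi_squared(observed, expected):
    """Pearson's chi-squared statistic: sum of (O_i - E_i)^2 / E_i."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Example: a die rolled 60 times, compared against the "fair die" expectations.
print(chi_squared(observed=[8, 12, 9, 11, 10, 10], expected=[10] * 6))  # 1.0
```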

Now, orthodox statisticians have developed quite a theory around this. Here’s how it works. There’s a probability density function called the chi-squared distribution, and if the random variables are “independent and standard normal,” the sum of their squares is distributed according to that pdf. Then, one compares the calculated value with that distribution (or, rather, with the cumulative distribution) and figures out what the “p-value” of that test statistic is.

[Figure: the chi-squared distribution for several values of the degrees of freedom k.]

In the above, the degrees of freedom is, intuitively, the number of things that “can vary” in the distribution. For example, when throwing a six-sided die, there are n = 6 outcomes but k = 5 degrees of freedom, because once we know the counts of the first five outcomes, the count of the sixth is fixed. The number k isn’t necessarily always equal to n – 1, however. It is rather n – p where p = s + 1 and s is the number of “co-variates” in the fit. This isn’t really important to the understanding of the concept, and different fittings have different numbers of co-variates.

But that only moves the question to: what is a p-value?

You might have heard of or seen papers and studies that say that a conclusion was reached with p < 0.05. The intuition behind the p-value is this: you take some test statistic (such as the \chi^2 ) and figure out the probability that, on the null hypothesis, that test statistic would come out at least as extreme (i.e. as far from what was expected) as the value actually observed. Therefore, in an experiment with, for instance, one degree of freedom (such as the tossing of a coin), values of \chi^2  greater than approximately 4 have a probability less than 5% of being observed.

And what’s the point of this? Orthodox statistics holds that if the p-value for your chosen statistic is less than some arbitrary number, then you should reject whatever your “null hypothesis” is (one would suppose that in the case of a coin the null hypothesis says the coin is fair). And… that’s it. Orthodox statistics tells you that if, “on chance alone,” there’s a less than (say) 5% probability of observing what you did, you should mark the null hypothesis as false.

Note that, for orthodox statistics, probability is the same thing as frequency. So what they really mean is that, if the results obtained would be observed in less than p (which is usually, as mentioned, 5%) of similar experiments if the null hypothesis were true, then you should reject that hypothesis.

Now, there is a huge number of things wrong with this picture. Let’s see what they all are.

First, the p-value chosen is completely arbitrary. And the usual one, 5%, is, to say the least, huge. A full one in twenty true null hypotheses would be rejected by it. If you’re not absolutely outraged by this, you should be. If I recall correctly, physics uses a value like 0.01, which is better, but only 5 times so, and still completely arbitrary.

Second, using different test statistics may give you different p-values. If you use the \chi^2  you may get p = 0.07 but using the Student’s t-test will get you p = 0.04 (I’m guessing here when using these two particular examples, for all I know these two may be correlated in a way that makes this impossible, but the general principle holds that two test statistics may not say the same thing). So even the choice of which test statistic to use is arbitrary (somewhat, because some test statistics can’t be used for some kinds of experiments).

Third, the list of possible test statistics to pick is also arbitrary! Why would you choose the \chi^2  as a measure of fit? Why compare your \chi^2  to some imaginary \chi^2  distribution as opposed to comparing, say, \chi , or \chi^3 , or \chi^4 ? Why use a t-distribution? Why those tests and not others? Since these test statistics are really only based upon certain statisticians’ intuitions, one shouldn’t expect to have a consistent answer to this.

Fourth, as I illustrated with the conditions of applicability of the \chi^2  test, they’re not universally valid. Since they’re ad hoc tools, they’re only applicable to some specific domains where they’re supposed to more-or-less work somewhat reliably.

And fifth… a p-value makes absolutely no mention of alternatives or prior knowledge. It says you should “reject” a hypothesis if its arbitrary p-value is below some arbitrary threshold, and “keep” it otherwise, but it doesn’t tell you what to use in its place. To the orthodox statistician, prior knowledge is metaphysical nonsense, and so is talking about the probability of a hypothesis P(H|X)! But if you can’t talk about that, then you can’t really say whether you should trust your hypothesis, and what they offer instead are these test statistics that only tell you how well your hypothesis predicted the data, and they don’t do a very good job at that either.

See, this is what I mean when I say sometimes I look at orthodox statistics and screech in frustration.

One could argue that if some hypothesis H  gives a low p-value, then you could just use the hypothesis \bar H  instead, which is just the “ensemble of all hypotheses that are not H .” That too makes no sense because… well, \bar H  is an average of every hypothesis that’s not H , and it’s quite likely that the vast majority of them are worse fits to the data than H  itself, which would make \bar H  have an even lower p-value than H.

In Bayesian terms, given some observed data D  and some statistic s(D) , the p-value is defined as P(s(x) \text{ at least as extreme as } s(D) | HX) , which we’ll call just P(s|HX) . Now, P(s|\bar HX) = \frac{P(\bar H|sX)P(s|X)}{P(\bar H|X)} which… doesn’t actually make any reference to P(s|HX) ! That is, the p-value of some data on some hypothesis H is not directly coupled to the p-value of that data on its negation, and they can be both arbitrarily low!

So if you don’t specify an alternative hypothesis, orthodox statistics is silent. It just tells you that its arbitrary, ad hoc tool says your null hypothesis is bad and you should feel bad.

And as I mentioned, we’re not in fact interested in P(s|HX) , we’re interested in P(H|DX) which is:

P(H|DX) = \frac{P(D|HX)P(H|X)}{P(D|HX)P(H|X)+P(D|\bar HX)P(\bar H|X)}

So even if we choose to approximate P(D|HX) by P(s|HX) , this still does mention P(s|\bar HX) , which can be less than P(s|HX) , and it also mentions the prior knowledge P(H|X) , which is completely ignored by the test statistic itself.

Furthermore, the fact that the statistic is arbitrary means that it will almost surely give you nonsensical results when pushed to a domain it wasn’t specifically designed to deal with, unless by luck you happen to get a test statistic that is actually derivable from Bayes’ Theorem. I’ll show you both things now with a little story.

Suppose Mr. E is on Earth with a British pound coin and his assigned probabilities are that, after it’s tossed, the probability of heads is 0.499, the probability of tails is 0.499, and the probability of landing on the exact edge is 0.002. However, he tells Ms. M, in Mars, only that he’s running an experiment that has three possible outcomes. By the principle of indifference, Ms. M’s null hypothesis is that each outcome has a 1/3 probability of being observed.

Then Mr. E tosses the coin 29 times, and it turns up heads 14 times, tails 14 times, and edge once. He tells her that, and they calculate the \chi^2  of that data based on their null hypotheses.

Mr. E’s expected results are 29*0.499 heads, 29*0.499 tails, and 29*0.002 edges. His \chi^2  is, then:

\chi^2_E = 2\frac{(14-29*0.499)^2}{29*0.499}+\frac{(1-29*0.002)^2}{29*0.002} = 15.33

Now, this is a bit disconcerting. If you look at a chi-squared distribution table, it tells us that after observing this data, Mr. E should reject his hypothesis as false!

But wait, it gets worse. Now Ms. M calculates her own \chi^2 :

\chi^2_M = 2\frac{\left(14-29*\frac 1 3\right)^2}{29*\frac 1 3}+\frac{\left(1-29*\frac 1 3\right)^2}{29*\frac 1 3} = 11.66

The table also says that her hypothesis should be rejected, true, but the value of the \chi^2  says that it’s a better fit than Mr. E’s! She will look at her \chi^2 , look at Mr. E’s, and say, “Well, our hypotheses are both fairly flawed, but yours is certainly worse than mine!”

That’s clearly preposterous. Most people trained to use the \chi^2  test will be very surprised by this, and will try to check this calculation, but it’s correct, and it’s just what one would expect from using completely arbitrary tools not derived from basic principles. Note that this isn’t even outside of the scope of the test, given that 2/3 of the quantities are greater than 5 and 1/3 is equal to 1.
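
You can reproduce both numbers directly; a sketch, with a small chi_squared helper of my own naming:

```python
def chi_squared(observed, expected):
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

n = 29
observed = [14, 14, 1]  # heads, tails, edge

mr_e_expected = [n * 0.499, n * 0.499, n * 0.002]
ms_m_expected = [n / 3] * 3

print(chi_squared(observed, mr_e_expected))  # ~ 15.33
print(chi_squared(observed, ms_m_expected))  # ~ 11.66
```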

How would a Bayesian do it, then?

Well, I have mentioned that the strength of a hypothesis can only be measured against other hypotheses. So Bayes will never tell you to outright reject some hypothesis unless it can clearly point you to an alternative.

And with that philosophy… it turns out that we can, in fact, devise a test that’s more or less like a test statistic, but has the advantages of being derivable from first principles (which means it is universally valid), and it doesn’t actually say “reject this hypothesis” or anything like that, but rather says “this data cannot support competing hypotheses by more than some given amount.”

I’ll explain. I’ve talked about the evidence form of Bayes’ Theorem, which is exactly equivalent to it. Suppose I have some background information X, some data D, and a hypothesis H. Then, if I’m comparing it to any single alternative hypothesis H' (that is, in my restricted universe of discourse the negation of H is H'), I have that:

e(H|DX)=e(H|X)+10\log_{10}P(D|HX)-10\log_{10}P(D|H'X)

Now I’m going to rewrite the above as

e(H|DX) = e(H|X)-\psi

where

\psi\equiv-10\log_{10}P(D|HX)+10\log_{10}P(D|H'X)

Since P(D|HX) and P(D|H'X) are both \leq 1, their logarithms are negative or zero, so the first term of the above is positive and the second is negative. Now, hold H fixed, and let H' range over every possible hypothesis. \psi attains its maximum value when H' is the “sure thing” hypothesis that says that everything that was observed couldn’t have been otherwise, P(D|H'X) = 1. In that case, we have \log_{10}P(D|H'X)=0 and can define the following quantity:

\psi_{\infty}\equiv-10\log_{10}P(D|HX)>0

And thus

e(H|DX)\geq e(H|X)-\psi_{\infty}

necessarily. This means that there is no possible alternative hypothesis which data D could support, relative to H, by more than \psi_{\infty} decibels. This is to say that Probability Theory cannot answer “How well does data D support hypothesis H?” because that question makes no sense; rather, it answers “Are there any alternatives H' which data D would support relative to H? How much support is possible?”

And this is what I’d call a “Bayesian test statistic.” It’s a number that depends exclusively on the hypothesis under consideration and the data, and the closer it is to zero, the better hypothesis H fits data D. And unlike the \chi^2 test, this was derived directly from Bayes’ Theorem, which means that it will not suffer from the weaknesses of the ad hoc orthodox tools. So, even though this statistic doesn’t directly mention an alternative hypothesis, it’s still Bayesian in spirit because it’s based on an implicit class of alternative hypotheses.

And this number is also, unlike the \chi^2, universal. You can use it for any class of hypotheses. But if we do restrict our attention to the domain where the \chi^2 is applicable, we can actually do even better than the \psi_{\infty}.

Suppose the hypothesis H is in the “Bernoulli class” (that is, n consecutive trials with m possible outcomes, where each trial’s outcome probabilities are independent of the other trials, like the toss of a coin). If we call the k^{th} trial x_k, then the data D is just all of those outcomes one after the other, and we have that

 P(D|HX)=P(x_1...x_n|HX)=p_1^{n_1}...p_m^{n_m}

where the n_k are the number of times the k^{th} outcome was observed in those n trials, and p_k are the probabilities that a given trial had that outcome, with \sum_kp_k=1 and \sum_kn_k=n being the total number of trials.

Now, given any observed sequence of outcomes D, the hypothesis that fits it best is the one that predicts the exact observed frequencies. To show this, let’s call the observed frequencies f_k=\frac{n_k} n. Using the fact that, on the positive real line, \log(x)\geq(1-\frac 1 x) with equality iff x = 1, and choosing x = \frac{n_k}{np_k}, we have that

\sum\limits_{k=1}^mn_k\log\left(\frac{n_k}{np_k}\right)\geq\sum\limits_{k=1}^mn_k\left(1-\frac{np_k}{n_k}\right)\geq 0

with equality iff p_k=f_k, and with m being the number of possible outcomes. Why is this relevant? If we call H^* the hypothesis of perfect fit:

\begin{aligned}  \sum\limits_{k=1}^mn_k\log\left(\frac{n_k}{np_k}\right) &= \sum\limits_{k=1}^m(\log f_k^{n_k}-\log p_k^{n_k}) \\  &= \log\left(\prod\limits_{k=1}^mf_k^{n_k}\right)-\log\left(\prod\limits_{k=1}^mp_k^{n_k}\right) \\  & = \log P(D|H^*X) - \log P(D|HX)  \end{aligned}

So for any hypothesis H, the above is 0 only when H=H^*, which means the hypothesis with the best possible fit in the Bernoulli class is exactly H^*.

Now, let’s bring \psi back:

\psi\equiv-10\log_{10}P(D|HX)+10\log_{10}P(D|H'X)

But if you take H' to be H^*, then that’s just the thing we just derived. So we have a refinement:

\psi_B\equiv 10\sum\limits_{k=1}^mn_k\log_{10}\left(\frac{n_k}{np_k}\right)

The closer this number is to zero, the better your hypothesis fits the data. And it’s stronger than the \psi_{\infty}, because it restricts its attention to the Bernoulli class, and perfect fit means that the observed frequencies were exactly the predicted ones.
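
In code, \psi_B is just as easy to compute as \chi^2; a sketch (the convention that a never-observed outcome, n_k = 0, contributes nothing to the sum is made explicit):

```python
from math import log10

def psi_b(counts, probs):
    """Bayesian 'test statistic' psi_B, in decibels, for a Bernoulli-class hypothesis.

    counts: observed counts n_k for each of the m outcomes
    probs:  probabilities p_k the hypothesis assigns to each outcome
    """
    n = sum(counts)
    return 10 * sum(
        n_k * log10(n_k / (n * p_k))
        for n_k, p_k in zip(counts, probs)
        if n_k > 0  # an outcome that was never observed contributes nothing
    )
```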

Now let’s see an interesting thing. Define a quantity:

\Delta_k=\frac{n_k-np_k}{np_k}

With this we can find the following three identities:

\sum\limits_{k=1}^mnp_k\Delta_k=\sum\limits_{k=1}^m\left(n_k-np_k\right)=\sum\limits_{k=1}^mn_k-n\sum\limits_{k=1}^mp_k = n-n = 0

1+\Delta_k=1+\frac{n_k-np_k}{np_k}=\frac{n_k}{np_k}

np_k\left(1+\Delta_k\right)=n_k

And let’s take a look at our \psi_B again. We can reexpress it using the last two identities:

\psi_B = 10\sum\limits_{k=1}^mnp_k\left(1+\Delta_k\right)\log_{10}\left(1+\Delta_k\right)

Using the Taylor expansion of the logarithm around 1, we have that \ln(1+\Delta_k)=\Delta_k-\frac 1 2\Delta_k^2+\frac 1 3\Delta_k^3 \pm .... Note that this expansion is only valid when \Delta_k is relatively small (it has to be between -1 and 1). We can use that:

\begin{aligned}  \psi_B &= 10\sum\limits_{k=1}^mnp_k(1+\Delta_k)\log_{10}(1+\Delta_k) \\  &\approx 10\sum\limits_{k=1}^mnp_k(1+\Delta_k)\left(\Delta_k-\frac 1 2\Delta_k^2+\frac 1 3\Delta_k^3 \pm ...\right) \\  &= 10\sum\limits_{k=1}^mnp_k\left[\left(\Delta_k-\frac 1 2\Delta_k^2 + \frac 1 3 \Delta_k^3 \pm ...\right)+\left(\Delta_k^2 - \frac 1 2\Delta_k^3+\frac 1 3\Delta_k^4 \pm ... \right)\right] \\  &= 10\sum\limits_{k=1}^mnp_k\left(\Delta_k+\frac 1 2\Delta_k^2 - \frac 1 6\Delta_k^3 \pm ...\right) \\  & \approx 10\sum\limits_{k=1}^mnp_k\left(\Delta_k+\frac 1 2\Delta_k^2\right) \\  & = 10\sum\limits_{k=1}^mnp_k\Delta_k +5\sum\limits_{k=1}^mnp_k\Delta_k^2 \\  & = 5\sum\limits_{k=1}^mnp_k\Delta_k^2  \end{aligned}

The second line is an approximation because we’re taking the logarithm to base 10 instead of the natural logarithm, and the last equality is true because of our first identity. Now if you expand that last sum a bit:

\sum\limits_{k=1}^mnp_k\Delta_k^2 = \sum\limits_{k=1}^mnp_k\left( \frac{n_k - np_k}{np_k} \right)^2 = \sum\limits_{k=1}^m\frac{(n_k-np_k)^2}{np_k}

But wait. n_k are just the observed outcomes O_k and np_k are the expected outcomes E_k of the \chi^2 test! And this means (correcting for the base of the logarithm) that:

\psi_B\approx 2.17\sum\limits_{k=1}^m\frac{(n_k-np_k)^2}{np_k} = 2.17 \sum\limits_{i=1}^m\frac{(O_i-E_i)^2}{E_i} \propto \chi^2

Or, in other words, the surprising result is that \psi_B \propto \chi^2!

Well. Not really. See, what’s really the case is that \chi^2 is a second-order approximation to \psi_B when the condition that all the \Delta_k have modulus less than one is met. And predictably, when that condition is not met, we will find the \chi^2 test to be lacking.

Let’s get back to our tossed coin example. What are the associated \psi_B?

\begin{aligned}\psi_{B_E} &= 10\left[28\log_{10}\left(\frac{14}{29*0.499}\right) + \log_{10}\left(\frac{1}{29*0.002}\right)\right] \\  &= 8.34\text{dB} \\  \psi_{B_M} &= 10\left[28\log_{10}\left(\frac{14}{29*\frac 1 3}\right)+\log_{10}\left(\frac 1 {29*\frac 1 3}\right)\right] \\  &= 35.19\text{dB}  \end{aligned}

The results agree much more nicely with our intuitions. What Ms. M finds out is that there is another hypothesis about the coin in the Bernoulli class that’s 35.2dB better than hers (that’s odds of over 3300:1), and that Mr. E’s hypothesis is better than hers by some 26.8dB and is only 8.34dB away from the best hypothesis in the Bernoulli class.

Why does this happen? Why is the \chi^2 test saying such outrageously different things?

There are two reasons. The first is that the squared term in the sum severely overpenalises outliers. The second is that you then divide that squared difference by the expected value, which, when the expected value is tiny, inflates the term even further. To see this, let’s suppose that instead of observing an edge, we observed tails, so that there were 14 heads and 15 tails. In this case:

 \begin{aligned}  \psi_{B_E} = 0.30\text{dB} \ \ \ \ \ \ \ \ \chi_E^2 = 0.0925 \\  \psi_{B_M} = 51.2\text{dB} \ \ \ \ \ \ \ \ \chi_M^2 = 14.55  \end{aligned}

Now they agree. You see, under Mr. E’s hypothesis, he should’ve observed 14.471 heads, 14.471 tails, and 0.058 edges, and the \chi^2 amplified hugely that unexpected outcome of 1 edge.
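
Here is a sketch that reproduces both comparisons, up to rounding, with the same helpers as before:

```python
from math import log10

def chi_squared(observed, expected):
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def psi_b(counts, probs):
    n = sum(counts)
    return 10 * sum(n_k * log10(n_k / (n * p_k))
                    for n_k, p_k in zip(counts, probs) if n_k > 0)

mr_e = [0.499, 0.499, 0.002]   # heads, tails, edge
ms_m = [1 / 3, 1 / 3, 1 / 3]
n = 29

for observed in ([14, 14, 1], [14, 15, 0]):
    expected_e = [n * p for p in mr_e]
    expected_m = [n * p for p in ms_m]
    print(observed,
          round(psi_b(observed, mr_e), 2), round(chi_squared(observed, expected_e), 4),
          round(psi_b(observed, ms_m), 2), round(chi_squared(observed, expected_m), 2))
```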

A Bayesian test statistic talks only about comparisons between hypotheses, and Bayes would never tell you to reject a hypothesis unless it had a better alternative to offer. Furthermore, even then it wouldn’t say “reject it,” it would rather say “this hypothesis is a better fit for the data than that one,” and then you’d need to integrate your prior knowledge into it to know exactly what your posterior probabilities for the hypotheses ought to be.

Other ways of looking at probability

Bayes’ Theorem is all nice and dandy, but it may not necessarily be the best thing to work with. It’s quite simple:

P(A|BX) = \frac{P(B|AX)}{P(B|X)}P(A|X)

When laid out this way, what it says is that the probability of some proposition A after you learn that B is true equals its probability before you knew it was true times a factor we could call the impact of B on A . When explaining what the Bayesian meaning of evidence was, I mentioned the obvious thing that if B is more likely when A is true, then it’s evidence for A, and if B is less likely when A is true, it’s evidence against it.

However, we can look at it another way. If we expand P(B|X) over a large set of mutually exclusive and exhaustive propositions \{A_0, A_1, A_2, ...\} , then we can reexpress the theorem as

P(A_0|BX) = \frac{P(B|A_0X)P(A_0|X)}{\sum_{i}P(B|A_iX)P(A_i|X)}

When you look at it like that, then the denominator becomes nothing but a normalising factor meant to guarantee that your probabilities sum to 1, and it’s P(B|A_0X) which does all the hard work. And then that term gets two different names depending on which way you look at it.

To explain it, let’s suppose A is some hypothesis we’re studying and B is the data collected. Let’s rename them H and D accordingly. Rewriting the theorem:

P(H_0|DX) = \frac{P(D|H_0X)P(H_0|X)}{\sum_iP(D|H_iX)P(H_i|X)}

The term P(D|H_iX) , when considered as a function of the data D for a fixed hypothesis H_i , is generally called the sampling distribution. For example, suppose the data is a string of results of a binary experiment and the hypothesis is that these results are in fact Bernoulli trials – which means that the probability that each trial will come out one way or another depends only on the specific way it can turn out, and not on previous trials. An example would be the tossing of a coin, and a possible hypothesis is that the coin turns up heads with probability p_i  and tails with probability q_i = 1 - p_i. Then, if we suppose that data D consists of a series of outcomes – for instance, HTTHH would mean there was one head, followed by two tails, followed by two heads -, the probability that that data D would be observed under hypothesis H_i is P(D|H_iX) = p_i^{n_H}q_i^{n_T} where n_H is the number of observed heads and n_T is the number of observed tails. Another way to write it would be P(D|H_iX) = p_i^{n}(1-p_i)^{N-n} where n is the number of heads and N is the total number of tosses. In that, then, for any given hypothesis H_i , the sampling distribution is a function f(D;H_i) = p_i^{n}(1-p_i)^{N-n} of n and N while holding p_i fixed.

However, if you look at P(D|H_iX) as a function of the hypothesis for some fixed dataset D , then it’s called the likelihood \mathscr L(H_i;D) , and although it’s numerically equal to the sampling distribution, it’s a function of the parameter space \{p_0, p_1, p_2, ...\} while holding n and N fixed. Unlike the sampling distribution, it’s not seen as a probability, but rather a numerical function that, when multiplied by some prior and a normalisation factor, becomes a probability. Because of that, constant factors are irrelevant, and any function \mathscr L(H_i;D) = y(D)P(D|H_iX) is equally deserving of being the likelihood, where y(D) is a function exclusively of the data and independent of the hypotheses under consideration.

Now, if you take the ratio of P(H|DX) and P(\bar H|DX) you get

\frac{P(H|DX)}{P(\bar H|DX)}=\frac{P(H|X)}{P(\bar H|X)}\frac{P(D|HX)}{P( D|\bar HX)}

and the prior for D drops out. We call that ratio the odds on the proposition H , O(H|DX)\equiv\frac{P(H|DX)}{P(\bar H|DX)}, and combining both equations we have:

O(H|DX) = O(H|X)\frac{P(D|HX)}{P(D|\bar HX)}

This form has a very nice intuitive meaning, and it’s better for calculating Bayesian updates. In that, if something has probability 0.5, then it has odds 0.5:0.5 or 1:1 (read one-to-one). If something has probability 0.9, then it has odds 9:1 (nine-to-one) and we know immediately that it’s nine times more likely to be true than to be false. Now, since P(A) = 1-P(\bar A) , the odds transformation is just O = \frac{P}{1-P} , and if I have the odds, I can transform it back to probabilities by using P = \frac{O}{1+O} .

And that last term is called the likelihood ratio. Using Yudkowsky’s example:

Let’s say that I roll a six-sided die: If any face except 1 comes up, there’s a 10% chance of hearing a bell, but if the face 1 comes up, there’s a 20% chance of hearing the bell. Now I roll the die, and hear a bell. What are the odds that the face showing is 1? Well, the prior odds are 1:5 (corresponding to the real number 1/5 = 0.20) and the likelihood ratio is 0.2:0.1 (corresponding to the real number 2) and I can just multiply these two together to get the posterior odds 2:5 (corresponding to the real number 2/5 or 0.40). Then I convert back into a probability, if I like, and get (0.4 / 1.4) = 2/7 = ~29%.

Furthermore, if you have more than one hypothesis at stake – suppose if face 1 comes up there’s a 20% chance of hearing the bell, if face 2 comes up there’s a 5% chance of hearing the bell, and if any other face does there’s a 10% chance of hearing the bell -, then you can use extended odds to calculate your posteriors. In this case, before you throw the die, the prior odds are 1:1:4, and the extended likelihood ratio for hearing a bell is 0.2:0.05:0.1 or 4:1:2. If you throw a die and hear a bell, your posterior odds will be 4:1:8, and your posterior probabilities will be 4/13 = 30.77% for face 1, 1/13 = 7.69% for face 2, and 8/13 = 61.54% for any other face. This is much easier than using Bayes’ Theorem directly.
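
The whole calculation is short enough to write out; a sketch of the odds-form update for the die-and-bell example (function names are mine):

```python
def normalise(odds):
    total = sum(odds)
    return [o / total for o in odds]

# Two-hypothesis case: face 1 vs any other face.
prior_odds = [1, 5]
likelihoods = [0.2, 0.1]   # P(bell | face 1), P(bell | other face)
posterior_odds = [p * l for p, l in zip(prior_odds, likelihoods)]
print(normalise(posterior_odds))   # [2/7, 5/7] ~ [0.29, 0.71]

# Extended case: face 1, face 2, any other face.
prior_odds = [1, 1, 4]
likelihoods = [0.2, 0.05, 0.1]
posterior_odds = [p * l for p, l in zip(prior_odds, likelihoods)]
print(normalise(posterior_odds))   # [4/13, 1/13, 8/13]
```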

And a final way to look at probabilities is by using the evidence function, e(H|DX) \equiv 10\log_{10}O(H|DX) , which is measured in decibels and cashes out to:

e(H|DX) = e(H|X) + 10\log_{10}\left[\frac{P(D|HX)}{P(D|\bar HX)}\right]

Now, while this doesn’t keep the niceness of odds when dealing with more than two hypotheses, it has a few other advantages of perspective. The first is that it’s additive: if a given hypothesis has prior probability 0.01 or prior evidence of -20dB, and you observe evidence that’s 1,000 times more likely when that hypothesis is true than when it’s false, the evidence shift is 10\log_{10}1,000 = 30\text{dB} and the posterior evidence is -20\text{dB}+30\text{dB}=10\text{dB} which is a posterior probability of 0.91. As new pieces of evidence are added to the mix, we just add and subtract to the evidence thus far collected to arrive at our final conclusions.
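
The decibel bookkeeping is a one-liner each way; a small sketch of the example above:

```python
from math import log10

def prob_to_evidence(p):
    return 10 * log10(p / (1 - p))    # probability -> decibels of evidence

def evidence_to_prob(e):
    odds = 10 ** (e / 10)
    return odds / (1 + odds)          # decibels -> probability

prior_e = prob_to_evidence(0.01)      # ~ -20 dB
shift = 10 * log10(1000)              # evidence 1,000x likelier if H is true: +30 dB
print(prior_e, prior_e + shift, evidence_to_prob(prior_e + shift))  # posterior ~ 0.91
```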

The second niceness is the one that shows that 0 and 1 are not probabilities. How much evidence would it take to raise a hypothesis to certainty?

e(H|X) = 10\log_{10}\frac{P(H|X)}{P(\bar H|X)} = 10\log_{10}\frac 1 0 = +\infty

And the symmetric argument shows that the evidence needed for negative certainty is also infinite. Just like positive and negative infinity aren’t real numbers, they’re just representations of the boundlessness of real numbers, so are 0 and 1 just representations of the extreme limits of perfect platonic certainty.

And the final interesting perspective is best seen when looking at this graph of evidence against probability:

[Graph: evidence e(H) in decibels plotted against probability, with the curve shooting off towards \pm\infty as the probability approaches 0 and 1.]

To get from 10% probability to 90% probability, you just need 20dB of evidence. Well, “just” 20dB means evidence that’s 100 times more likely when your hypothesis is true than otherwise. But look at the extremes of that graph. There are two singularities. The one close to 1 shows that, once you are very sure of your hypothesis, evidence can change your mind only very little. The distance between 0.5 and 0.6 in terms of evidence is much much less than the distance between 0.999 and 0.9999.

The other singularity, however, shows one of the most important things in probability theory: the vast majority of the work needed when proving your hypothesis is in just figuring out which hypothesis is the right one. Think about a phenomenon you want to explain, like gravity. Think about all the possible hypotheses that could explain it. Little angels pushing massive stuff. A witch did it. Invisible ropes tied around every particle. A force. The curvature of spacetime. Take that last hypothesis alone. Saying “spacetime curves” is still a very general statement. How does it curve? What’s the degree of curvature? There’s a huge number of possible equations that could describe the topology of spacetime under the influence of mass.

Hypothesis-space is gigantic, it’s so big you can’t even imagine it. The amount of evidence you need just to point somewhere in it, just to say “the right hypothesis looks like this” or “the right hypothesis is in this area,” is astoundingly huge. Do you see that? Once you’ve actually found the correct hypothesis, the work necessary to become reasonably sure of it is nothing, it’s negligible.

When asked what he would’ve said if, in 1919, Sir Arthur Eddington had failed to confirm General Relativity, Einstein famously replied, “Then I would feel sorry for the good Lord. The theory is correct.” But that’s not just arrogance (27 bits = 80dB of evidence, to translate Yudkowsky’s notation to ours). Just to pinpoint those particular equations to describe Nature, Einstein needed a lot of work. And he probably had much more evidence than strictly required, too, because humans often underestimate the probability shift they should suffer under new evidence and are very inefficient.

Just to find the right answer, you already need to work your butt off. Otherwise, you’re jumping to an 80dB conclusion with much less than 80dB of evidence. The vast majority of the work is there. Incidentally, that’s why it’s so hard to find the elusive Theory of Everything. Just to pinpoint the correct equations we need a lot of work.

After you’ve found the correct hypothesis, confirming it is comparatively a piece of cake.

Don’t be so sure…

One of the great insights of Bayes’ Theorem is the gradation of belief. This is in fact not how most people intuitively reason! Most people have this intuitive feeling of black-and-white, zero-or-one, believe-or-don’t-believe. When they’re thinking about something they want to believe, they think “Does the available evidence allow me to believe it?” and when they’re thinking about something they don’t, they think “Does the available evidence force me to believe it?”

But if you start a more consistent form of reasoning, one that follows Bayes’ rules, then that “binariness” disappears immediately. And one of the simplest and most direct consequences of that is that there is nothing you can reason about that will ever get to be a certain thing. In other words, there are (almost, I’ll get to it in a bit) no propositions A such that P(A)=1 or P(A)=0 .

To see that, let’s take a look at the product rule. For any two propositions A and B while reasoning on background information X :

P(A|BX)P(B|X) = P(A\land B|X) = P(B|AX)P(A|X)

Let’s show that taking any proposition to absolute certainty leads us to nonsensical results. Suppose we want to reason about the propositions A = “The sky is blue.” and B = “I see the sky being green.”

If I’m absolutely sure that the sky is blue, that means P(A|X) = 1 , right? Furthermore, there are a few other conclusions. Reasoning naïvely, if the sky is blue, I couldn’t possibly be seeing it be green; conversely, if I see it as being green, it couldn’t possibly be blue. Therefore, P(A|BX) = P(B|AX) = 0 , and the product rule just says that 0 = 0 , and I learn nothing. Everything is undefined.

Let’s reason in a more mature way. It’s not, in fact, true that it’s impossible for me to see a green sky under the hypothesis that it’s blue, nor that it be blue under the hypothesis that I’m seeing it be green. Maybe I’m wearing yellow-coloured glasses. Maybe I have a rare genetic condition. So let’s say that both P(A|BX) and P(B|AX) are nonzero but very small. In that case, we can use Bayes’ Rule:

P(A|BX) = \frac{P(B|AX)P(A|X)}{P(B|X)}

However, we can reexpress the denominator:

P(A|BX) = \frac{P(B|AX)P(A|X)}{P(B|AX)P(A|X)+P(B|\bar AX)P(\bar A|X)}

If I’m positively certain that the sky is blue, then the above is just:

P(A|BX) = \frac{P(B|AX)*1}{P(B|AX)*1+P(B|\bar AX)*0} = \frac{P(B|AX)}{P(B|AX)} = 1

What does that mean? It means that, if you’re absolutely certain of a proposition, then nothing you could possibly observe would ever change your mind about it. You’d just find new explanations – maybe you developed a rare eye disease while you were asleep, maybe someone is pranking you, maybe you’re hallucinating -, but you’d never ever conclude that you were wrong about the sky being blue. 0 and 1 are not probabilities, in the same way that +\infty and -\infty are not real numbers (in fact, that connection is actually exact, as I’ll show later).
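
You can see the same thing numerically: no matter how strongly the observation favours ‘not blue’, a prior of exactly 1 never moves. A sketch, with made-up numbers for illustration:

```python
def posterior(prior_a, p_b_given_a, p_b_given_not_a):
    """P(A|B) by Bayes' rule, with the denominator expanded over A and not-A."""
    numerator = p_b_given_a * prior_a
    denominator = numerator + p_b_given_not_a * (1 - prior_a)
    return numerator / denominator

# Seeing a green sky is far likelier if the sky is not blue...
print(posterior(prior_a=0.999, p_b_given_a=1e-6, p_b_given_not_a=0.5))  # belief collapses
print(posterior(prior_a=1.0,   p_b_given_a=1e-6, p_b_given_not_a=0.5))  # stays exactly 1
```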

So saying that something you believe has probability 1 (or 0) is equivalent to saying that nothing will ever change your mind. Now, I don’t know about you, but I personally would like there to be some amount of evidence that was enough to change my beliefs.

There are exceptions to that, however. The way Probability Theory as logic is defined, we have that P(\top) = 1 and P(\perp) = 0 . That is to say, the probability of a proposition that’s true has to be 1, and the probability of a proposition that’s false has to be 0. And there is a certain class of propositions that are “absolutely” true or false: tautologies and contradictions.

Contradictions are saying something you know to be false. From propositional logic, we have that A\land \bar A = \perp. That is, a proposition that says a thing and its negation is always false. Thus, they must have the same probability: P(A\land \bar A) = P(\perp) = 0 . But the product rule still applies, and so P(A\land\bar A) = P(A|\bar A)P(\bar A) = 0. Since that has to be true for every possible proposition A, it follows that P(A|\bar A) = 0: the probability of a contradiction is always 0.

Conversely, a tautology is just reaffirming something you already know to be true. We know that A\land A = A . That is, saying the same thing twice doesn’t change anything. From that, P(A\land A) = P(A) . But the product rule applies, from which P(A\land A) = P(A|A)P(A) = P(A) . And this also has to be true regardless of what proposition A may be, which means P(A|A) = 1: the probability of a tautology is always 1.

So, in general, if the propositions you’re reasoning about include a contradiction, their probability will be 0, whereas tautologies can be struck out and add nothing new.

But of course, in real life you don’t really ever reason about “pure” propositions. Your background knowledge will never include “The sky is blue.” or “Peano Arithmetic is sound.” It will only include things like “I believe the sky is blue.” or “It seems to be the case that Peano Arithmetic is sound.” or, in general, for any proposition A, “I think A is true (or false).” You’re always reasoning from inside your head. So even if you do observe something that looks like it ought to be logically forbidden by your background knowledge, that just means your background knowledge is, in fact, wrong.

In real life, 0 and 1 are not probabilities. Ever. That’s all.

What is evidence?

Simply put, evidence is any observation that changes your probability assignments for a given hypothesis. That’s pretty much it. We can further define evidence for a hypothesis as one that makes the hypothesis more likely, and evidence against a hypothesis as one that makes it less likely.

Generally, for some observation to affect your probability estimates of anything, there needs to be some cause-and-effect chain connecting the observation and the hypothesis. For instance, the hypothesis “It has rained.” and the observation “I see the street is wet.” are connected: it rains, the water makes the street wet, then photons coming from somewhere bounce off the street and hit your retinas, which send an electrical signal to your brain provoking you to make the observation.

That’s valid for anything at all. The chain can be even longer: I may call my friend and tell them that it has rained, and so their probabilistic assignment that it has rained will change, because there is a causal chain connecting the rain to them.

And that’s why other people’s information is evidence for stuff. Because there is some cause-and-effect chain between the stuff and them telling you that the stuff is true.

Information travelling between brains amongst honest folk is evidence, just as observing the event itself is (although not quite as much, since the more steps there are in the chain, the weaker the resulting update will be).

And evidence has an interesting mathematical meaning, too. Suppose you have some hypothesis H , and some observation X . X  is evidence about H  if P(H|X) \neq P(H) . But what is P(H|X) ? By Bayes’ Theorem:

P(H|X) = \frac{P(X|H)}{P(X)}P(H)

So for P(H|X) to not be equal to P(H) , we must have that \frac{P(X|H)}{P(X)} \neq 1  which automatically means that P(X|H) \neq P(X) .

Now let’s characterise mathematically evidence for and against a hypothesis.

P(H|X) > P(H) \Leftrightarrow P(X|H) > P(X)

The meaning of the above is that X is evidence for H  if and only if X is more likely to be observed when H is true than otherwise. Likewise:

P(H|X) < P(H) \Leftrightarrow P(X|H) < P(X)

X is evidence against H  if and only if X is less likely to be observed when H is true than otherwise.
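
A tiny numerical illustration of these two conditions, with all numbers invented for the example:

```python
# Hypothesis H: "It has rained."  Observation X: "The street is wet."
p_h = 0.3                    # prior P(H)
p_x_given_h = 0.9            # a wet street is very likely if it rained
p_x_given_not_h = 0.2        # ...and much less likely otherwise

p_x = p_x_given_h * p_h + p_x_given_not_h * (1 - p_h)   # P(X) = 0.41
p_h_given_x = p_x_given_h * p_h / p_x                   # Bayes' Theorem

print(p_x_given_h > p_x, p_h_given_x > p_h)   # both True: X is evidence for H
```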

But Probability Theory doesn’t deal with hypotheses and observations. It deals exclusively with propositions. So in fact, what H and X actually mean propositionally, in the equations, is:

H : “Hypothesis H is true.”

X : “Observation X has been made.”

Therefore, the probabilities have different meanings:

P(H) : “The probability that hypothesis H is true.”

P(X) : “The probability that observation X will be made.”

P(H|X) : “The probability that hypothesis H is true, given that observation X has been made.”

P(X|H) : “The probability that observation X will be made, given that hypothesis H is true.”

But the rest of the reasoning is perfectly sound (mathematics works by any name you give it).