Bayes’ Theorem has many, many introductions online already. Those show the intuition behind using the theorem. This is going to be a step-by-step mathematical derivation of the theorem, as Jaynes explained it in his book Probability Theory: The Logic of Science. However, he himself skipped a bunch of steps and didn’t always make his reasoning as clear as possible, so what I’m going to do here is elaborate, expand, and explain his steps.

The maths can be quite complex, but I think anyone can follow the ideas. But still, maths cw! So, let’s go, shall we?

**The Desiderata**

What we want is to create a way to measure our uncertainty of propositions. Or rather, to measure their *plausibility*. We want to know exactly how sure we are that something is true. We won’t give many constraints, though. We’re trying to be minimal in our axioms here. So the first desideratum is

- Degrees of plausibility are represented by real numbers.

We’re not going to say anything about any upper or lower bounds. We don’t know yet which real numbers should represent certainty. All we know is that real numbers must be used to represent our plausibility.

We will adopt a convention that says that a greater plausibility will be represented by a greater number. This isn’t necessary, of course, but it’s easier on the eye. We shall also suppose continuity, which is to say that infinitesimal increases in our plausibility should yield infinitesimal increases in its number.

And even this axiom is not incredibly intuitive. Sometimes you don’t even have any idea of how plausible a thing is. However, we’re trying to design an optimal method of reasoning, and I think this is a reasonable thing to expect. You frequently have to make decisions based on incomplete information, and there *is* some meaningful sense in which you think some states-of-the-world are *more* or *less* plausible than others, *more* or *less* likely to happen. So it’s that meaningful sense we’re trying to capture here.

- Qualitative correspondence with common sense

This is a sort of catch-all axiom, and it’s very important. It’s the axiom that says that the meaning of (A|B) is “the plausibility of A, given that B is true.” The argument this proof will try to make is that certain things are *desirable* of an agent’s reasoning process, and that at the end, we’ll arrive at certain rules. Even if these desirable things can have multiple interpretations, we’re taking one that says that the proper meaning of things like (A|B) is that you have knowledge that B was observed, so conditional on *that* knowledge, the plausibility for A is (A|B). Under this interpretation, we *do* prove the reasoning rules, which means that violating those rules implies that you violated some of the desiderata.

For instance, we should expect that if we observe evidence in favour of something, that something should be more plausible, and vice-versa; that is, if I have that some evidence B makes A more likely, then my plausibility (A|BC) should be greater than (A|C), and also (Ā|BC) should be less than (Ā|C).

So the way this axiom works is: sometimes I will invoke it, and justify it on something one would expect to be fairly reasonable assumptions for belief-updating. At the end, I will show that these assumptions pin the rules of probability down uniquely, and any agent that reasons in a way that’s not isomorphic to these rules will therefore necessarily be violating at least one of these assumptions.

An interesting feature of these desiderata is that *time isn’t mentioned anywhere in them*. And it shouldn’t be! Your reasoning has to be time-independent and in a certain sense objective, and time doesn’t get in *at all*. These rules are about states of uncertainty conditional on knowledge, and thus your reasoning depends exclusively on your *knowledge itself* and not on when it was obtained.

- Consistency

This is in fact a stronger claim than it looks. This system of measuring probability will have a bunch of properties which we label collectively “consistency,” namely the fact that two ways of arriving at a result should give the same result, every bit of information should be taken into account, and equivalent states of knowledge are represented by the same numbers.

An important point about this is that this assumption is about *states of knowledge* and not *logical status*. It may very well be that two propositions are logically equivalent or otherwise connected, but an agent is only constrained by that if they *know* about this logical link (as I discuss here and here).

And now, believe it or not… we’re done. This is enough for us to find Bayes’ Theorem.

**The Product Rule**

So, suppose we have three statements, A, B, and C. We want to find out the plausibility (AB|C). That is, after having found out that C is true (or equivalently, assuming that C is your background knowledge), how plausible is it that both A and B are true?

We can follow two paths to find that out. We can first reason about whether B is true, and then having accepted that figure out whether A is true; or we can reason about whether A is true, and then having accepted that figure out whether B is true. By the consistency desideratum, we need these two methods to yield the same result.

That result is the plausibility (AB|C). So let us reason about this a bit. For this to be true, it is necessary, obviously, that B be true. Therefore, the plausibility (B|C) should be involved somewhere. Then, if B is true, it is also necessary that A be true, so (A|BC) should be there. But if B is false, then it doesn’t matter what we know about A, AB will be false. Thus if we reason first about B, the plausibility of A will only be relevant if B was found to be true, and we don’t need, after we have (B|C) and (A|BC), the plausibility (A|C). That would tell us nothing useful.

Also, since AB = BA, the above argument stays the same if we exchange B and A. Therefore, this whole thing boils down to the existence of some function F(x, y), where x and y are plausibilities, that takes (B|C) and (A|BC), or conversely (A|C) and (B|AC), and returns (AB|C). In other words:

(AB|C) = F[(B|C), (A|BC)] = F[(A|C), (B|AC)]

You can also check, if you want, that this is the only form our function can take to return (AB|C), because any other combination of (A|C), (B|C), (A|BC), and (B|AC) will give you spurious results in some extreme cases, such as when A is certain given C, or when A is impossible given C, or when A and B are the same proposition, or stuff like that.

As a concrete example, suppose we suspect our function is actually of the form (AB|C) = F[(A|C), (B|C)]. Suppose A = “The right eye of the person next to me is brown” and B = “The left eye of the person next to me is black.” Either can be very plausible, so (A|C) and (B|C) can be arbitrarily high, but a person with one black and one brown eye is something very rare indeed. This is then an application of the common sense desideratum, and we end up with some function F that has the form we defined above.
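To see why no function of (A|C) and (B|C) alone can work, here’s a minimal numeric sketch (with invented numbers): two toy joint distributions that agree on the plausibilities of A and of B separately, but disagree on the plausibility of AB.

```python
# Sketch: two toy worlds where A and B have the same separate
# plausibilities but different joint plausibilities, showing that
# no function of (A|C) and (B|C) alone can return (AB|C).
# (The numbers are invented for illustration.)

# World 1: A and B independent.
world1 = {  # distribution over the four combinations of (A, B)
    (True, True): 0.25, (True, False): 0.25,
    (False, True): 0.25, (False, False): 0.25,
}
# World 2: A and B mutually exclusive (brown right eye / black left eye).
world2 = {
    (True, True): 0.0, (True, False): 0.5,
    (False, True): 0.5, (False, False): 0.0,
}

def marginal_a(w): return sum(p for (a, _), p in w.items() if a)
def marginal_b(w): return sum(p for (_, b), p in w.items() if b)
def joint_ab(w):   return w[(True, True)]

# Same separate plausibilities in both worlds...
assert marginal_a(world1) == marginal_a(world2) == 0.5
assert marginal_b(world1) == marginal_b(world2) == 0.5
# ...but different joints, so F[(A|C), (B|C)] cannot be the right form.
assert joint_ab(world1) != joint_ab(world2)
```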

Now, one of the aspects of the first desideratum was continuity. We wanted that arbitrarily small increases in plausibility meant arbitrarily small changes in the real number. But the same is valid for our operations: we want that arbitrarily small changes in x and y result in arbitrarily small changes in F(x, y). Not only that, but we want increases in x and y to correspond to increases in F(x, y), and decreases in x and y to correspond to decreases in F(x, y).

In mathspeak, that is to say, basically, that F is continuous and monotonic increasing in both arguments:

F₁(x, y) ≡ ∂F(x, y)/∂x > 0,  F₂(x, y) ≡ ∂F(x, y)/∂y > 0

(We don’t need to assume differentiability, but our work will get easier if we do, and the same result will be achieved otherwise anyway, so might as well. We’ll also assume F’s second derivatives are continuous. This can be taken as a common-sense-axiom requirement of “well-behavedness” of the subjective plausibilities.)

Now, suppose I have four propositions, A, B, C, and D, and I want to find out (ABC|D), which is the plausibility of A, B, and C, given that D is true. We know that ABC = (AB)C = A(BC), so our function must have that:

(ABC|D) = F[(BC|D), (A|BCD)] = F[(C|D), (AB|CD)]

But we can just reapply this function, because (BC|D) = F[(C|D), (B|CD)] and (AB|CD) = F[(B|CD), (A|BCD)]. Thus:

F{F[(C|D), (B|CD)], (A|BCD)} = F{(C|D), F[(B|CD), (A|BCD)]}

Eek. Looking at this on a computer screen gets really scary. Let’s make this easier to visualise by renaming our plausibilities:

x = (C|D),  y = (B|CD),  z = (A|BCD)

In that case, then, we can rewrite the above as:

F[F(x, y), z] = F[x, F(y, z)]

Now things are starting to look neater. Our function F must be *associative*. This is yet another expression of our desire that no matter what way we use to get to a result, we’ll always get the same result. The order doesn’t matter, as long as you always consider everything you know and follow the rules neatly.
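As a concrete sanity check, ordinary multiplication (which is, up to a regrading, where this derivation will end up) satisfies the associativity equation, while a made-up combination rule like averaging does not. A quick sketch:

```python
# Sketch: the associativity equation F(F(x, y), z) == F(x, F(y, z)),
# checked numerically for multiplication (which satisfies it) and for
# averaging (which does not, so it cannot serve as our function F).

def F_mul(x, y):
    return x * y

def F_avg(x, y):
    return (x + y) / 2

x, y, z = 0.2, 0.5, 0.9

# Multiplication is associative in the required sense:
assert abs(F_mul(F_mul(x, y), z) - F_mul(x, F_mul(y, z))) < 1e-12

# Averaging is not: the two orders of combination disagree.
assert F_avg(F_avg(x, y), z) != F_avg(x, F_avg(y, z))
```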

To make the next steps even easier on our eyes, let’s add two new letters, u and v:

u = F(x, y),  v = F(y, z)

So our equation is reduced to

F(u, z) = F(x, v)

Since those functions have to be identical all the time, we have that their first derivatives must be, too. Differentiating both sides with respect to x, and then with respect to y:

F₁(u, z)·F₁(x, y) = F₁(x, v)
F₁(u, z)·F₂(x, y) = F₂(x, v)·F₁(y, z)

(The above is true because of the chain rule, and these F₁ and F₂ are the partial derivatives of F with respect to its first and second arguments, as defined before.)

Let’s define a new function, G(x, y):

G(x, y) ≡ F₂(x, y)/F₁(x, y)

So if we get rid of F₁(u, z) in the two first equations by dividing the second by the first:

G(x, y) = G(x, v)·F₁(y, z)   (1)

Now, look, the left side depends exclusively on x and y, which means the right side must necessarily be independent of z also. Now, if we multiply both sides of the above equation by G(y, z) we get

G(x, y)·G(y, z) = G(x, v)·F₂(y, z)   (2)

As we’ve seen, (1)’s right-hand side, G(x, v)·F₁(y, z), must be independent of z, and so this must mean that:

∂/∂z [G(x, v)·F₁(y, z)] = 0

However, if we take the partial derivative of the right-hand side of (2) with respect to y, we get (remembering that v = F(y, z), so ∂v/∂y = F₁(y, z) and ∂v/∂z = F₂(y, z), and writing G₂ for the partial derivative of G with respect to its second argument):

∂/∂y [G(x, v)·F₂(y, z)] = G₂(x, v)·F₁(y, z)·F₂(y, z) + G(x, v)·F₂₁(y, z)

while expanding the derivative in the previous equation the same way gives

∂/∂z [G(x, v)·F₁(y, z)] = G₂(x, v)·F₂(y, z)·F₁(y, z) + G(x, v)·F₁₂(y, z)

By Schwarz’ Theorem, we have that F₁₂(y, z) = F₂₁(y, z). This means those two derivatives are the same, and therefore are both 0. And *this* means that the left-hand side of (2) must be independent of y, too, which is to say that G(x, y)·G(y, z) is independent of y.

Then we’re trying to find the most general solution to the above constraints. More specifically, we want the constraint “G(x, y)·G(y, z) is independent of y” to be obeyed by the function G whatever it may be. And the most general type of solution for that is

G(x, y) = r·H(x)/H(y)

where r is some constant and H(x) is some function of x. We’re not specifying what it is, it’s not important right now. It can be shown that every solution must have that form, too. If we replace that in (1) and (2) we end up with these:

F₁(y, z) = H(v)/H(y)
F₂(y, z) = r·H(v)/H(z)

Now, recall that v = F(y, z). Thus we can take the differential of v: dv = F₁(y, z)·dy + F₂(y, z)·dz. Replacing the two equations above in this differential form, we have that:

dv/H(v) = dy/H(y) + r·dz/H(z)   (3)

Now suppose we create another function:

w(x) ≡ exp( ∫ˣ dt/H(t) )

so that dw(x)/w(x) = dx/H(x).

Then, by manipulating (3) and the function w (integrating (3) turns each term into the logarithm of a w, and the constant of integration can be absorbed into w’s definition), we arrive at:

w(v) = w[F(y, z)] = w(y)·w(z)ʳ

And this is, of course, more general than just y and z: any two variables can be put there, and the equality will hold.

Now, remember all the way back, our association rule? F[F(x, y), z] = F[x, F(y, z)]? Or, with our new variables, F(u, z) = F(x, v). We can take the w function of both sides: w[F(u, z)] = w[F(x, v)]. According to the rule derived above, this will give us

w(u)·w(z)ʳ = w(x)·w(v)ʳ

Reminding ourselves that u = F(x, y) and v = F(y, z) and replacing in the above, that gives us

w(x)·w(y)ʳ·w(z)ʳ = w(x)·w(y)ʳ·w(z)^(r²)

It’s clear that the only way for us to have a nontrivial solution (i.e. r ≠ 0 and w not constant) is to have r = r², that is, r = 1:

w[F(y, z)] = w(y)·w(z)

And we can finally replace stuff there at the beginning:

w[(AB|C)] = w[(B|C)]·w[(A|BC)]

And this is symmetrical with respect to A and B, of course:

w[(AB|C)] = w[(A|C)]·w[(B|AC)]
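If you’d like a concrete anchor, ordinary conditional probabilities (which we haven’t yet proven w must be!) obey exactly this rule. A sketch with an invented joint distribution:

```python
# Sketch: numeric check of w(AB|C) = w(B|C) * w(A|BC) when w is an
# ordinary conditional probability on a toy joint distribution.
# (The distribution is invented for illustration; C is the implicit
# background knowledge.)

joint = {  # probabilities over the four combinations of (A, B)
    (True, True): 0.12, (True, False): 0.28,
    (False, True): 0.18, (False, False): 0.42,
}

w_AB = joint[(True, True)]                        # w(AB|C)
w_B = joint[(True, True)] + joint[(False, True)]  # w(B|C)
w_A_given_B = joint[(True, True)] / w_B           # w(A|BC)

assert abs(w_AB - w_B * w_A_given_B) < 1e-12

# And symmetrically, with A and B exchanged:
w_A = joint[(True, True)] + joint[(True, False)]  # w(A|C)
w_B_given_A = joint[(True, True)] / w_A           # w(B|AC)
assert abs(w_AB - w_A * w_B_given_A) < 1e-12
```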

Now, let’s do some work. First, suppose A is a direct logical consequence of some part of C, and that knowledge is also contained in C. That is, whenever C is true, so is A, and the agent knows this. That means (AB|C) = (B|C), because the plausibility of AB, given that you know C, is exclusively conditional on the plausibility of B, since A is certain when C is true anyway. Furthermore, since, knowing C, A is certain anyway, we have that (A|BC) = (A|C), because no matter what one learns in addition to C, A is already certain. We can see these constraints as another aspect of the common sense axiom. When that is the case, then:

w[(B|C)] = w[(B|C)]·w[(A|BC)] = w[(B|C)]·w[(A|C)]

It follows that this function has to represent complete certainty by having that w[(A|C)] = 1.

Now, suppose A is in fact *impossible* given C. That is to say, if we know C, we know that A cannot be true in any way. Then we have that (AB|C) = (A|C), because the plausibility of AB given C is wholly determined by A given C: if A is impossible, so is AB. And as above, (A|BC) = (A|C), because once you know C, you know A is impossible, and therefore nothing else you learn will affect your plausibility of A. These constraints are also common-sense-axiom constraints. That means, then, that:

w[(A|C)] = w[(B|C)]·w[(A|C)]

And since this must hold for arbitrary B, the values that satisfy it are either w[(A|C)] = 0 or w[(A|C)] = +∞. And indeed, we can work with either 0 ≤ w ≤ 1 or 1 ≤ w ≤ +∞, and both w and 1/w are as good as any other functions to define our work. Therefore, we choose as a *convention* to have impossibility represented by w = 0, to have this w, whatever it is, increase with the plausibility of its statements, as we desired at the beginning.

Now… what exactly *is* this work?

We said that (A|C) is the real number associated with the plausibility of the proposition A, given that we know C is true. But we don’t need to work with it anymore, because we have this weird function w which has the rules defined above: w[(AB|C)] = w[(B|C)]·w[(A|BC)], with w = 1 for certainty and w = 0 for impossibility. This function was defined with regards to a weird H(x) and a weird r and a weird integral, which we *didn’t* in fact even define.

And guess what? *We don’t need to.* We can work solely and exclusively with , and it’s good enough to represent our knowledge. And your hunch is probably close to correct. Just wait until the next rule is defined, and we’ll be good to go.

**The Sum Rule**

Now, these propositions about whose plausibility we’re reasoning must follow, according to our original desiderata, common sense. Which means in particular that they must follow basic logic. So the proposition AĀ (that is, A is true *and* A is false) must always be false, because no proposition can be true and false at the same time, and therefore w[(AĀ|B)] = 0; conversely, the proposition A + Ā (that is, A is true *or* A is false) must always be true, because every proposition must be either true or false, and therefore w[(A + Ā|B)] = 1.

Not only that, but there must be some relation between the plausibilities w[(A|B)] and w[(Ā|B)]. And if we define

u ≡ w[(A|B)],  v ≡ w[(Ā|B)]

then there must be some function S such that v = S(u). Also, since the actual meaning of A is arbitrary and we can just as well apply the same relation with Ā in the place of A, it must also be true that u = S(v). That is, the function that relates w[(A|B)] and w[(Ā|B)] is its own inverse: S(S(u)) = u, or S = S⁻¹.

Moreover, we need S(0) = 1 and S(1) = 0, because since exactly one of A and Ā is true, when w[(A|B)] represents certainty (and equals 1), w[(Ā|B)] must represent impossibility (and therefore equal 0), and vice-versa. The common sense axiom constrains us thusly.

Let’s take the plausibilities w[(AB|C)] and w[(AB̄|C)]. By the product rule, we have that

w[(AB|C)] = w[(A|C)]·w[(B|AC)]
w[(AB̄|C)] = w[(A|C)]·w[(B̄|AC)]

for whatever propositions A and B. We can rewrite this last equation when w[(A|C)] ≠ 0:

w[(B̄|AC)] = w[(AB̄|C)] / w[(A|C)]

And because S is its own inverse, by the definition of S we know that w[(B|AC)] = S[w[(B̄|AC)]], and thus

w[(AB|C)] = w[(A|C)]·S[ w[(AB̄|C)] / w[(A|C)] ]

And of course, since A and B are interchangeable in AB, it must be the case that:

w[(AB|C)] = w[(B|C)]·S[ w[(ĀB|C)] / w[(B|C)] ]

And the above has to hold whenever w[(A|C)] and w[(B|C)] aren’t 0. In particular, it also has to hold when B̄ = AD, where D is any other proposition you might like. And in that case, we have that AB̄ = B̄, and that ĀB = Ā.
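Those two Boolean identities are easy to verify by brute force over truth values, as a quick sketch:

```python
# Sketch: brute-force truth-table check that if B-bar = A·D (that is,
# B is defined as not-(A and D)), then A·B-bar = B-bar and
# A-bar·B = A-bar, for every truth assignment of A and D.
checked = 0
for a in (True, False):
    for d in (True, False):
        b = not (a and d)                  # define B via B-bar = A·D
        assert (a and not b) == (not b)    # A·B-bar = B-bar
        assert ((not a) and b) == (not a)  # A-bar·B = A-bar
        checked += 1
assert checked == 4  # all four truth assignments were covered
```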

And we apply the definition of S once again, using w[(B̄|C)] = S[w[(B|C)]] and w[(Ā|C)] = S[w[(A|C)]]:

w[(A|C)]·S[ S[w[(B|C)]] / w[(A|C)] ] = w[(B|C)]·S[ S[w[(A|C)]] / w[(B|C)] ]

To make things easier to see, let’s name a few new symbols:

x ≡ w[(A|C)],  y ≡ w[(B|C)]

Then

x·S[S(y)/x] = y·S[S(x)/y]

(One might be interested in the fact that if we set y = 1 the above is reduced to S[S(x)] = x, which agrees with what we’ve already discussed. Of course, this is necessary, and if it hadn’t been the case then some step in our derivation must have been faulty.)

And you see, what we want to find out here is the shape of this function S. What it looks like. So it’s much like the function F we had before, and like before, we won’t really care too much about its specifics. It will be a tool in helping us develop our rules.

So we have these constraints that must be obeyed by S, and we need to figure out a way to find it. One such way would be to study S as its argument approaches 1. Let’s define a new function J such that

S(1 − e^{-q}) = e^{-J(q)}

As q approaches positive infinity, 1 − e^{-q} approaches 1 and S(1 − e^{-q}) approaches S(1) = 0, so J(q) approaches positive infinity too; and since S is decreasing, J is increasing. So if we figure out how J behaves in that case, we’ll figure out how S behaves at the edges (and since S must be its own inverse, its behaviour near 1 determines its behaviour near 0 as well).

So first, let’s take the definition above and invert it, applying S to both sides:

S(e^{-J(q)}) = 1 − e^{-q}

Now, in our consistency equation x·S[S(y)/x] = y·S[S(x)/y], fix y and let x = 1 − e^{-q}, with q going to infinity. We can use a series expansion of 1/x = (1 − e^{-q})⁻¹ around e^{-q} = 0:

S(y)/x = S(y)·(1 + e^{-q} + e^{-2q} + …) = S(y) + S(y)·e^{-q} + O(e^{-2q})

That is to say that, when q is very large, if we approximate 1/x using 1 + e^{-q}, the error we’ll be making will be of order e^{-2q}. Since we’re taking q to be approaching infinity, this error is very close to 0.

Let’s make a pause to explain the next step. We know that S(y)·e^{-q} will get exponentially tiny as q gets close to infinity. Let’s do a bit of Calculus:

f′(t) = lim_{d→0} [f(t + d) − f(t)]/d

This is the definition of a derivative. So, for sufficiently tiny d, we have that f(t + d) ≈ f(t) + d·f′(t). And you know what’s sufficiently tiny? S(y)·e^{-q} is sufficiently tiny. So if we take the case where f is S, t is S(y), and d is S(y)·e^{-q}, we can approximate:

S[S(y)/x] ≈ S[S(y)] + S(y)·e^{-q}·S′[S(y)]

Do you remember that S[S(y)] = y, though? In that case, if we differentiate both sides, we get that S′[S(y)]·S′(y) = 1, which is to say S′[S(y)] = 1/S′(y). Using this fact on the above, the left-hand side of the consistency equation becomes:

x·S[S(y)/x] ≈ (1 − e^{-q})·[y + e^{-q}·S(y)/S′(y)] = y + e^{-q}·[S(y)/S′(y) − y] + O(e^{-2q})

Meanwhile, by the definition of J, the right-hand side is y·S[e^{-J(q)}/y]. Setting the two equal and dividing everything by y:

S[e^{-J(q)}/y] = 1 + e^{-q}·(1/y)·[S(y)/S′(y) − y] + O(e^{-2q})

Now, since 1/y = e^{-ln y}, the argument on the left is e^{-J(q)}/y = e^{-[J(q) + ln y]}. So let p be the number such that J(p) = J(q) + ln y. By the inverted definition of J above, the left-hand side is then S(e^{-J(p)}) = 1 − e^{-p}, and rearranging:

e^{-p} = e^{-q}·(1/y)·[y − S(y)/S′(y)] + O(e^{-2q})

Since S is decreasing, S′(y) < 0 and the bracket is positive, so we can take the natural logarithm of both sides:

p = q − c(y) + O(e^{-q}),  where c(y) ≡ ln[1 − S(y)/(y·S′(y))]

And now we use that trick with a differential from before once more, this time with f being J, t being q − c(y), and d being the O(e^{-q}) error term:

J(q) + ln y = J(p) = J(q − c(y)) + O(e^{-q})

In other words:

J(q) − J(q − c(y)) = −ln y + O(e^{-q})

The last term, ugly as it is, goes to 0 as fast as e^{-q} when q approaches infinity. From that we conclude that J has asymptotically linear behaviour, because other than that order-of-e^{-q} term, that difference is independent of q: shifting q by the fixed amount c(y) always changes J by the same fixed amount −ln y, and this holds for every y we might pick. Therefore:

J(q) ≈ a + b·q for large q

and b is just some positive constant (it has to be positive because J is increasing). Comparing the two expressions for the difference, b·c(y) = −ln y, which by the definition of c(y) means:

ln[1 − S(y)/(y·S′(y))] = −(1/b)·ln y

If we invent a variable m given by

m = 1/b

and exponentiate both sides, with this variable we have:

1 − S(y)/(y·S′(y)) = y^{-m}

With some manipulation we can make this better for our purposes:

S′(y)/S(y) = y^{m−1}/(yᵐ − 1)

So we have a differential equation which we can solve: the left-hand side is the derivative of ln S(y), and the right-hand side is the derivative of (1/m)·ln(1 − yᵐ). Integrating and exponentiating, with K as the constant of integration:

S(y)ᵐ = K·(1 − yᵐ)

We have the boundary conditions S(0) = 1 and S(1) = 0: the first imposes K = 1, and the second is then automatically satisfied. We have finally found our function S(x):

S(x) = (1 − xᵐ)^{1/m},  with 0 ≤ x ≤ 1 and m > 0

However, we found this function by assuming that B had a special value such that B̄ = AD for some D, and also that w[(A|C)] and w[(B|C)] are nonzero, which means that the above function is a *necessary* but maybe not sufficient condition for the consistency requirement given by

x·S[S(y)/x] = y·S[S(x)/y]

Let’s see what happens if we replace S in the above:

x·S[S(y)/x] = x·[1 − (1 − yᵐ)/xᵐ]^{1/m} = (xᵐ + yᵐ − 1)^{1/m}

And this is symmetric in x and y, so both sides of the requirement reduce to the same expression whenever they’re defined.

Now if we apply the product rule:

wᵐ[(AB|C)] + wᵐ[(AB̄|C)] = wᵐ[(A|C)]·( wᵐ[(B|AC)] + wᵐ[(B̄|AC)] ) = wᵐ[(A|C)]

and this holds for any propositions whatsoever, with no special assumptions about B: the product rule and the sum rule defined by this S never contradict each other.

So this S(x) is in fact a necessary and sufficient condition for consistency and we’re done.
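If you want to check this numerically, here’s a sketch verifying that S(x) = (1 − xᵐ)^{1/m} really does satisfy the boundary conditions, S(S(x)) = x, and the consistency equation, for a few arbitrary values of m:

```python
# Sketch: numeric check that S(x) = (1 - x**m)**(1/m) satisfies
# S(0) = 1, S(1) = 0, S(S(x)) = x, and the consistency equation
# x*S(S(y)/x) = y*S(S(x)/y), for a few arbitrary values of m.

def make_S(m):
    return lambda x: (1 - x**m) ** (1 / m)

for m in (0.5, 1.0, 2.0, 3.7):
    S = make_S(m)
    # Boundary conditions:
    assert abs(S(0.0) - 1.0) < 1e-12 and abs(S(1.0)) < 1e-12
    # S is its own inverse:
    for x in (0.3, 0.6, 0.9):
        assert abs(S(S(x)) - x) < 1e-9
    # Consistency equation (we need S(y) <= x so that all the
    # arguments stay inside [0, 1]):
    x, y = 0.9, 0.8
    if S(y) <= x:
        lhs = x * S(S(y) / x)
        rhs = y * S(S(x) / y)
        assert abs(lhs - rhs) < 1e-9
```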

**The rules of probability**

If you recall the definition of S(x), we had to have S(x)ᵐ + xᵐ = 1. Therefore:

wᵐ[(A|B)] + wᵐ[(Ā|B)] = 1

And it doesn’t matter what value of m we pick (as long as it’s positive). Furthermore, our product rule works equally well with that:

wᵐ[(AB|C)] = wᵐ[(B|C)]·wᵐ[(A|BC)]

So if we pick some value for m (let’s say, m = 1), we can just define the function:

P(A|B) ≡ wᵐ[(A|B)]

This function, then, is our probability function, with the properties

P(AB|C) = P(B|C)·P(A|BC) = P(A|C)·P(B|AC)
P(A|B) + P(Ā|B) = 1

and P(A|B) = 1 for certainty, P(A|B) = 0 for impossibility.

And from those the other sum rule is derivable:

P(A + B|C) = P(A|C) + P(B|C) − P(AB|C)
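For completeness, here’s a sketch of that derivation: write A + B as the denial of ĀB̄ (De Morgan’s law), then apply the negation rule and the product rule:

```latex
\begin{aligned}
P(A+B|C) &= 1 - P(\bar{A}\bar{B}|C) \\
         &= 1 - P(\bar{A}|C)\,P(\bar{B}|\bar{A}C) \\
         &= 1 - P(\bar{A}|C)\,\bigl[1 - P(B|\bar{A}C)\bigr] \\
         &= 1 - P(\bar{A}|C) + P(\bar{A}B|C) \\
         &= P(A|C) + P(B|C) - P(AB|C)
\end{aligned}
```

the last step using P(Ā|C) = 1 − P(A|C) and P(ĀB|C) = P(B|C) − P(AB|C), both of which follow from the two rules above.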

All from our simple desiderata from the beginning of the article. So, as discussed there, any agent that reasons in a way that’s not isomorphic to these rules is necessarily violating one or more of the presented desiderata; and conversely, any agent that follows them has a probability function. Neat, huh?
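And Bayes’ Theorem itself is just the product rule’s symmetry rearranged: since P(A|C)·P(B|AC) = P(B|C)·P(A|BC), dividing through gives P(A|BC) = P(A|C)·P(B|AC)/P(B|C). A numeric sketch with invented numbers:

```python
# Sketch: Bayes' Theorem P(A|B) = P(A) * P(B|A) / P(B) falling out of
# the two factorizations of the product rule, on invented numbers.

p_a = 0.01           # prior P(A), e.g. a rare hypothesis
p_b_given_a = 0.9    # likelihood P(B|A)
p_b_given_not_a = 0.05

# P(B) by the product and sum rules:
p_b = p_a * p_b_given_a + (1 - p_a) * p_b_given_not_a

# Bayes' Theorem:
p_a_given_b = p_a * p_b_given_a / p_b  # roughly 0.154

# Consistency: both factorizations of P(AB) agree.
assert abs(p_a * p_b_given_a - p_b * p_a_given_b) < 1e-12
```

Note how strong evidence (a likelihood ratio of 18) still leaves the rare hypothesis at only about 15% — the prior matters.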
