## Learning Bayes [part 1]

I have a confession to make.

I don’t actually know Bayesian statistics.

Or, any statistics at all, really.

Shocking, I know. But hear me out.

What I know is… Bayesian theory. I can derive Bayes’ Theorem, and I can probably also derive most results from it. I’m a good mathematician. But I haven’t actually spent any time doing practical Bayesian statistics stuff. Very often a friend, like raginrayguns, tells me about a thing, a super cool thing, or asks me a thing, a super confusing thing, and it sort of goes over my head and I have to really do stuff from scratch to figure out my way. I don’t have the heuristics, I don’t have all the techniques of problem-solving perfectly available to me.

For example, earlier today another friend of mine came up to me and asked me a question about the difference between Bayesian and frequentist statistics. Basically, he has a bunch of data about a lot of bridges, and then four pieces of information about each of them: cost, material, length, and astrological sign of the designer of the bridge. He wanted – I had to ask a lot of questions to figure this out because, as I said, I don’t do statistics yet, I don’t know the jargon – to find the posterior distribution for the cost of his projected bridge, given a material, a length, and his astrological sign. Or rather, he wanted the Bayesian answer, because he knew the frequentist one already.

Let me pause this a bit, and talk about another problem.

Suppose someone wants to estimate my height. They have some prior distribution of heights $p(\text{“My height equals } h\text{”}|X)$ (which we’ll call $p(h|X)$ for short) given by, say, the distribution of heights in my country of birth.

Then one comes and says, “We could get the height of his father!” And that’s a good idea, so they find out that the height of my father is F, and find the posterior $p(h|F, X)$. So far, so good.

Then another comes and says, “His name is Pedro! We should get data from Pedros!” So they round up 50 Pedros and…

And you don’t feel like this will be very useful at all, do you? You will be in fact discarding about 198 million people’s worth of information to focus on a group of 50 Pedros that has nothing to do with anything.

But if I am one amongst those 50 Pedros, then well, then it becomes pretty cogent information! They no longer care about those other 198 million people in Brazil, since the actual me is in the actual group of Pedros.

Bayes’ Theorem goes:

$p(h|\text{Pedro}, F,X) = p(h|F,X)\frac{p(\text{Pedro}|h, F, X)}{P(\text{Pedro}|F,X)}$

(You’ll notice I used a lower-case $p$ thrice and an upper-case $P$ once. This is a convention: upper-case means an actual number that is a probability, and lower-case means a probability density function.)

• $p(h|F,X)$ is the prior distribution of heights, a function of $h$.
• $p(\text{Pedro}|h, F, X)$ is the likelihood function of Pedro as a function of $h$.
• $P(\text{Pedro}|F,X)$ is just a normalising constant, the probability that I am called Pedro given that my father has height F.

Presumably $p(\text{Pedro}|h, F, X)$ will in fact be a constant or nearly so, and equal to $P(\text{Pedro}|F,X)$, because we suspect that names have only spurious correlations with height, if any at all. If that’s the case, then presumably the posterior and the prior are exactly equal, because the likelihood “function” cancels the normalising constant out.
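To make the cancellation concrete, here is a minimal sketch of Bayes’ Theorem on a grid of heights. Everything numeric in it is an assumption for illustration: the normal prior, the grid in centimetres, and both likelihoods are made up, and `posterior` is just a hypothetical helper name.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

# Hypothetical grid of heights in cm; illustrative numbers, not real data.
heights = list(range(140, 211))

# Assumed prior p(h|F,X): a discretised normal, normalised over the grid.
raw_prior = [normal_pdf(h, 175.0, 7.0) for h in heights]
total = sum(raw_prior)
prior = [w / total for w in raw_prior]

def posterior(prior, likelihood):
    """Bayes on a grid: prior times likelihood, divided by the normalising constant."""
    unnormalised = [p * l for p, l in zip(prior, likelihood)]
    evidence = sum(unnormalised)  # plays the role of P(Pedro|F,X)
    return [u / evidence for u in unnormalised]

# Case 1: the name says nothing about height, so the likelihood is flat.
flat_likelihood = [0.01 for _ in heights]
post_flat = posterior(prior, flat_likelihood)
max_shift_flat = max(abs(a - b) for a, b in zip(post_flat, prior))

# Case 2: a made-up likelihood that drifts slightly with height.
tilted_likelihood = [0.01 * (1 + 0.002 * (h - 175)) for h in heights]
post_tilted = posterior(prior, tilted_likelihood)
max_shift_tilted = max(abs(a - b) for a, b in zip(post_tilted, prior))

print("largest change with a flat likelihood:  ", max_shift_flat)
print("largest change with a tilted likelihood:", max_shift_tilted)
```

With the flat likelihood the posterior matches the prior up to floating-point noise; any dependence on $h$, however slight, moves it.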

But you know it won’t cancel perfectly, especially with a number as small as only 50 people, and you will in fact be throwing information out with that.

Unless I am in the measured group of Pedros, in which case yes we’re throwing information out, but only useless information, and we want to use Bayes’ Theorem just like that.

So in the case where I am in the group, we want the posterior to look just like what Bayes says; but in the case where I’m not, we want the posterior to look exactly the same as the prior.

And of course, the consistency desideratum says that any two ways of calculating a quantity ought to give you the same result. So, suppose we invent a proposition $N : \text{The name is relevant to the distribution}$ (and let’s rename the proposition “His name is Pedro” to just P). We can rewrite our posterior as:

$$\begin{aligned}p(h|P, F, X) &= p(h|N, P, F, X)P(N|P, F, X) \\ &+ p(h|\bar N,P,F,X)P(\bar N|P,F,X)\end{aligned}$$

Now, $p(h|N,P,F,X)$ is the posterior distribution you get if you take the sample from the group of Pedros, while $p(h|\bar N,P,F,X)$ is just the prior distribution. Then, intuitively, $P(N|P,F,X)$ is the probability that I am in the group of Pedros.

This looks like what I discussed above, intuitively: if I am in that group of 50 Pedros, then I want to take that sample; if I’m not, then I don’t. And by consistency, this result ought to be equal to the result when applying Bayes. Why isn’t it?

I in fact made a mistake in the analysis of the result of Bayes’ Theorem.

$p(\text{Pedro}|h,F,X)$ is in fact a relevant likelihood function (and it’s most likely a constant). And it’s not given by my sample of 50 Pedros. Rather, $p(\text{Pedro}|h,F,X)$ has to include information from every single Pedro from Brazil. The distribution given by that sample of 50 Pedros is rather $p(\text{“He is in this specific group of 50 Pedros”}|h,F,X)$, which is zero if I’m not in that group! I was applying conditionalisation wrong, because I was conditionalising on the wrong thing!

What’s the moral of this story?

First, pay attention to what you’re conditionalising on. Second, make sure your target-variable is actually affected by the others.

Back to the bridge example. Suppose his bridge is made of wood. Should he try to find $p(\text{cost}|\text{wood})$ by getting the cost distribution only from bridges made of wood from his prior database?

How about $p(\text{cost}|\text{Sagittarius})$? Should he try to find it by getting only the data of bridges built by people of Sagittarius?

If in the former case you felt unsure or tempted to do so, I’m certain in the latter case you’d accuse me of throwing perfectly valid and useful information down the drain, with no gain whatsoever.

Unless, of course, Omega had inserted the future data from my friend’s own bridge into his database. Then, yeah, using information only from people with the same astrological sign as his would actually improve his info.

His bridge is not in that database. You cannot actually just conditionalise on the material he’s going to use, or on his astrological sign, because that information does not translate well to the cost of a bridge that hasn’t been built yet. Since his not-yet-built bridge is not in the database, he’s actually making an educated guess based on past projects.

His posterior distribution, when conditionalised on (say) the fact that his bridge is made of wood, is a weighted average of the prior distribution over all bridges and the posterior distribution over all wooden bridges, where the weight is given by the probability that his project will be “similar enough” to other wooden projects.
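Here is a minimal sketch of that weighted average over a handful of cost buckets. The buckets, both distributions, and the 0.8 “similar enough” weight are all made-up assumptions, and `mix` is a hypothetical helper, not anything from my friend’s actual analysis.

```python
def mix(weight, specific, general):
    """Weighted average of two discrete distributions over the same outcomes."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must be a probability")
    return {k: weight * specific[k] + (1 - weight) * general[k] for k in general}

# Hypothetical cost distributions (illustrative numbers only).
all_bridges = {"cheap": 0.3, "medium": 0.5, "expensive": 0.2}
wooden_bridges = {"cheap": 0.6, "medium": 0.3, "expensive": 0.1}

# Assumed probability that his project is "similar enough" to past wooden ones.
similar = 0.8
estimate = mix(similar, wooden_bridges, all_bridges)

for bucket, prob in estimate.items():
    print(f"{bucket:>9}: {prob:.2f}")
```

Setting the weight to 1 recovers the wooden-bridges distribution, and setting it to 0 recovers the distribution over all bridges, which matches the two extreme cases in the Pedro story.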

At least, that’s what it looks like to me.

## Bayesian falsification and the strength of a hypothesis

At the end of my post about other ways of looking at probability, I showed you a graph of evidence against probability. This is the relevant graph:

Looking at this graph was one of the most useful things I’ve ever done as a Bayesian. It shows, as I explained, exactly where most of the difficulty is in proving things, in coming up with hypotheses, etc. Another interesting aspect is the symmetry, showing that, to a Bayesian, confirmation and disconfirmation are pretty much the same thing. Probability doesn’t have a prejudice for or against any hypothesis, you just gather evidence and slide along the graph. Naïvely, the concept of Popper falsification doesn’t look terribly useful or relevant or particularly true.
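For concreteness, here is a minimal sketch of the quantity such a graph plots, assuming “evidence” means the log-odds in decibels that Jaynes uses, $e = 10\log_{10}\frac{p}{1-p}$. That reading is an assumption on my part, and the function names are purely illustrative.

```python
import math

def evidence_db(p):
    """Evidence in decibels: 10 * log10 of the odds p / (1 - p)."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    return 10 * math.log10(p / (1 - p))

def probability_from_db(e):
    """Inverse map: turn evidence in decibels back into a probability."""
    odds = 10 ** (e / 10)
    return odds / (1 + odds)

# The table is antisymmetric around p = 0.5: evidence for A is minus evidence for not-A.
for p in (0.001, 0.01, 0.1, 0.5, 0.9, 0.99, 0.999):
    print(f"p = {p:<6} evidence = {evidence_db(p):7.2f} dB")
```

Under this assumption, each extra “nine” of certainty costs roughly another 10 dB in either direction, which is where the symmetry between confirmation and disconfirmation shows up.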

So whence comes the success of the idea?

## How and when to respect authority

When I discussed the usefulness (or lack thereof) of Aumann’s Agreement Theorem, I mentioned that the next best thing to sharing the actual knowledge you gathered (or mind melding) was sharing likelihood ratios.

But sometimes… you can’t. Well, most of the time, really. Or all the time. Humans do not actually have little magical plausibility fluids in their heads that flow between hypotheses and are kept track of dutifully by some internal Probability Inspector, just like humans do not actually have utility functions. If a Bayesian tells you that they believe a thing with 40% probability… either they’re crazy, or they’re Omega, or they’re giving you a ballpark estimate of subjective feelings of uncertainty.

And then there’s the time when your fellow rationalist… is not actually someone you know. They might be a friend of a friend, or a famous scientist, or just the abstract entity of Science.

## How to prove stuff

A while ago, I wrote up a post that explained what a mathematical proof is. In short, a mathematical proof is a bunch of sentences that follow from other sentences. And when mathematicians have been trying to prove stuff for hundreds of years, well, we’re bound to get fairly good at it. And to develop techniques.

So, then. Given any theory (that is, a set of logical sentences) $\mathcal T$, when a sentence S is a theorem (that is, it can be proven from the theory), we write $\mathcal T\vdash S$. And if we want to prove a thing, it may not be the case that we actually know that the thing is true. Sometimes we just have an intuition that it may be true. Or maybe we know it’s true because some other mathematician has told us it’s true, but we don’t see how. So we need to find a way to do it.

## Agreements, disagreements, and likelihood ratios

The LessWrong community holds, as a sort of deeply ingrained instinct/rule, that we should never “agree to disagree” about factual matters. The map is not the territory, and if we disagree about the territory, that means at least one of our maps is incorrect. You will also see us citing this mysterious “Aumann’s Agreement Theorem.”

I wish to explain this. Aumann’s theorem says, broadly speaking, that two rational agents that share the same priors and whose posteriors are *common knowledge* cannot agree to disagree on any factual matter. You’ll notice that I ominously italicised the words “common knowledge.” There is a good reason for that.

Common knowledge is a much stricter condition than it sounds. Suppose you and I are reasoning about some proposition A. My background knowledge is given by $\mathcal X$ and yours is given by $\mathcal Y$, and we have that $P(A|\mathcal X) = p$ and $P(A|\mathcal Y) = q$. A proposition C is called common knowledge of two agents with respect to some other proposition A if:

1. C implies that you and I both know C.
2. I would have assigned probability $p$ to A no matter what I saw in addition to C.
3. You would have assigned probability $q$ to A no matter what you saw in addition to C.

…this doesn’t sound very useful. When it’s put that way, it’s pretty clear that the theorem is true. I mean, that’s basically saying that, for any proposition $E \in \mathcal X$, I would have that $P(A|CE) = P(A|C) = p$ and something similar would go for you. Since we’re both rational agents, that’d mean that $p = q = P(A|C)$.


## Bayes’ Theorem

Bayes’ Theorem has many, many introductions online already. Those show the intuition behind using the theorem. This is going to be a step-by-step mathematical derivation of the theorem, as Jaynes explained it in his book Probability Theory: The Logic of Science. However, he himself has skipped a bunch of steps, and not always made his reasoning as clear as possible, so what I’m going to do here is elaborate, expand, and explain his steps.

The maths can be quite complex, but I think anyone can follow the ideas. But still, maths cw! So, let’s go, shall we?

The Desiderata

What we want is to create a way to measure our uncertainty of propositions. Or rather, to measure their plausibility. We want to know exactly how sure we are that something is true. We won’t give many constraints, though. We’re trying to be minimal in our axioms here. So the first desideratum is

• Degrees of plausibility are represented by real numbers.

We’re not going to say anything about any upper or lower bounds. We don’t know yet which real numbers should represent certainty. All we know is that real numbers must be used to represent our plausibility.

We will adopt a convention that says that a greater plausibility will be represented by a greater number. This isn’t necessary, of course, but it’s easier on the eye. We shall also suppose continuity, which is to say that infinitesimal increases in our plausibility should yield infinitesimal increases in its number.

And even this axiom is not incredibly intuitive. Sometimes you don’t even have any idea of how plausible a thing is. However, we’re trying to design an optimal method of reasoning, and I think this is a reasonable thing to expect. You frequently have to make decisions based on incomplete information, and there is some meaningful sense in which you think some states-of-the-world are more or less plausible than others, more or less likely to happen. So it’s that meaningful sense we’re trying to capture here.

• Qualitative correspondence with common sense

This is a sort of catch-all axiom, and it’s very important. It’s the axiom that says that the meaning of $(A|B)$ is “the plausibility of A, given that B is true.” The argument this proof will try to make is that certain things are desirable of an agent’s reasoning process, and that at the end, we’ll arrive at certain rules. Even if these desirable things can have multiple interpretations, we’re taking a diachronic interpretation – one that says that the proper meaning of things like $(A|B)$ is that B was observed in the past, so after this happened, the new plausibility for A is that. Under the diachronic interpretation, we do prove those rules, which means that violating those rules implies that you violated some of the desiderata.

For instance, we should expect that if we observe evidence in favour of something, that something should be more plausible, and vice-versa; that is, if I have that some event B makes A more likely, then my plausibility $(A|B) > (A)$ and also $(\bar A|B) < (\bar A)$.
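As a sanity check on what this desideratum asks for, here is a small sketch that borrows ordinary probabilities as the plausibility measure. That is an assumption the derivation has not yet justified at this point, and the joint distribution over A and B is made up.

```python
from fractions import Fraction

# Made-up joint distribution over (A, B); exact fractions avoid rounding noise.
joint = {
    (True, True): Fraction(3, 10),
    (True, False): Fraction(2, 10),
    (False, True): Fraction(1, 10),
    (False, False): Fraction(4, 10),
}

def marginal_a(a):
    """Plausibility of A (or not-A) before seeing B."""
    return sum(p for (x, _), p in joint.items() if x == a)

def conditional_a_given_b(a):
    """Plausibility of A (or not-A) after B has been observed."""
    p_b = sum(p for (_, y), p in joint.items() if y)
    return joint[(a, True)] / p_b

p_a, p_a_given_b = marginal_a(True), conditional_a_given_b(True)
p_not_a, p_not_a_given_b = marginal_a(False), conditional_a_given_b(False)

print("(A)     =", p_a, "   (A|B)     =", p_a_given_b)
print("(not A) =", p_not_a, "   (not A|B) =", p_not_a_given_b)
```

In this toy case B raises the plausibility of A and lowers that of its negation by the same observation, which is the qualitative behaviour the desideratum demands.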

So the way this axiom works is: sometimes I will invoke it, and justify it on something one would expect to be fairly reasonable diachronic assumptions for belief-updating. At the end, I will show that these assumptions pin the rules of probability down uniquely, and any agent that reasons in a way that’s not isomorphic to these rules will therefore necessarily be violating at least one of these assumptions.

• Consistency

This is in fact a stronger claim than it looks. This system of measuring probability will have a bunch of properties which we label collectively “consistency,” namely the fact that two ways of arriving at a result should give the same result, every bit of information should be taken into account, and equivalent states of knowledge are represented by the same numbers.

An important point about this is that this assumption is about states of knowledge and not logical status. It may very well be that two propositions are logically equivalent or otherwise connected, but an agent is only constrained by that if they know about this logical link (as I discuss here and here).

And now, believe it or not… we’re done. This is enough for us to find Bayes’ Theorem.

## MIRI paper on logical uncertainty

I was talking to raginrayguns again and he mentioned that a month and a half ago, Paul Christiano wrote a paper exactly on the subject of logical uncertainty. While I haven’t finished reading it yet, I’ll post it here because it’s relevant.

Here you go.

And I forgot to mention one thing in the last post which is relevant. Gaifman, in his paper, states that if in $\mathcal A$ we have that $A\rightarrow B$ then $P(B|\mathcal A) \geq P(A|\mathcal A)$. I’ll quickly show that that’s a theorem of my approach, and, indeed, any similar approach.

Suppose that $\text{“}A\rightarrow B\text{”} \in X$. In that case, then, my approach has that $P(B|AX) = 1$, because the agent knows B is logically implied by A. If that’s the case, then:

$$\begin{aligned} P(B|X) &= P(B|AX)P(A|X)+P(B|\bar AX)P(\bar A|X) \\ &=P(A|X)+P(B|\bar AX)P(\bar A|X)\\ &\geq P(A|X)\end{aligned}$$

With equality if and only if either $P(A|X) = 1$ (A is logically certain given X) or $P(B|\bar AX) = 0$ (B is also impossible when A is false, which means it’s logically equivalent to A). So maybe this was fairly obvious to you, but if it wasn’t, now you have that proof in your background list of proofs and theorems!
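If you’d rather see numbers than symbols, here is a minimal numeric check of the inequality and its two equality cases. It assumes, as above, that $P(B|AX) = 1$; the random inputs and the helper name `prob_b` are illustrative only.

```python
import random

def prob_b(p_a, p_b_given_not_a):
    """P(B|X) by the sum rule, assuming P(B|AX) = 1 because A -> B is known."""
    return 1.0 * p_a + p_b_given_not_a * (1 - p_a)

# Spot-check the inequality P(B|X) >= P(A|X) on random inputs.
random.seed(0)
for _ in range(10_000):
    p_a = random.random()
    p_b_given_not_a = random.random()
    assert prob_b(p_a, p_b_given_not_a) >= p_a

# The two equality cases from the text.
certain_a = prob_b(1.0, 0.7)      # P(A|X) = 1
b_needs_a = prob_b(0.3, 0.0)      # P(B|not-A, X) = 0
print(certain_a, b_needs_a)
```

A random spot-check is not a proof, of course; the three-line derivation above is what does the real work.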


## Logical Uncertainty

In the past while, I’ve been talking to a friend about logical uncertainty. Specifically, how do we deal with the fact that we’re not logically omniscient? Usually, when $A \rightarrow B$, we have that $P(B|A) = 1$. But what if we don’t know that $A \rightarrow B$? What if we only have a few hints, some clues, nothing fixed, nothing concrete yet? How do we cope with our limits?

Benja Fallenstein wrote a post that touches upon this. In it, what he did was define a sort of finite “logical universe” with many “impossible possible logical worlds.” The gist of it is that he takes a huge finite list of worlds (where a world is a conjunction of sentences) that’s not contradicted by a much huger but still finite list of theorems, and then those are his impossible possible logical worlds. He then distributes probability uniformly over that list and his decision agent does utility maximisation on it.
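To give a feel for the construction, here is a toy sketch of my reading of it, not Benja’s actual definitions: three atomic sentences, one known theorem, worlds as truth assignments, and a uniform distribution over the worlds the theorem doesn’t rule out. The sentence names and the single theorem are assumptions chosen for illustration.

```python
from fractions import Fraction
from itertools import product

# Hypothetical atomic sentences the agent reasons about.
sentences = ["A", "B", "C"]

# Theorems the agent has actually proven, each written as a check on a world.
known_theorems = [
    lambda w: (not w["A"]) or w["B"],  # A -> B
]

# A world assigns a truth value to every sentence.
worlds = [
    dict(zip(sentences, values))
    for values in product([True, False], repeat=len(sentences))
]

# Keep only the worlds that no known theorem contradicts.
surviving = [w for w in worlds if all(t(w) for t in known_theorems)]

def probability(event):
    """Uniform probability of an event over the surviving worlds."""
    return Fraction(sum(1 for w in surviving if event(w)), len(surviving))

print("worlds kept:", len(surviving), "of", len(worlds))
print("P(A) =", probability(lambda w: w["A"]))
print("P(B) =", probability(lambda w: w["B"]))
print("P(C) =", probability(lambda w: w["C"]))
```

In this toy, a world that an unproven theorem would rule out still gets weight, which is what makes it an “impossible possible” world; proving more theorems shrinks the list.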

Based on that post, Manfred wrote a post and then a Sequence where he tried to make this idea somewhat better defined. He tries to model the limited agent as an unlimited agent with limited inferences. That is, an agent that can only do a limited number of logical inferences (which will then obviously have probability $1$ or $0$), and then has a maximum entropy prior over every other logical sentence. And he says that the way he’s built things, we’re violating the desiderata of probability but…

We’re not? We’re really, really not. See, there is in fact no desideratum that needs logical omniscience, no step in the proof of Bayes’ Theorem that requires that. In fact, reasoning about logical statements is totally consistent with all of our desiderata!

But we’ll get to that in a bit.

## Axioms, logical pinpointing, natural numbers, and mathematical realism

One of the greatest problems people who are getting into hardcore maths face, one of the blocks, is how arbitrary the axioms look. Axioms are the fundamental truths of a mathematical system, the things we do not prove. Whenever people are introduced to axioms, though, the most common feeling is that we should be able to somehow prove all of them and end up with no unproved statements, no fundamental truths.

Now here’s why this is wrong.

What’s a natural number? Depending on your level of mathematical sophistication, you might not even remember that there’s such a difference. There are numbers that are natural? What are those? And the answer is that the natural numbers are… well, the natural ones. The ones we first think of, as a species, when we think of numbers; the ones that are sort of already in our brain when we’re born.

They’re $1, 2, 3, 4...$ You know, the counting numbers.

Let’s add $0$ to the mix because $0$ is a Very Important Number.

Now, suppose you have a baby mathematician AI. It’s super intelligent, but it has no knowledge. You can communicate with it, to an extent, but it doesn’t know numbers. If you show it scribbles like $0, 1, 2, 3$, it will just stare at you blankly. If you try to point to apples on a table it will tilt its head and think you’re crazy. How do you explain to this super mathematician what numbers actually are, beyond your innate intuition?

Axioms are logical sentences. But they’re not just any logical sentences: they’re logical sentences that pinpoint certain concepts from conceptspace. What does this mean?

Picture some imaginary space where all possible ideas and concepts lie. It’s vastly huge, it’s really unimaginable, but you know that somewhere there you’ll find the natural numbers. They’re a concept, right? So they must be in there somewhere. Now, how do you find the natural numbers there? How do you pinpoint them?

And that’s what axioms do. Axioms are sentences that are true of the things you’re trying to find, and of nothing else. That is, if a thing is a natural number, then the Natural Number Axioms are true of it; if a thing is not a natural number, then at least one of those axioms isn’t. So what are the Natural Number Axioms? What do you tell a baby mathematician AI so that it can find them in conceptspace?
