Truth, Probability, and Unachievable Consistency

What is truth?

Many an author has written a lengthy philosophical treatise beginning with exactly this question; but, however shaky my identification with the group may be, as a rationalist my first and foremost answer to it – or my first and foremost interpretation of it – is a practical one. And to help with that, let’s first ask instead how we identify truth.

Whatever metaphysical definition you might be going with, whether you’re a Tegmark-Level-IV mathematical realist or some form of solipsist, if you ask someone the colour of the sky on a clear sunny day, they’d probably answer that it’s “blue” – unless, of course, you opened with a big preamble about “what truth is” or otherwise made the conversation seem like anything more than a simple question, in which case they might go off on a philosophical tangent. But for the purposes of this post, let’s assume that’s not happening, and that they’ll just answer “blue.” If that’s a problem, ask them another question, one that isn’t so obviously tied up with philosophical conundrums, such as “Who is the current President of the United States?” (it’s Barack Obama) or “What’s Brazil’s official language?” (it’s Portuguese).

Regardless of anything else, to a broad first approximation and for all intents and purposes that come up in daily life, we can say that the above answers are true. It’s true that, today, Obama is President of the United States, that we speak Portuguese in Brazil, and that the sky is blue on a clear sunny day. It’s true that if you walk off a cliff you will fall to your death. So far so good: I am fairly certain there is nothing controversial about these claims.

In practice, we recognise truth in a somewhat positivist way. Which is not to take the extreme position that only empirical claims are “cognitively meaningful”; I’m a moral non-realist, yet I see meaning in sentences such as “it is (ceteris paribus) wrong to murder people” even if there is no clear or direct empirical verification of the “wrongness” predicate (I mostly see predicates like “wrong” as two-place predicates, one place taken by the action itself and the other by a given moral theory).

But in any case, as I was saying, in practice we use the concept of truth in a positivist way. We propose truth based on evidence, and we defend truth based on expectation. A hypothesis is true if, of all mutually exclusive hypotheses, it is the one that leads us to expect reality: if it gives us predictions that turn out to be correct, if it assigns the most probability to what actually happened and will actually happen, if believing it causes you to be less surprised by what you see than you would otherwise be.

This may not be an immediately intuitive definition of “truth.” It’s almost certainly not the first thing most people think of when they think of truth. But I think it’s a reasonable description. If, of literally all possible hypotheses, you have one that predicts your observations best, then that’s probably the one you use.

Except… not quite? Let’s talk probability (of course).

Continue reading

Posted in Mathematics, Probability Theory, Rationality | 3 Comments

What’s logical coherence for anyway?

Time for a writeup! Or something.

So I’ve written before about Logical Uncertainty in a very vague way. And a few weeks ago I wrote about a specific problem of Logical Uncertainty which was presented in the MIRI workshop. I’m gonna reference definitions and results from that post here, too, though I’ll redefine:

Definition 1 (Coherence). A distribution \mathbb P over a set of sentences is coherent if:

  • Normalisation: \mathbb P(\top) = 1
  • Additivity: \forall \varphi,\psi:\mathbb P(\varphi) = \mathbb P(\varphi\land\psi)+\mathbb P(\varphi\land\neg\psi)
  • Non-negativity: \forall\varphi:\mathbb P(\varphi)\geq 0
  • Consistency: \forall\varphi,\psi:\varphi\equiv\psi\rightarrow\mathbb P(\varphi) = \mathbb P(\psi)

The \top symbol stands for a tautology (e.g. 0 = 0), and if \bot stands for a contradiction (e.g. 0 = 1) then normalisation and additivity imply \mathbb P(\bot) = 0, and the whole definition implies \forall\varphi:0\leq \mathbb P(\varphi) \leq 1.

TL;DR: A distribution \mathbb P is coherent if it’s an actual probability distribution and obeys the basic laws of probability (to perform inference you’d define \mathbb P(\varphi|\psi) = \frac{\mathbb P(\varphi\land\psi)}{\mathbb P(\psi)}).
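As a quick illustration (a toy of my own, not from the post): over a propositional language with just two atoms, any weighting of the four “possible worlds” induces a coherent \mathbb P, and the four conditions can be checked mechanically.

```python
from fractions import Fraction
from itertools import product

# Toy illustration (not from the post): sentences are Boolean functions of two
# atoms (a, b). Weighting the four truth assignments induces a distribution
# over sentences satisfying all four coherence conditions.
worlds = list(product([False, True], repeat=2))
weights = dict(zip(worlds, [Fraction(1, 10), Fraction(2, 10),
                            Fraction(3, 10), Fraction(4, 10)]))

def P(sentence):
    """Probability of a sentence = total weight of the worlds where it holds."""
    return sum(weights[w] for w in worlds if sentence(*w))

top = lambda a, b: True   # a tautology
phi = lambda a, b: a
psi = lambda a, b: b

assert P(top) == 1                                                       # normalisation
assert P(phi) == P(lambda a, b: a and b) + P(lambda a, b: a and not b)   # additivity
assert P(phi) >= 0                                                       # non-negativity
assert P(lambda a, b: not (not a or not b)) == P(lambda a, b: a and b)   # consistency
print(P(phi))  # → 7/10
```

Of course, the interesting (and hard) case is a distribution over *all* sentences of arithmetic, where no such finite table of worlds exists.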

Continue reading

Posted in Computer Science, Logic, Mathematics, Philosophy, Probability Theory | Leave a comment

The Gaifman Condition and the Π1-Π2 problem

So I’m at a MIRI workshop on Logical Uncertainty, and I’m gonna make a more complete post about it later, but I wanted to talk about a thing that has been on my mind.

So we’re trying to build a probability distribution \mathbb P over logical sentences. Let’s get a few definitions out of the way:

Definition 1 (Coherence). \mathbb P is a coherent distribution if \mathbb P(\phi) = 1 for all proven \phi and \forall \phi,\psi. \mathbb P(\phi) = \mathbb P(\phi \land \psi) + \mathbb P(\phi \land \neg \psi).

Definition 2 (Computable approximability). \mathbb P is said to be computably approximable if there exists some computable function f(x,n) such that \lim\limits_{n\rightarrow +\infty}f('\phi', n) = \mathbb P(\phi).

That is, I can compute successive approximations to \mathbb P(\phi). This is an interesting definition because any coherent probability distribution is necessarily uncomputable, so computable approximations are the best we can hope for.

Definition 3 (Arithmetical hierarchy). A predicate \phi is \Pi_0 and \Sigma_0 if it contains no quantifiers or only bounded quantifiers. It is \Pi_n (resp. \Sigma_n) if it’s of the form \forall x_0, x_1, x_2, ... \psi(x_0, x_1, x_2, ...) (resp. \exists x_0, x_1, x_2, ... \psi(x_0, x_1, x_2, ...)) where \psi is \Sigma_{n-1} (resp. \Pi_{n-1}).

So, roughly, a formula’s position in the arithmetical hierarchy counts how many times its quantifier prefix alternates between blocks of universal and existential quantifiers.
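To make the levels concrete, here are two standard examples (my own illustrations, not from the workshop), written informally:

```latex
% Goldbach's conjecture is \Pi_1: one unbounded \forall, everything inside bounded by n.
\forall n.\ \big( (n > 2 \land 2 \mid n) \rightarrow
    \exists p, q \le n.\ \mathrm{prime}(p) \land \mathrm{prime}(q) \land p + q = n \big)

% The twin prime conjecture is \Pi_2: an unbounded \exists inside an unbounded \forall.
\forall n.\ \exists p > n.\ \mathrm{prime}(p) \land \mathrm{prime}(p + 2)
```

So a condition that pins down all true \Pi_1 sentences covers statements like the first, but says nothing directly about statements like the second.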

Definition 4 (Gaifman Condition). \mathbb P is said to be Gaifman if \mathbb P(\phi) = 1 for all true \Pi_1 sentences \phi.

This is an interesting condition for a logical probability distribution to obey, but unfortunately…

Theorem 1. No coherent, computably approximable, Gaifman distribution \mathbb P gives nonzero probability to all true \Pi_2 sentences.

Or, in other words, if \mathbb P has the above properties, then there exists some true \Pi_2 sentence \phi with \mathbb P(\phi) = 0. To prove this, we’ll use the following lemma:

Lemma. If \mathbb P is coherent and Gaifman, then all false \Pi_2 sentences have probability 0.

Proof. Suppose \phi is a false \Pi_2 sentence. Then its negation is true, and it’s of the form \exists x_1, x_2, x_3... \psi(x_1, x_2, x_3, ...) where \psi is a \Pi_1 predicate. Take c = (c_1, c_2, c_3, ...) to be a tuple of values that makes \psi true. Then \psi( c) is a true \Pi_1 sentence, and by the Gaifman Condition it has probability 1. But \psi( c) provably implies \neg\phi, so coherence forces \mathbb P(\neg\phi) = 1, and hence \mathbb P(\phi) = 0. □

So now we’re ready to prove the theorem.

Proof of the Theorem. Assume that \mathbb P(\phi) > 0 for all true \Pi_2 sentences \phi. The lemma implies then that for all \Pi_2 sentences, \phi \leftrightarrow \mathbb P(\phi) > 0. However, computable approximability says that \mathbb P(\phi) > 0 \leftrightarrow \lim\limits_{n\rightarrow +\infty} f('\phi', n) >0, and this implies that \exists b,n_0. \forall n > n_0. f('\phi', n) > \frac 1 {2^b}. In other words:

\phi \leftrightarrow \exists b,n_0. \forall n > n_0. f('\phi', n)>\frac 1 {2^b}

Now, \phi is \Pi_2 and the formula on the right-hand side is \Sigma_2; but for all n > 0, not every \Pi_n formula is equivalent to a \Sigma_n formula, so we have reached a contradiction.

A clearer/more convincing proof is by diagonalisation. Define:

\phi : \leftrightarrow \forall b, n_0. \exists n > n_0. f('\phi', n) < \frac 1 {2^b}

The right-hand side is a \Pi_2 sentence, and if it holds then the approximations dip below every \frac 1 {2^b} infinitely often, so \lim\limits_{n\rightarrow +\infty} f('\phi', n) = 0 and thus \mathbb P(\phi) = 0 by computable approximability. The above is therefore equivalent to \phi \leftrightarrow \mathbb P(\phi) = 0. Now, if \phi were false, the lemma would force \mathbb P(\phi) = 0, which would make \phi true after all; so \phi is true, and hence \mathbb P(\phi) = 0: a true \Pi_2 sentence with probability zero. □

So we’re kinda screwed here. But then I thought of a weaker form of computable approximability:

Definition 5 (Computable \varepsilon-Approximability). \mathbb P is said to be computably \varepsilon-approximable if there exists some computable function f(x,n) such that |\lim\limits_{n\rightarrow +\infty}f('\phi', n) - \mathbb P(\phi)| < \varepsilon.

This definition seems to me to escape the two proofs of the \Pi_1-\Pi_2 problem.



raginrayguns talked about a sentence similar to the following and Benja Fallenstein proved the theorem:

Theorem 2. Let \mathbb P be coherent, computably \varepsilon-approximable, and Gaifman. Then there exist true \Pi_2 sentences that have probability less than 2\varepsilon.

Proof. We’ll build one such sentence:

\phi : \leftrightarrow \forall n_0.\exists n > n_0. f('\phi', n) < \varepsilon

If \phi is false, then f('\phi', n) \geq \varepsilon for all sufficiently large n, so \lim\limits_{n\rightarrow +\infty} f('\phi', n) \geq \varepsilon, and \varepsilon-approximability gives \mathbb P(\phi) > 0. But by the lemma a false \Pi_2 sentence must have probability 0, so \phi is in fact true.

And since \phi is true, \forall n_0. \exists n > n_0. f('\phi', n) < \varepsilon, so \lim\limits_{n\rightarrow +\infty} f('\phi', n) \leq \varepsilon, and \varepsilon-approximability gives \mathbb P(\phi) < 2\varepsilon. □

So we didn’t escape the problem after all; we just moved it into the bound.

Posted in Logic, Mathematics, Probability Theory | 9 Comments

The Metaphor of the Hills

Imagine that one day you’re standing on a very foggy field of grassy hills. It’s so foggy you can’t see even an inch in front of your nose, and what’s worse, you have no GPS or 4G reception. You decide to try to climb up one of those hills; maybe if you’re higher up you’ll be able to see something, or at least get some GPS signal.

Now, you can’t really see anything, so to start out you have to sort of fumble in the fog. You take a tentative step forward, and that doesn’t change anything. You take another, and you actually get to ground that’s a bit lower than where you were, so you take a step back. You try going left, and there the ground’s higher. On you go, fumbling around, trying to find somewhere with some at least minimal GPS reception.

Eventually you reach the top of one hill. There’s some GPS reception there; it’s not great, it’s in fact pretty crappy, but at least it’s there, unlike down where you started out. Except… what if there’s a higher hill somewhere nearby? You only found the one hill, and the only way to find another would be going back down again, and in this fog there are no guarantees that you’ll even find your way back up the same hill you just climbed; you might get lost and end up on a lower hill, and that’d be pretty awful.

So, in sum, you’re faced with two choices: you can either continue exploring the landscape, or you can try to exploit what you have found. Maybe Google Maps will load in less than an hour, who knows?

You might be thinking now that this is a metaphor for something, and of course you’d be right. I mean, what would be the point of me even mentioning imaginary hill fields? How did you even get there in the first place? And where’d the fog come from? And what earthly carrier are you even using that can’t give you reception in the middle of an open field? Also there’s the title of the post which should be a dead giveaway. So, yeah, metaphor.
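For the technically inclined: the fumbling procedure is stochastic hill climbing, and the dilemma at the end is the classic exploration–exploitation tradeoff of local search. A toy sketch (the terrain and all numbers are invented):

```python
import random

# Two grassy hills: a lower one around x = 2 (height 3) and a higher one
# around x = 7 (height 5). The landscape is made up for illustration.
def height(x):
    return max(0.0, 3 - (x - 2) ** 2) + max(0.0, 5 - (x - 7) ** 2)

def climb(x, steps=1000, step_size=0.1):
    """Fumble in the fog: try a random step, keep it only if it goes uphill."""
    for _ in range(steps):
        candidate = x + random.choice([-step_size, step_size])
        if height(candidate) > height(x):  # pure exploitation, no exploration
            x = candidate
    return x

random.seed(0)
summit = climb(1.0)              # start on the slope of the lower hill
print(round(height(summit), 1))  # → 3.0: stuck at the lower summit, never finds the 5.0 one
```

A purely greedy climber can never cross the valley between the hills, which is exactly the predicament described above.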

Continue reading

Posted in Computer Science, Intuitive Mathematics, Mathematics, Philosophy, Rationality | 2 Comments

An anti-conjunction fallacy, and why I’m a Singularitarian

When anyone talks about the possibility or probability of the creation/existence of a UFAI, there are many failure modes into which lots of people fall. One of them is the logical fallacy of generalisation from fictional evidence, where people think up instances of AI in fiction and use those as an argument. Another is the tendency to propose solutions to a problem faster the harder it is, without spending even five minutes actually thinking about it. The absurdity heuristic makes an appearance, too.

But someone who’s familiar with LW or the whole cognitive biases shizzaz might be a bit cleverer and argue that most futurists get it wrong and predicting the future is actually really hard (conjunction fallacy). Ozy wrote a post about donating to MIRI in which zie points this out, but in the end mentions talking to, well, yours truly about it, and I think overall there are three points where I disagree with zir.

First, I propose the existence of a fallacy related to the conjunction fallacy and the sophisticated arguer effect, something I’ll call the Anti-Conjunction Fallacy, or perhaps the Disjunction Fallacy, or something. Maybe this is not a direct countercounterargument to Ozy’s point, but it’s a more general countercounterargument to the counterargument that “predicting AIs typically invokes a highly complex narrative with a high Complexity Penalty.”

The Conjunction Fallacy is a fancy name for the way people sometimes judge P(A\land B) > P(A): a more complex proposition, with more details, seems more probable than a simpler one because it appeals to our sense of narrative. This is a fallacy because probability theory guarantees the opposite: no matter what A and B are, it is always the case that P(A\land B) \leq P(A). But conversely, we have that P(A\lor B)\geq P(A); a disjunctive story is at least as likely as any of its components.
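Both inequalities are theorems, but they’re also easy to sanity-check with a throwaway simulation (the events and probabilities below are arbitrary, invented purely for illustration):

```python
import random

# Estimate P(A), P(A and B), P(A or B) for two arbitrary correlated events.
random.seed(42)

def sample():
    a = random.random() < 0.6                                   # event A
    b = a if random.random() < 0.3 else random.random() < 0.5   # event B, correlated with A
    return a, b

n = 100_000
trials = [sample() for _ in range(n)]
p_a   = sum(1 for a, _ in trials if a) / n
p_and = sum(1 for a, b in trials if a and b) / n
p_or  = sum(1 for a, b in trials if a or b) / n

# The conjunction can never beat, and the disjunction can never lose to, A alone:
print(p_and <= p_a <= p_or)  # → True
```

The chain of inequalities holds for the empirical frequencies exactly, not just approximately, because every A-and-B outcome is an A outcome, and every A outcome is an A-or-B outcome.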

My proposed fallacy is this: many people (particularly rationalists) who see a long tale have an instinct to cry “complexity penalty” without actually checking whether the logical connective between the elements of that tale is a conjunction or a disjunction, AND or OR, and thus fall into the trap of assigning a low probability to a disjunctive story. And in my experience, most AGI predictions are heavily disjunctive: the people making them (such as Nick Bostrom in his book) suggest a myriad of possible paths by which a superintelligence could arise, each of which is relatively probable given current trends (e.g. whole brain emulation is an active research area which has seen actual results), so the probability of the enterprise as a whole is much higher than that of any single path. This is true of many parts of the superintelligence narrative, from its formation to its takeoff to its potential powers. I don’t need five minutes to think of five different ways a superintelligence could reasonably take over the world, and I’m not superintelligent.

So the moral of this part here is that, when you see a long prediction about something, first see whether it’s disjunctive or conjunctive before looking for fallacies. Isaac Asimov may have been wrong about the exact picture the future would paint, but by golly a large number of his individual predictions did in fact come true!

My second point is not so much an objection as a sort of reminder about what MIRI is actually doing. I’m not sure what its original goals were, but it most certainly isn’t trying, by itself, to program a superintelligence, at least not right now. Ozy says:

So it seems possible the solution is not independent funding, but getting the entire AGI community on board with Friendliness as a project. At that point, I can assume that they will deal with it and I can return to thinking of technology funding as a black box from which iPhones and God-AIs come out.

The thing is, that is one of MIRI’s explicit goals: outreach about AI dangers. And they seem to be at least mildly successful, or at any rate something was, given that Google created an AI Ethics board when it bought DeepMind, and given the growing number of prominent intellectuals who have been talking about the dangers of AI lately, some of them directly mentioning MIRI.

My third and final objection is that I think zie misunderstood me when I talked about the predictive skill of people who actually build technologies. I didn’t mean that they have some magical insider information or predictive superpowers that allow them to know these things; I meant that when you’re the one building a thing, what you’re doing isn’t predicting as much as it is setting goals. Predicting what Google is going to do is one thing, being inside Google actually doing the things is a whole ‘nother, and when AGI researchers talk about AGI there is frequently an undertone of “even if no one else is gonna do it, I am.” Someone who works at MIRI isn’t concerned so much with the prediction that a superintelligence is possible as they are with their own ability to bring it about, or raise the odds of a good outcome if/when it does.

My last point is something Ozy touched upon and on which I want to elaborate. Zie mentioned that AGI is fundamentally different from other “large-scale” projects of the past in that, unlike, say, nukes, the way it’s done will severely impact its outcome. As it is, I’d argue that almost no conclusions at all can be drawn from the past funding and development of technological advances because… the sample size is tiny. We can’t judge whether individuals funding research is an effective way of getting that research done, because this idea, and the means to pursue it effectively, are brand new. During the 20th century, most technological advances happened thanks to the military, but that’s perfectly understandable given the climate: two world wars and a cold one among the great powers, and constant change in political and economic conditions…

But large tech companies are a new invention, and it is my impression that, since at least the mid-nineties, the private sector has had a hand in most technological advancements, and this seems to be increasingly the case. I’m not at all sceptical of the ability of individually funded technologies, especially software technologies, to play a large part in the future, because that’s what they’re doing right now, in the present.

But at any rate, there are a number of ways AGI could come about, and MIRI is trying to do what it can. So far, other than that, the FHI, and mmmmaaaaybe Google, it seems no one else is.

Posted in Rationality | 7 Comments

Alieving Rationality

Almost six years ago, Scott wrote a post on LW about the apparent lack of strong correlation of real-world instrumental success and studying what he calls “x-rationality” – that is, OB/LW-style rationality, of the kind that’s above and beyond the regular skills you can get from being a generally intelligent, thoughtful, and scientifically-minded person.

I’d be quite interested in hearing what his opinion is six years into the future, but my current one is that this situation hasn’t changed much, in general. In fact, I was linked to his post by a recent reply Ozy sent someone on zir blog, while commenting that zie didn’t spread LW memes because zie didn’t feel they were very useful. I’m not alone in this, then. (Let’s remember that CFAR exists now, though.)

I’d like to share my thoughts on another potential factor contributing to this situation, something that was alluded to in the post and by many of its commenters (including Scott himself and Anna Salamon), something I’ve noticed that… I do. A skill, maybe.

Aspiring x-rationalists are the people who look at the mountain of stuff on Overcoming Bias, Less Wrong, and other such sources, and decide that it makes sense, that their lives would be improved by the application of these techniques, so they go on and learn everything about it. They memorise it, they absorb all these memes to the point of being able to recite many of the more famous quotes by heart. And yet there isn’t a strong correlation! We’re not producing superheroes every other Tuesday! What gives?

I’d say it’s that believing rationality and alieving rationality are really different things.

Continue reading

Posted in Basic Rationality, Rationality | Leave a comment

On Arrogance

having or revealing an exaggerated sense of one’s own importance or abilities.

A friend of mine once mentioned, in a comment on some post or another in a Facebook debate group, that he had knowledge of maths far above the Brazilian average. That is a simple factual sentence, a true statement (which isn’t exactly surprising given what the Brazilian average actually is). The next few comments called him arrogant.

(ETA: This is an even better example of what I’m talking about here.)

I wonder what goes on in people’s heads when they say something like that. And by “wonder” I mean “sigh exasperatedly at the silliness of rules of etiquette.”

It’s clear, if you look at society and people in general, that people do not like feeling inferior. Not only that, people dislike feeling inferior so much that it’s become a generalised heuristic not to show superiority in any aspect. It’s rude to be seen as better than anyone at anything. It will give you trouble in most social circles. That can probably be easily explained: if you’re superior at something, everyone feels jealous, and stops helping you socially, so you end up being worse off than if you were just average.

It’s okay to want to be better than yourself. But being better than other people? You have to be more humble! How can you possibly think you could actually be better than other people?? That’s incredibly arrogant of you!

Yudkowsky makes a distinction between humility and social modesty: the latter is the kind of thing you have to show socially, the “don’t-stick-out” heuristic; the former is actual, real, rational humility, the kind that recognises exactly how sure you are about the outcome of a decision and what steps must be taken to minimise the possibility of disaster.

So people calling you arrogant is frequently, in fact, a motte-and-bailey argument. The definition I presented at the top, of a false belief in one’s superiority (or even just a belief in one’s “general superiority” as if that existed), that’s the motte. The bailey is expressing superior aptitude at anything at all without paying your due to social modesty; it’s acknowledging your skills when they’re actually good. How dare you claim you’re better than anyone else? You’re just as flawed and imperfect as all of us! Even if you’re not. You have to pretend you are, just to not commit social suicide.

What I usually say is this: it’s not arrogance if it’s true.

Continue reading

Posted in Basic Rationality, Rationality | Leave a comment

On Magical Universes

Any sufficiently advanced technology is indistinguishable from magic.
— Sir Arthur C. Clarke (1917 – 2008)

The above quote is quite famous, at least amongst certain types of people. And the core idea is a pretty idealistic and hopeful one: technology will one day get so advanced that it will look like magic.

Or maybe it’s actually quite realistic, under another lens. If you brought a peasant from the Middle Ages to the present and showed them fast-moving gigantic flying metal contraptions, thin screens that show people on the other side of the world, and little gadgets that let you scry the past and communicate with your loved ones no matter where they are, the peasant would run away screeching: “WITCHCRAFT!” They wouldn’t run very far, they’d probably be hit by a car, but they’d run alright.

Sufficiently analysed magic is indistinguishable from science (warning: TV Tropes link). This sentence is similar to the quote starting the post, but it’s not nearly as deep or meaningful. Science is, after all, just the method. If a thing exists, then it falls under the scope of science. So if magic exists and works then it can be science’d. Let’s try to science it. Exactly how magical does magic have to be before it goes beyond the boundaries of what’s achievable by technology? Exactly how advanced does technology have to be before it’s far enough from our suspension of disbelief that we’re willing to call it magic?

A more practical question might be: what should you conclude about the universe once you observe magic in it?

Continue reading

Posted in Mathematics, Philosophy | 20 Comments

Learning Bayes [part 2]

In part 1, I talked about the Bayesian way of dealing with, well, noise, in a certain sense. How do I figure out that I “should not” conditionalise on a person’s astrological sign when predicting the cost of the bridge they’ll build, but that I “should” conditionalise on the bridge’s material, without arbitrarily stipulating that the former has zero influence and the latter “some” influence? This was brought up because a friend of mine was talking about stuff to me. And stuff.

And then that friend commented on the same post explaining that this did not quite get to the heart of what he was looking for. The best way I could find to phrase it was as a problem of Bayesian model selection between differently-parametrised models. And like I said, I have no training in statistics, so I talked to my friend raginrayguns about it, and after a long time during which we both discussed this and talked and had loads of fun, we (he) sort of reached a conclusion.


I mean, there’s a high probability that we (he) reached a conclusion.

So, suppose we have two possible models, M_1(\textbf a) and M_2(\textbf a, \textbf b) that could explain the data, and these models have a different number of parameters. \textbf a is a vector of the parameters both models have in common, and \textbf b is the vector of the parameters that are present only in the second model.

My friend’s example is the following: he has a scatter plot of some data showing an apparently linear relationship between two variables: Y = α + βX + ε where I suppose ε is normally distributed. Upon closer inspection, however, it looks like there are actually two lines instead of only one! We had two interns, each of whom collected 50 of the samples.

So the common parameters are \textbf a = (\alpha, \beta)^T, and the parameters only the second model has are \textbf b = (\lambda_{\alpha 1}, \lambda_{\beta 1}, \lambda_{\alpha 2}, \lambda_{\beta 2})^T which we’ll call the intern effect. In that case, then, the αs and βs of the second model are going to be seen as the same α and β from the first plus this intern effect.

So, nothing changes. To figure out the posterior probability of the parameters, we’d just use Bayes’ Theorem; same goes for the posterior of the models. But the old frequentist problem of “models with more parameters always have better fit” still remains. How do we get rid of it?

The trick is not judging a model based on its best set of parameters, but rather averaging over all of them. Let’s try this. Suppose the data is represented by d. Then we want the posteriors p(M_1|d, X) and p(M_2|d, X). Or maybe we just want the posterior odds for them. Whichever may be the case, we have:

p(M_1|d, X) = \frac{p(d|M_1, X)}{P(d|X)}p(M_1|X)

And then we can find the probability of the data given a model using the law of total probability:

p(d|M_1, X) = \int p(d|\textbf a, M_1, X)p(\textbf a|M_1, X)d\textbf a

And of course, the same applies for Model 2:

p(d|M_2, X) = \int\int p(d|\textbf a, \textbf b, M_2, X)p(\textbf a|\textbf b, M_2, X)p(\textbf b|M_2, X)d\textbf ad\textbf b

And in these, p(d|\textbf a, M_1, X) and p(d|\textbf a, \textbf b, M_2, X) are just the likelihood functions of traditional statistics. Then, the posterior odds – which are in general much more useful since they just define how much the evidence supports a hypothesis when compared to another instead of in absolute terms – are given by:

O(M_1:M_2|D, X) = \mathcal{LR}(M_1:M_2;D)O(M_1:M_2|X)

Where the likelihood ratio is just:

\mathcal{LR}(M_1:M_2;D) = \frac{\int P(D|\textbf a, M_1, X)p(\textbf a|M_1, X)d\textbf a}{\int\int P(D|\textbf a, \textbf b, M_2, X)p(\textbf a|\textbf b, M_2, X)p(\textbf b|M_2, X)d\textbf ad\textbf b}

Here I’m using capitals for the probability because I’m no longer talking about the specific sampling distribution but rather its value on the observed data D.

And there’s the thing that’s used here and that never shows up in frequentist statistics, which is the prior distribution for… well, everything. We have the prior odds between the two models, and even if we’re just interested in which of the two models the data supports better, we still need the priors for the parameters themselves. And this method is, of course, more general than the case of two models sharing a core of common parameters, where one model has more parameters than the other; the models can have two completely different sets of parameters, and the same procedure applies.
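As a concrete (and entirely invented) illustration of the likelihood ratio above: with made-up data from two interns, unit-variance noise, and N(0, 2²) priors on all parameters, each marginal likelihood can be crudely estimated by averaging the likelihood over samples from the prior:

```python
import math
import random

# Invented data: (group, y) pairs from two interns; group 1 sits visibly higher.
data = [(0, 0.9), (0, 1.1), (1, 3.9), (1, 4.1)]

def loglik(alpha, b):
    # M1: y = alpha + noise; M2: y = alpha + b*group + noise (unit-variance noise).
    return sum(-0.5 * (y - alpha - b * g) ** 2 - 0.5 * math.log(2 * math.pi)
               for g, y in data)

def marginal(two_params, n=50_000):
    """Monte Carlo estimate of p(d|M) = E_prior[likelihood], with N(0, 2^2) priors."""
    total = 0.0
    for _ in range(n):
        alpha = random.gauss(0, 2)
        b = random.gauss(0, 2) if two_params else 0.0
        total += math.exp(loglik(alpha, b))
    return total / n

random.seed(1)
lr = marginal(False) / marginal(True)   # LR(M1 : M2; D)
odds = lr * 1.0                         # taking prior odds O(M1:M2|X) = 1
print(odds < 1)  # → True: the extra "intern effect" parameter earns its keep here
```

Averaging over prior samples is just the most literal reading of the integrals above; smarter estimators exist, but the point is that the extra parameters are penalised automatically, with no ad-hoc correction for model complexity.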

What would a prior for the parameters look like? It depends on the case, of course. One might expect that in the linear case described by my friend they’d be normally distributed with mean 0 (with some variance to be chosen), or something like that. Usually at this point we just throw up our arms and do a non-Bayesian thing: choose something convenient. Note that this is non-Bayesian: in the ideal case the prior would be uniquely determined by our state of knowledge, so picking and choosing priors that merely seem “reasonable” isn’t strictly justifiable, but it’s the best we can do in practice, lacking perfect insight into our own states of knowledge.

Alright. So now we have the posterior probability – or at least posterior odds – for the models. So do we just pick the one with the highest posterior and go with it?

Not quite. There are a few problems with this approach. First and foremost, this is a Decision Theory problem, and unless you care only about strict accuracy, there might be value in using a model you know to be wrong because it’s not too wrong and you can still draw useful inferences from it.

For example, suppose that instead of having that the difference between the two possible lines is because of different interns, it happens because a certain effect affects two populations differently. That would mean that, in order to estimate the effect in either population, you would have access to much less data than if you bundled up both populations and pretended they were under the exact same effect. And while this might sound dishonest, the difference between the two models might be small enough that the utility of having on average twice as many points of data is larger than the disutility of using a slightly incorrect model. But of course, there is no hard-and-fast rule about how to choose, or none that’s a consensus anyway, and this is sort of an intuitive choice to balance the tradeoff between fit and amount of data.

Another problem is that, even if you’re willing to just pick a model and run with it, maybe there is no preferred model. Maybe the posterior odds between these two models are one, or close enough. What do you do then? Toss a coin?

No. Bayesian statistics has a thing called a mixture model. We want to make inferences, right? That basically means getting a distribution for future data based on past data: p(d_f|D_p, X). We can, once more, use the law of total probability:

\begin{aligned}p(d_f|D_p, X) &= p(d_f|M_1, D_p, X)P(M_1|D_p, X) \\ &+ p(d_f|M_2, D_p, X)P(M_2| D_p, X) \\ &+p(d_f|\bar{M_1},\bar{M_2}, D_p, X)P(\bar{M_1}, \bar{M_2}|D_p, X) \end{aligned}

If we’re fairly confident that either model 1 or model 2 is the appropriate explanation for this data, i.e. P(\bar{M_1}, \bar{M_2}|D_p,X)\approx 0, then we can use only the first two lines. Even if we aren’t, we can still get a good approximation. So we predict future data by taking a weighted average of the predictions of each model, where each weight is the posterior probability of the corresponding model.
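In code, with placeholder numbers (the posteriors and predictive densities below are invented, not derived from any actual dataset), the mixture prediction is just:

```python
import math

def normal_pdf(y, mu, sigma):
    return math.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# Hypothetical posteriors P(M|D_p, X), assuming P(neither model) is negligible:
post = {"M1": 0.55, "M2": 0.45}
# Hypothetical predictive densities p(d_f|M, D_p, X) for a future observation:
pred = {"M1": lambda y: normal_pdf(y, 1.5, 1.0),
        "M2": lambda y: normal_pdf(y, 2.0, 1.0)}

def mixture(y):
    # p(d_f|D_p, X) = sum over models of p(d_f|M, D_p, X) * P(M|D_p, X)
    return sum(pred[m](y) * post[m] for m in post)

print(round(mixture(1.75), 3))  # → 0.387
```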

Posted in Mathematics, Probability Theory | 1 Comment

The use of Less Wrong

I’ve been planning on writing a post along these lines, and the recent thing on tumblr about the LW community has given me just the right motivation and environment for it. Specifically this nostalgebraist post gave me the inspiration I needed. He described the belief-content of LW as either obvious, false, or benign self-help advice one can find in many other places.

Now, nostalgebraist isn’t a LWer. I am. So let me say what the belief-content of LW looks like, to me. Why do I think LW-type “rationality” is useful? What’s the use of it all? Is it just the norms of discourse?

And of course you have to take this with a grain of salt. I’m a LWer, so I’m severely biased in its favour, compared to baseline. And even nostalgebraist is pretty warm towards the community, or at least the tumblr community, so even his opinion is somewhat closer to positive than baseline. To properly avoid confirmation bias, one should seek out the opinions of people who have had bad experiences with LW. I’ve seen quite a few on tumblr too, but none really outside of it, so there’s also the set of biases that come from there. This paragraph is supposed to be your disclaimer: I’m not an objective outside observer. This is the view from the inside; it’s why I personally think LW is useful, and why I (partially) disagree with nostalgebraist.

I think my first problem is: nostalgebraist is smart. And he’s got a certain kind of smarts, one that I find with some frequency in LW, that makes him say stuff like “‘many philosophical debates are the results of disagreements over semantics’ — yeah, we know.” The first point is: we don’t. I don’t know if I’m too used to dealing with people outside of LW, or if he’s too used to dealing with people around as smart as he is, but this sort of thing is not, in fact, obvious. Points like “don’t argue over words” and “the map is not the territory” and “if you don’t consciously watch yourself you will likely suffer from these biases” aren’t obvious! Most people don’t get them! I didn’t get them before I read LW, and the vast majority of people I meet (from one of the 100 best engineering schools in the world) don’t know them!

LW-type “insights” are not, in fact, obvious to most people. Most people – and yes I’m including academics, scientists, mathematicians, whatever, people traditionally considered intelligent – do in fact spend most of their lives ignoring this completely. So I’ll get back to what exactly those insights may be later.

The second problem is… I also think he’s objectively wrong about which beliefs are actually common amongst LWers. Just take a look at the 2013 LW Survey Results. In fact, the website itself barely talks about FAI, so I don’t understand where the idea that Singularity-type beliefs are widespread comes from. Maybe it’s because no one outside of LW talks about FAI and the Singularity at all, while we talk a little about it? I dunno; my personal experience with LW is that much less than 0.5% of the time we spend talking is dedicated to this kind of discussion, and even belief in the Singularity/FAI is often permeated with qualifiers and ifs and buts. And even the hardcore Bayesian thing isn’t all that settled either.

At any rate, there’s much more to it than just that.

Continue reading

Posted in Rationality | 3 Comments