## Confidence and Credibility

Three days ago I got slightly drunk with a few friends (two of whom were mentioned in a recent post), and one of them and I were trying to explain to the other what the difference between confidence and credible intervals was. Since we were, as mentioned, not exactly sober, that did not go as well as it could have; besides, it’s not exactly the simplest of concepts, and the distinction can be hard to really pinpoint. So, I’m writing this to explain it better.

Very often, when an estimated value is reported, it also comes with a “confidence interval,” which is supposed to say something about where the true value is likely to be. For example, when polling people for their voting intentions, maybe it’s reported that $40\% \pm 5\%$ of voters will vote for Sanders. Now, there are a few problems with this, the biggest of which is that this is a frequentist concept that does not treat probability in the same way we’re intuitively used to.

For all the use its methods see, frequentist interpretations of probability are actually quite counterintuitive. For a frequentist, probability is a sort of limit: the probability that a given event will occur is the limiting frequency with which it would occur should the trial I’m performing be repeated an infinite number of times. As such, there’s no such thing as “the probability that Sanders will win the 2016 election” or “the probability that it will rain tomorrow.” Either it will or it won’t; it’s not like you can repeat the 2016 election an infinite number of times and see how many times Sanders wins.

Bayesianism reflects something more intuitive, that the probability has to do with our uncertainty over what state the world occupies. Cox’s theorem shows that, if you follow a few reasonable-sounding constraints when dealing with your own uncertainty, then it behaves according to the laws of probability. So in that sense, when people talk about probability in their daily lives, their musings approach the Bayesian interpretation much more than the frequentist one.

And this is why there is a lot of misunderstanding about confidence intervals, as reported by papers and even sometimes the media. When someone reports a 95% confidence interval for a value, like how many people are likely to vote for Sanders next year, even the name suggests that we should be 95% confident that the true value will be there. But the more accurate interpretation is a bit subtler than that.

A confidence interval is tied to a procedure that generates it when using some data as input; given one such procedure that generates 95% confidence intervals, if I were to apply it to many different samples that estimate something, the true value of that something would fall inside the generated interval for 95% of those samples.

More formally, let $D$ be an observed dataset we’re using to estimate some parameter $\theta$, and $\gamma$ be the confidence we want ($0.95$ for a 95% confidence interval). Then a confidence interval with confidence level $\gamma$ is given by two random variables $u(D)$ and $v(D)$ such that:

$P(u(D) <\theta < v(D)) = \gamma$

The interpretation here is that there is a process that generates the numbers $u(D)$ and $v(D)$ (the process itself being represented by the functions $u(\cdot)$ and $v(\cdot)$), and if you generate a large number of data samples meant to estimate some value, then the intervals generated will contain that value 95% of the time.
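If you’d rather watch that happen than take my word for it, here’s a minimal simulation sketch. Everything specific in it is an assumption of mine for illustration: the true mean of 3, the known $\sigma = 2$, the sample size of 10, and the choice of the usual known-$\sigma$ normal interval as the procedure. The function names are made up, too.

```python
import random
from statistics import NormalDist, fmean

def normal_mean_interval(sample, sigma, level=0.95):
    """Procedure (u, v): interval for the mean of normal data with known sigma."""
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 when level is 0.95
    centre = fmean(sample)
    half_width = z * sigma / len(sample) ** 0.5
    return centre - half_width, centre + half_width

def coverage(theta, sigma, n, trials, level=0.95, seed=0):
    """Fraction of simulated datasets whose interval contains the true theta."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(theta, sigma) for _ in range(n)]
        u, v = normal_mean_interval(sample, sigma, level)
        hits += u < theta < v  # True counts as 1
    return hits / trials

# Illustrative values only; the fraction should land close to 0.95.
print(coverage(theta=3.0, sigma=2.0, n=10, trials=20000))
```

Note what is random here: `theta` never changes, while the interval is recomputed from a fresh dataset on every pass through the loop.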

It’s important to note that, to a frequentist, there’s no such thing as “the probability that $\theta$ is in the interval $(u(D), v(D))$.” Either it is or it isn’t: $\theta$ is a fixed aspect of reality that can’t really be changed, and once an interval has been generated, that’s it. All we can say is that, in the limit of an infinity of such intervals, a fraction $\gamma$ of them will contain $\theta$.

The motivation for this is to make it so that, in the long run of statistical experience, all people using $\gamma$ confidence intervals in all experiments, even if the experiments and procedures themselves are different, will be correct in a fraction of $\gamma$ of the experiments. In other words, the purpose is to minimise the rate of error in statistical practice.

A credible interval corresponds to our intuitive notion of what the confidence interval should be, where we treat the quantity we’re trying to estimate as an unknown parameter and we want to measure our uncertainty. Using notation similar to the above, suppose we have observed the sample as being $d$. Then a credible interval of credibility $\gamma$ (say, 95%) is given by:

$P(u < \theta < v | D = d, X) = \gamma$

Here, we don’t have a procedure that generates the values $u$ and $v$. Rather, we have a posterior distribution $p(\theta | D=d, X)$ (note that I used lower-case $p$ to indicate that it’s a distribution and not a probability value), and then we choose two numbers $u$ and $v$ that make the above equation true, or equivalently the one below:

$\int_u^v p(\theta | D=d, X) d\theta = \gamma$

where we can replace the integral with a summation in the case of a discrete variable. It follows that there’s no unique way of defining a credible interval. Not that there’s necessarily a unique way of defining a confidence interval, exactly, since there’s no unique procedure that can generate one, but a given procedure will always give the same “kinds” of intervals, and for a given sample and procedure there’s only a single interval.

That said, there are a few guidelines on how to choose appropriate $u$ and $v$. An example is choosing the narrowest possible interval that fits the conditions (this is called the highest posterior density interval); another is choosing an interval such that the probability that the value is above the interval is the same as the probability that it is below the interval (the equal-tailed interval); yet another is choosing an interval for which the mean is a central point.
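To make the first two choices concrete, here’s a rough sketch that computes both from a posterior evaluated on a grid. The skewed density $\theta e^{-\theta}$ is just a stand-in posterior I picked so the two intervals visibly differ; the grid bounds and the function names are likewise assumptions, not anything canonical.

```python
import math

def grid_posterior(density, lo, hi, steps=20000):
    """Evaluate an unnormalised density at grid midpoints and normalise it."""
    width = (hi - lo) / steps
    thetas = [lo + (i + 0.5) * width for i in range(steps)]
    weights = [density(t) for t in thetas]
    total = sum(weights)
    return thetas, [w / total for w in weights]

def equal_tailed(thetas, probs, level):
    """Cut (1 - level) / 2 of the probability mass off each tail."""
    tail = (1 - level) / 2
    cumulative, u, v = 0.0, None, None
    for t, p in zip(thetas, probs):
        cumulative += p
        if u is None and cumulative >= tail:
            u = t
        if v is None and cumulative >= 1 - tail:
            v = t
    return u, v

def highest_density(thetas, probs, level):
    """Keep the most probable grid cells until they hold `level` of the mass.
    For a unimodal posterior the kept cells form a single interval."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    mass, kept = 0.0, []
    for i in order:
        kept.append(thetas[i])
        mass += probs[i]
        if mass >= level:
            break
    return min(kept), max(kept)

# Stand-in skewed posterior, truncated at 20 where its mass is negligible.
thetas, probs = grid_posterior(lambda t: t * math.exp(-t), 0.0, 20.0)
et = equal_tailed(thetas, probs, 0.95)
hpd = highest_density(thetas, probs, 0.95)
print("equal-tailed:", et, "width", et[1] - et[0])
print("highest density:", hpd, "width", hpd[1] - hpd[0])
```

Both intervals hold 95% of the posterior mass, but on a skewed posterior like this one the highest-density interval comes out narrower. On a symmetric unimodal posterior the two would coincide.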

The motivation here, unlike in the case of confidence intervals, is not necessarily to minimise long term error in statistical practice, but rather to accurately reflect an agent’s uncertainty about a thing. Since a posterior distribution contains all the information about a quantity of interest that one currently has, the credible interval is just a more compact way of capturing some of that information in an easy-to-transmit format.

For all my talk that they’re very different, really, not the same at all, they sure still do look an awful lot like they’re the same. And it doesn’t help me that in the specific case of a normal distribution with ignorance priors, which is a very frequent case or assumption about the data, they coincide.

(More specifically, if

• we’re trying to estimate a single parameter
• and the data can be summarised by a single sufficient statistic (a statistic is sufficient with respect to a parameter if it’s a number derived from the data such that no other statistic calculated from the same data provides any additional information about the parameter)
• and the parameter happens to be a location parameter (a parameter $\mu$ is a location parameter if $p(x|\mu) = f(x-\mu)$ for some function $f(\cdot)$) with a uniform prior (i.e. $p(\mu) \propto 1$)
• or the parameter happens to be a scale parameter (a parameter $s$ is a scale parameter if $p(x|s) = \frac 1 s g(x/s)$ for some function $g(\cdot)$) with a Jeffreys prior (i.e. $p(s) \propto 1/s$)

then the credible interval and the confidence interval will be the same. In other cases, nothing can be said.)

But I think it’s best if we work with a practical example that shows how they differ.

The first and most obvious way in which they differ is that the credible interval takes the prior distribution into account. Like I said above, credible and confidence intervals are really only guaranteed to agree with noninformative priors, and if there’s any prior information not contained in the data themselves, then the credible interval will show that.

(By the way, you can skip this part if it’s too abstract for you, the next example is much clearer.)

If we use Wikipedia’s practical example, the observed sample has $n = 25$ observations with mean $\bar\mu = 250.2g$ and the underlying distribution is normal with standard deviation $\sigma = 2.5g$.

The 95% confidence interval is given by $(\bar\mu - 1.96 \sigma / \sqrt n, \bar\mu + 1.96 \sigma / \sqrt n) = (249.22, 251.18)$ and that’s all there is to it. You can’t ask what’s the probability that the true mean is inside this interval: either it is or it isn’t. Now that the procedure has been performed, we’re done. We know that, in the limit of an infinity of samples, the calculated interval will contain the true value 95% of the time, but that’s all.
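For anyone who wants to check the arithmetic, a few lines using only Python’s standard library reproduce it. The variable names are mine; the numbers are the ones from the example.

```python
from statistics import NormalDist

n, sample_mean, sigma = 25, 250.2, 2.5

# Two-sided 95% interval: leave 2.5% in each tail, so z is about 1.96.
z = NormalDist().inv_cdf(0.975)
half_width = z * sigma / n ** 0.5
interval = (sample_mean - half_width, sample_mean + half_width)

print(tuple(round(x, 2) for x in interval))  # (249.22, 251.18)
```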

Now, however, suppose I switch back to my Bayesian goggles and have some prior information about the parameter. Suppose my prior distribution for it is $p(\mu | X) = \mathcal N (\mu | 249.8, 10)$, where the second argument is the variance. In that case, my posterior distribution will be $p(\mu | D, X) = \mathcal N(\mu | 250.19, 0.24)$ and my 95% credible interval, the posterior mean plus or minus $1.96$ posterior standard deviations, will be about $(249.22, 251.16)$ (or, well, this is the narrowest 95% credible interval I can have, amongst all possible credible intervals). It’s slightly narrower than the confidence interval and nudged towards the prior mean, though with a prior this broad the difference is small. The interpretation here is that our subjective uncertainty over $\mu$ is concentrated almost completely in that interval; in the same sense we may say there’s a 95% probability it will rain tomorrow, we may say there’s a 95% probability $\mu$ will be in that interval.
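The update itself is the standard conjugate one for a normal mean with known $\sigma$: precisions (inverse variances) add, and the posterior mean is the precision-weighted average of the prior mean and the sample mean. A small sketch, assuming the second argument of $\mathcal N$ is a variance and using a helper name I made up:

```python
from statistics import NormalDist

def normal_posterior(prior_mean, prior_var, sample_mean, sigma, n):
    """Conjugate update for a normal mean when sigma is known."""
    prior_precision = 1 / prior_var
    data_precision = n / sigma ** 2
    post_var = 1 / (prior_precision + data_precision)
    post_mean = post_var * (prior_precision * prior_mean
                            + data_precision * sample_mean)
    return post_mean, post_var

post_mean, post_var = normal_posterior(249.8, 10, 250.2, 2.5, 25)

# The posterior is normal, hence symmetric and unimodal, so the narrowest
# 95% credible interval is mean +/- z * standard deviation.
z = NormalDist().inv_cdf(0.975)
half_width = z * post_var ** 0.5

print(round(post_mean, 2), round(post_var, 2))
print(round(post_mean - half_width, 2), round(post_mean + half_width, 2))
```

The half-width uses the posterior standard deviation, that is, the square root of the posterior variance, not the variance itself.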

This example’s boring, though. Errybody knows Bayesians take prior beliefs about data into account whereas frequentists don’t; this is Not News. Are there situations where the data themselves give us different answers for the confidence interval and the credible interval?

As a matter of fact, yes. Let’s look at one such example:

A 10-meter-long research submersible with several people on board has lost contact with its surface support vessel. The submersible has a rescue hatch exactly halfway along its length, to which the support vessel will drop a rescue line. Because the rescuers only get one rescue attempt, it is crucial that when the line is dropped to the craft in the deep water that the line be as close as possible to this hatch. The researchers on the support vessel do not know where the submersible is, but they do know that it forms two distinctive bubbles. These bubbles could form anywhere along the craft’s length, independently, with equal probability, and float to the surface where they can be seen by the support vessel.

Let’s call the location of these bubbles $y_1$ and $y_2$. Let’s also call the location of the hatch $\theta$, the value we’re trying to estimate. Since the bubbles need to form along the length of the submarine, we have that $\theta - 5 \leq y_1, y_2 \leq \theta + 5$, naturally (you can have a clearer picture of this in the image below), and since they form independently, their individual likelihood functions are given by:

$p(y_i|\theta, X) = \begin{cases} \frac 1 {10} &\text{ if }\theta - 5 \leq y_i \leq \theta + 5 \\ 0 &\text{ otherwise} \end{cases}$

Let’s create two new variables, $x_1$ and $x_2$, that are the bubbles ordered by location, so that $x_1 = \min(y_1, y_2)$ and $x_2 = \max(y_1, y_2)$. Then, $\theta - 5 \leq x_1 \leq x_2 \leq \theta + 5$, and we can express the joint likelihood function by their constraints:

$p(y_1, y_2 | \theta, X) = \begin{cases} \frac 1 {100} &\text{ if }x_1 \geq \theta - 5 \text{ and } x_2 \leq \theta + 5 \\ 0 &\text{ otherwise} \end{cases}$

However, to make that more explicitly a function of $\theta$, we could rearrange the condition so that:

$p(y_1, y_2 | \theta, X) = \begin{cases} \frac 1 {100} &\text{ if } x_2 - 5 \leq \theta \leq x_1 + 5 \\ 0 &\text{ otherwise} \end{cases}$

We can further rearrange the above if we define the average between the two values, $\bar x = (x_1 + x_2) / 2$, and their distance, $d = x_2 - x_1 = |y_1 - y_2|$ (since by definition $x_2 \geq x_1$). Then:

$\begin{aligned}x_2 - 5 &= x_2 - 5 + \frac {x_1} 2 - \frac {x_1} 2 \\ &= \frac {x_2} 2 + \frac {x_2} 2 - 5 + \frac {x_1} 2 - \frac {x_1} 2 \\ &= \frac{x_1 + x_2} 2 - 5 + \frac{x_2 - x_1} 2 \\ &= \bar x - \left( 5 - \frac d 2 \right) \end{aligned}$

A similar derivation shows that $x_1 + 5 = \bar x + (5 - d/2)$, so the likelihood can be rewritten yet again:

$p(y_1, y_2 | \theta, X) = \begin{cases} \frac 1 {100} &\text{ if } \bar x - \left(5 - \frac d 2 \right) \leq \theta \leq \bar x + \left(5 - \frac d 2 \right) \\ 0 &\text{ otherwise} \end{cases}$

The above immediately shows us two things: first, that $\bar x$ is a good point estimate for $\theta$; second, that the greater the distance between $y_1$ and $y_2$, the narrower the range of values of $\theta$ with nonzero likelihood is. This makes intuitive sense: if both bubbles are on top of each other, then the hatch could be anywhere from their position minus five to their position plus five; conversely, if they’re $10m$ apart, that means they were generated at the exact edges of the submarine, and we know the exact position of the hatch.
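Here’s the same likelihood as a small sketch in code, mostly to check the two extreme cases just described. The function names and the `half_length` argument are mine; the bubble positions in the calls are made up.

```python
def likelihood(theta, y1, y2, half_length=5.0):
    """Joint likelihood of two bubble positions for a hatch at theta."""
    x1, x2 = min(y1, y2), max(y1, y2)
    inside = x2 - half_length <= theta <= x1 + half_length
    return 1 / (2 * half_length) ** 2 if inside else 0.0

def theta_range(y1, y2, half_length=5.0):
    """Interval of theta values where the likelihood is nonzero."""
    centre = (y1 + y2) / 2
    d = abs(y1 - y2)
    slack = half_length - d / 2
    return centre - slack, centre + slack

print(theta_range(3.0, 3.0))   # bubbles on top of each other: (-2.0, 8.0)
print(theta_range(0.0, 10.0))  # bubbles 10 m apart: (5.0, 5.0)
```

The likelihood has the same height everywhere it is nonzero, so all the information in the data is in where that flat stretch sits and how wide it is.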

Now let’s design a 50% confidence procedure; that is, a procedure that, in the limit of being run an infinite number of times, will generate an interval containing the true value of $\theta$ 50% of the time.

Since the two bubbles are generated independently and with a uniform distribution, there’s a 50% probability that each bubble is generated below $\theta$, and a 25% probability that both are; likewise, there’s a 25% probability that both bubbles are generated above $\theta$. Therefore, there’s a 50% probability that one bubble will be generated above $\theta$ and the other below, so the interval with endpoints $(x_1, x_2)$, or equivalently $(\bar x - d/2, \bar x + d/2)$, has a 50% chance of containing $\theta$. That’s a 50% confidence interval, then. I’ll call it the nonparametric interval.
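A quick simulation sketch of that claim, for the sceptical. I fix the hatch at $\theta = 0$ for convenience (any other value would do, since everything just shifts along with it), and the function names, trial count and seed are arbitrary choices of mine.

```python
import random

def nonparametric_interval(y1, y2):
    """The interval between the two bubbles."""
    return min(y1, y2), max(y1, y2)

def coverage(theta=0.0, half_length=5.0, trials=100000, seed=1):
    """Fraction of simulated bubble pairs whose interval contains theta."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # Each bubble forms uniformly and independently along the craft.
        y1 = rng.uniform(theta - half_length, theta + half_length)
        y2 = rng.uniform(theta - half_length, theta + half_length)
        u, v = nonparametric_interval(y1, y2)
        hits += u < theta < v
    return hits / trials

print(coverage())  # should come out close to 0.5
```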

A credible interval would be built from the posterior distribution, however. Supposing our prior was noninformative – which was the entire reason for this exercise – so $p(\theta|X) \propto 1$, our posterior distribution is:

$p(\theta | y_1, y_2, X) = \begin{cases} \frac 1 {10 - d} &\text{ if } \bar x - \left(5 - \frac d 2 \right) \leq \theta \leq \bar x + \left(5 - \frac d 2 \right) \\ 0 &\text{ otherwise} \end{cases}$

Therefore a 50% credible interval centered around $\bar x$ is $(\bar x - \frac 1 2 (5 - \frac d 2), \bar x + \frac 1 2 (5 - \frac d 2))$. I’ll call it the Bayesian interval.

Now let’s look at the difference between them.

I’ll think of two parallel universes. In universe A, $y_1 = 1.5$ and $y_2 = 1$; in universe B, $y_1 = 0.5$ and $y_2 = 9.5$. These two situations are shown below:
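Plugging both universes into the two interval formulas from above gives concrete numbers to compare. This is just a sketch; the helper name and the dictionary layout are mine.

```python
def both_intervals(y1, y2, half_length=5.0):
    """50% nonparametric and 50% Bayesian intervals for one pair of bubbles."""
    x1, x2 = min(y1, y2), max(y1, y2)
    centre = (x1 + x2) / 2
    d = x2 - x1
    bayes_half_width = 0.5 * (half_length - d / 2)
    return {
        "nonparametric": (x1, x2),
        "bayesian": (centre - bayes_half_width, centre + bayes_half_width),
    }

universe_a = both_intervals(1.5, 1.0)
universe_b = both_intervals(0.5, 9.5)

print(universe_a)  # {'nonparametric': (1.0, 1.5), 'bayesian': (-1.125, 3.625)}
print(universe_b)  # {'nonparametric': (0.5, 9.5), 'bayesian': (4.75, 5.25)}
```

In universe A the bubbles are close together, so the nonparametric interval is tiny ($0.5m$ wide) while the Bayesian one is wide ($4.75m$). In universe B it’s the other way around: the nonparametric interval spans $9m$ and the Bayesian one only $0.5m$.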

The paper I took this from has a few more examples of confidence intervals, but I think the nonparametric one is the simplest to understand and illustrates my explanation very well. This is a case where the prior is completely noninformative and yet the confidence and credible intervals can be quite different.

This example also demonstrates, I think, very clearly the difference between the confidence and credible intervals: when we were talking about the nonparametric interval, we treated the interval as the random variable, and we were talking about the probabilities that a generated interval would contain the true value; in the case of the credible interval, we drew it from the posterior distribution of $\theta$ itself, which was treated as the variable, because to the Bayesian anything can be one.

Furthermore, some thought shows that sometimes our 50% confidence interval can contain the true value with 100% posterior probability. Consider the case where $d > 5$; clearly then the nonparametric interval must contain the true value, because the maximum distance between either bubble and the hatch is $5$, so if the bubbles are more than $5m$ apart the hatch is definitely somewhere between them. Yet that would still fit the definition of a 50% confidence interval, since prior to actually observing the data there was a 50% probability that the generated interval, whatever it might have turned out to be, would have contained the hatch.
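One way to see this numerically is to ask how often the nonparametric interval contains the hatch among only those runs where the bubbles landed a given distance apart. The sketch below also includes the posterior probability under the flat prior: the posterior is spread evenly over an allowed range of width $10 - d$, and the interval between the bubbles has width $d$, which gives $d / (10 - d)$ when $d < 5$ and $1$ otherwise. Function names, trial count and seed are assumptions of mine.

```python
import random

def posterior_prob_inside(y1, y2, half_length=5.0):
    """Posterior probability (flat prior) that theta lies between the bubbles."""
    d = abs(y1 - y2)
    if d >= half_length:
        return 1.0  # the whole allowed range of theta sits between the bubbles
    return d / (2 * half_length - d)

def conditional_coverage(min_d, max_d, theta=0.0, half_length=5.0,
                         trials=200000, seed=2):
    """Coverage of the nonparametric interval among runs with min_d <= d < max_d."""
    rng = random.Random(seed)
    hits = total = 0
    for _ in range(trials):
        y1 = rng.uniform(theta - half_length, theta + half_length)
        y2 = rng.uniform(theta - half_length, theta + half_length)
        if min_d <= abs(y1 - y2) < max_d:
            total += 1
            hits += min(y1, y2) < theta < max(y1, y2)
    return hits / total

print(posterior_prob_inside(1.5, 1.0))   # universe A: about 0.053
print(posterior_prob_inside(0.5, 9.5))   # universe B: 1.0
print(conditional_coverage(0.0, 1.0))    # bubbles close together: far below 0.5
print(conditional_coverage(5.0, 10.0))   # bubbles far apart: 1.0
```

The overall 50% is an average over runs like these, which is exactly why it says so little about any one interval once you’ve seen the bubbles.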

So not only do confidence intervals not take prior information into account, sometimes they don’t take even the data themselves into account, and saying that there’s a 50% probability that a 50% confidence interval contains the true value can be very misleading.


### 3 Responses to Confidence and Credibility

1. Professor Frink says:

You mention that in the long run, a confidence interval has the nice property that if you do N experiments, your confidence intervals contain the true value 95% of the time.
Worth noting is that the credible interval has no guarantee at all. It’s coherent, but because of the problem of setting priors, you could coherently create credible intervals that contain the true value much less than 95% of the time.

• pedromvilar says:

I’m not sure under what definition of “coherently” that’s true; unless I had a systematic problem with priors that made me always create unreasonable ones that don’t actually reflect how much information I currently have about the thing, that’s not true in the long run.

Another way to look at it is: in principle, every coherent informative prior is the posterior of some series of processes (likelihoods) that started from a noninformative prior, even if those likelihoods are more qualitative than anything – for instance, just choosing “normal distribution” as the form of a thing is already a prior assumption about it. Therefore, if your prior correctly integrates your “virtual likelihoods,” it should always give results at least as good as those confidence intervals give you; and if it doesn’t, then that’s your own fault.

And finally, like I’ve said, frequentism and Bayesianism answer different questions, and the confidence and credibility intervals are one situation where that’s very clear. Bayesianism tells you how to deal with your own subjective uncertainty based on what information you do have; frequentism tells you how to deal with long-term stochastic data in a stochastic world. There’s nothing wrong with using Confidence Intervals per se; the problem is if one treats them as Credible Intervals, and uses them to answer a completely different question than the one they were designed to answer. Both Credible and Confidence Intervals have their uses, as long as we know which questions they’re answering.