An (Intuitive?) Introduction to Entropy

There’s a concept in Probability Theory called “entropy,” which borrows a bit from the physical concept of the same name. Intuitively, it’s the degree of “surprise” of an observation, or of a probability distribution. Alternatively, it’s how much information you gain when you observe something, as opposed to something else.

Let’s be more specific.

Suppose I have a biased coin that gives heads 80% of the time and tails 20% of the time. If I toss it and see tails I’ll be more surprised than if I see heads. If I toss it 5 times and see 4 tails and 1 head I’ll be much more surprised than if I see 4 heads and 1 tail. So it seems that whatever definition of entropy I use must reflect that; it should show me a higher number for more surprising observations, i.e. ones with a lower probability. We conclude then that this number is decreasing in the probability of the thing: lower probability means higher entropy.

So, for now, let’s call the entropy of an observation h(\cdot). Then the above condition says that P(a) < P(b) \leftrightarrow h(a) > h(b).

Back to our example. Suppose I observe tails twice. Should I be twice as surprised as if I had observed it once? That sounds reasonable to me, so for now we have that h(a,b) = h(a)+h(b). Since P(a,b)=P(a)P(b) (when a and b are independent), a general form of h(x)=-\log_2P(x) does the trick just fine; the negative sign makes it a decreasing function of the probability, and the base of the logarithm will be explained in a bit (this is a retroactive pun and was completely unintended at the time of writing).
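To make this concrete, here's a minimal sketch in Python (the name `surprisal` is mine, not any standard API):

```python
import math

def surprisal(p):
    """Surprise h(x) = -log2 P(x) of observing something with probability p."""
    return -math.log2(p)

# Biased coin: heads 80%, tails 20%
print(surprisal(0.8))        # heads: barely surprising
print(surprisal(0.2))        # tails: considerably more surprising
# Two independent tails: the surprises add, matching h(a,b) = h(a) + h(b)
print(surprisal(0.2 * 0.2))  # equals surprisal(0.2) + surprisal(0.2)
```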

Now that we know how to calculate the entropy of a single observation, what’s the entropy of a distribution? That is, if I toss that coin a bunch of times, how surprised should I expect to be, on average? We can just calculate the expectation of that function \mathbb E[h]=\sum\limits_{x\in X}P(x)h(x)=-\sum\limits_{x\in X} P(x) \log_2 P(x) = H(X) (where X is the random variable that can take the relevant values). This is the average entropy of observations drawn from a distribution, usually just called the entropy of that distribution. In the case of our biased coin, then, we have:

H(\text{coin})=-0.8 \log_20.8 - 0.2 \log_20.2=0.722\text{ bits}
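That expectation is a one-liner to compute; a small sketch (the helper name `entropy` is just for illustration):

```python
import math

def entropy(probs):
    """H(X) = -sum P(x) log2 P(x): the expected surprise of a draw."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(round(entropy([0.8, 0.2]), 3))  # 0.722 bits, as above
```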

(Bits are short for “binary digits” and are the unit of \log_2 numbers. They’re also the names of those little 0s and 1s in a computer, and this is relevant, too. Now’s when the pun becomes relevant. Har-de-har-har.)

Let’s look at a graph of the entropy, as a function of the probability of heads.

So this function has a maximum when the probability of heads is 0.5: that's the probability at which, on average, we're most surprised. In our little example, we'd observe heads 80% of the time, so most of our results aren't surprising; only a few of them really surprise us. And this generalises: no matter how many possible observations there are (throwing a 6-sided die has 6, for instance), entropy is maximised when they are all equally probable.
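In lieu of the graph, a quick numerical scan reproduces the claim (a sketch; `binary_entropy` is a hypothetical helper name):

```python
import math

def binary_entropy(p):
    """Entropy, in bits, of a coin that gives heads with probability p."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# Scan probabilities of heads and find where the entropy peaks
ps = [i / 100 for i in range(101)]
best = max(ps, key=binary_entropy)
print(best, binary_entropy(best))  # peaks at p = 0.5 with H = 1 bit
```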

Furthermore, the more possible observations there are, the greater our entropy is. Suppose I throw a fair 6-sided die. The entropy of that throw is \log_26=2.585\text{ bits}, much greater than the 1 bit we get with the fair coin.
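For a fair draw over n outcomes, every term of the sum is (1/n)\log_2 n, so the entropy is just \log_2 n. A quick check (illustrative helper name):

```python
import math

def uniform_entropy(n):
    """Entropy of a fair n-outcome draw: n identical terms summing to log2(n)."""
    return -sum((1 / n) * math.log2(1 / n) for _ in range(n))

print(round(uniform_entropy(2), 3))  # 1.0 bit   (fair coin)
print(round(uniform_entropy(6), 3))  # 2.585 bits (fair 6-sided die)
print(round(uniform_entropy(8), 3))  # 3.0 bits  (fair 8-sided die)
```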

Suppose, now, that I throw an 8-sided die, and I want to send you a message, as short a message as possible, telling you what my result was. I can identify each possible throw with three symbols:

Throw   Symbols
1       000
2       001
3       010
4       011
5       100
6       101
7       110
8       111

So, if I sent you the message 101, you’d know I’d rolled the die and gotten a 6. If I sent you the message 011001 you’d know I’d gotten a 4 followed by a 2. But it also happens that the entropy of a fair 8-sided die is 3\text{ bits}! Does this have any meaning?
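The table above is just "throw i maps to the 3-bit binary expansion of i-1", which we can sketch as a toy encoder/decoder pair (hypothetical names):

```python
def encode(throws):
    """Map each throw (1..8) to its 3-bit codeword and concatenate."""
    return "".join(format(t - 1, "03b") for t in throws)

def decode(bits):
    """Split the bit string into 3-bit chunks and map each back to a throw."""
    return [int(bits[i:i + 3], 2) + 1 for i in range(0, len(bits), 3)]

print(encode([6]))       # "101"
print(decode("011001"))  # [4, 2]
```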

It turns out that it does. Suppose that, instead of it being a fair die, when I roll it there’s a 50% chance of getting a 1, a 25% chance of getting a 2, a 12.5% chance of getting a 3, a 6.25% chance of getting a 4, and a 1.5625% chance of getting each of the other results. The entropy of that distribution is 2\text{ bits}. Keep that in mind.

Now I want to message you about it. One thing I could do is just use the same table as above to encode each of those results, and I’d send you 3 bits all the time to communicate. But I’m a very efficient person, I don’t want to waste bits when I don’t have to. Suppose I used the following table instead:

Throw   Symbols
1       0
2       10
3       110
4       1110
5       111100
6       111101
7       111110
8       111111

Wait! Wasn’t I trying to use fewer bits to communicate? In more than half of that table I use more than 3!

Ah, but how many bits will I use on average? 50% of the time, I’ll use only 1 bit, because I’ll have rolled a 1. Another 25% of the time, I’ll use only 2 bits, because I’ll have rolled a 2. 12.5% of the time I’ll only need 3 bits, because I’ll have rolled a 3. I’ll only use more than 3 bits, really, 12.5% of the time. And in fact, as you may have guessed, the average number of bits I send with the above code is 2, which is exactly the entropy of my distribution!
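That calculation can be checked directly (a sketch, hard-coding the table above):

```python
import math

# Skewed die: P(1)=1/2, P(2)=1/4, P(3)=1/8, P(4)=1/16, P(5..8)=1/64 each
probs = {1: 0.5, 2: 0.25, 3: 0.125, 4: 0.0625,
         5: 0.015625, 6: 0.015625, 7: 0.015625, 8: 0.015625}
code = {1: "0", 2: "10", 3: "110", 4: "1110",
        5: "111100", 6: "111101", 7: "111110", 8: "111111"}

avg_bits = sum(p * len(code[t]) for t, p in probs.items())
H = -sum(p * math.log2(p) for p in probs.values())
print(avg_bits, H)  # both come out to exactly 2.0
```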

The code in that table is called a Huffman code, and a classical theorem says it’s optimal among codes that encode one symbol at a time; more generally, Shannon’s source coding theorem says the entropy of a distribution is a lower bound on the average number of bits necessary to communicate a draw from it. So, in that sense, we get back to what I said at the beginning about entropy also measuring the information contained in an observation.
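For the curious, here's a minimal sketch of the standard Huffman construction (repeatedly merge the two least probable nodes); this is my illustration, not code from the post:

```python
import heapq
import itertools

def huffman(probs):
    """Build a Huffman code by repeatedly merging the two least probable nodes."""
    counter = itertools.count()  # tie-breaker so heapq never compares the dicts
    heap = [(p, next(counter), {sym: ""}) for sym, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)
        p2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + bits for s, bits in c1.items()}
        merged.update({s: "1" + bits for s, bits in c2.items()})
        heapq.heappush(heap, (p1 + p2, next(counter), merged))
    return heap[0][2]

probs = {1: 0.5, 2: 0.25, 3: 0.125, 4: 0.0625,
         5: 0.015625, 6: 0.015625, 7: 0.015625, 8: 0.015625}
code = huffman(probs)
avg = sum(probs[s] * len(code[s]) for s in probs)
print(avg)  # expected length: 2.0 bits, matching the entropy
```

The exact codewords it produces can differ from the table (ties are broken arbitrarily), but the codeword lengths, and hence the average, come out the same.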

This is all fine and dandy for a discrete variable, but what about a continuous one? What if my variable is, say, standard Gaussian, with a nonzero probability density at every real number? Well, let’s use some calculus.

Suppose I divide the real numbers into tiny discrete bins. For the sake of the argument, let’s say they all have the same width \Delta (they don’t need to; the argument doesn’t change anyway). So, the probability that my observation falls in one of these bins, the i^{th}, say, is:

P(x \in i^{th} \text{ bin}) = \int\limits_{i\Delta}^{(i+1)\Delta}p(x)dx=p(x_i)\Delta

Where p is the probability density function and x_i is a number between i\Delta and (i+1)\Delta that makes the last equality true by the mean value theorem. So, the entropy of this distribution is:

H_\Delta = -\sum\limits_ip(x_i)\Delta\log_2(p(x_i)\Delta) = -\sum\limits_ip(x_i)\Delta\log_2p(x_i) - \log_2\Delta

That’s not the entropy of the continuous distribution p(\cdot), though, just of that particular discretisation of it. In order to calculate the continuous version of that, we’d have to take the limit of it as the discrete bins \Delta become arbitrarily small and the number of terms becomes arbitrarily large:

\lim\limits_{\Delta\rightarrow 0}H_\Delta = \lim\limits_{\Delta\rightarrow 0}\left(-\sum\limits_ip(x_i)\Delta\log_2p(x_i) - \log_2\Delta\right)

But see that pesky little term at the end, -\log_2\Delta? When \Delta gets arbitrarily small, that term gets arbitrarily large, and the limit diverges to infinity. For continuous variables, the entropy is infinite. Now that’s a bummer.

But it makes sense, doesn’t it? Because if every value is possible, then you should be infinitely surprised when you observe any one of them! Or, taking the informational view of entropy: to describe which of the infinitely many possible results you got, you’d need an infinite number of bits, an infinitely long message. For suppose the actual result was 3. You’d have to send me a message that said 3.0000000000... just to guarantee that it’s not 3.00000000001111... or some other number! The informational content of a draw from a continuous distribution is, literally, infinite.

The first term of that limit, though, doesn’t necessarily diverge. Actually, by the definition of the Riemann Integral, that term is just:

\lim\limits_{\Delta\rightarrow 0} \sum\limits_i-p(x_i)\Delta\log_2p(x_i) = \int\limits_{-\infty}^{\infty} -p(x)\log_2p(x) dx

This number is known as the differential entropy of a continuous distribution: it’s what remains of the discretised entropy once the divergent -\log_2\Delta term is dropped.
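We can see that relation numerically. For a standard Gaussian, the differential entropy has the closed form \frac12\log_2(2\pi e)\approx 2.047 bits, and the discretised entropy should come out close to that minus \log_2\Delta. A sketch (the helper names and the midpoint approximation are mine):

```python
import math

def gauss_pdf(x):
    """Density of the standard Gaussian."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def discretised_entropy(delta, lo=-10.0, hi=10.0):
    """H_delta = -sum p(x_i)*delta * log2(p(x_i)*delta), bins of width delta."""
    h, x = 0.0, lo
    while x < hi:
        p = gauss_pdf(x + delta / 2) * delta  # midpoint estimate of the bin mass
        if p > 0:
            h -= p * math.log2(p)
        x += delta
    return h

h_diff = 0.5 * math.log2(2 * math.pi * math.e)  # differential entropy, ~2.047
for delta in (0.1, 0.01):
    print(discretised_entropy(delta), h_diff - math.log2(delta))
```

As \Delta shrinks, the two columns agree ever more closely, while both grow without bound.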

It is not, however, an entropy value. For one, it can be negative. I mean whoever heard of being negatively surprised, or sending negatively many bits to someone else? Still, it’s not without its uses. For example, it just happens that the Gaussian of mean \mu and variance \sigma^2 is the distribution that maximises the differential entropy amongst all distributions with that mean and that variance.
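As a taste of that maximisation property, compare the standard Gaussian with a uniform distribution of the same mean and variance (both closed forms are standard; the comparison setup is mine):

```python
import math

# Differential entropies in bits, both for mean 0 and variance 1:
#   Gaussian N(0, 1):                 (1/2) log2(2*pi*e)
#   Uniform on [-sqrt(3), sqrt(3)]:   log2(2*sqrt(3))  (this interval has variance 1)
h_gauss = 0.5 * math.log2(2 * math.pi * math.e)
h_uniform = math.log2(2 * math.sqrt(3))
print(h_gauss, h_uniform)  # ~2.047 vs ~1.792: the Gaussian wins
```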

But that’s for another post.
