## Bayesian Networks

I have mentioned before that Bayesian Inference is, in general, intractable. People like Gaussians a lot because many nice closed-form results can be derived from them, and analytical treatment is generally possible and often even easy; for more general distributions, you need a lot of information (and computation) in order to perform useful inference.

Consider, for instance, the three-parameter Student’s t-distribution, which I talked about in two posts of mine. Unlike the normal distribution, which has very neat closed forms for many of its quantities of interest, the Student’s t has no closed-form solution for its Maximum Likelihood Estimate or its Maximum A Posteriori value. There are iterative algorithms to calculate these things, sure, but they have to be applied case by case.

Even that is not what I mean by “intractable,” though. I think the best example we can use is that of a few discrete binary variables. We know from the product rule (of which Bayes’ Theorem is a direct consequence) that, for any two variables $A$ and $B$, $P(A,B) = P(A|B) P(B)$, whence to infer the value of $A$ after having observed $B$ we need to calculate $P(A|B) = P(A,B) / P(B)$.

Now suppose we’re dealing with four binary variables, $A$, $B$, $C$, and $D$, and we have a joint distribution table for them (that is, a table with the values of $P(A,B,C,D)$ and $P(A,B,C,\neg D)$ and $P(A,B,\neg C,D)$ and so on for every possible combination). Then, in order to calculate $P(A|B)$ and perform inference, we need:

\begin{aligned} P(A|B) &= \frac {P(A,B)}{P(B)} \\ &= \frac {P(A,B,C,D) + P(A, B, C, \neg D) + P(A, B, \neg C, D) + P(A, B, \neg C, \neg D)} {P(A, B, C, D) + P(A, B, C, \neg D) + P(A, B, \neg C, D) + P(A, B, \neg C, \neg D) + P(\neg A, B, C, D) + P(\neg A, B, C, \neg D) + P(\neg A, B, \neg C, D) + P(\neg A, B, \neg C, \neg D)} \end{aligned}

(That is the tiniest size $\LaTeX$ can render anything on my blog, and it still wasn’t enough.)

To make the above clearer, I’ll rewrite it:

$P(A|B) = \frac {P(A,B)} {P(B)} = \frac {\sum\limits_{d\in D}\sum\limits_{c\in C} P(A, B, c, d)} {\sum\limits_{d\in D}\sum\limits_{c\in C}\sum\limits_{a\in A} P(a, B, c, d)}$

where I’m writing $a\in A$ as shorthand for $a \in \{A, \neg A\}$. We have to sum over all possible values of the other variables.

The numerator, then, is a sum of $4 = 2^2$ terms, and the denominator a sum of $8 = 2^3$ terms. This suggests a general rule, which is correct: if we’re dealing with $n$ binary variables, then simple inference of the value of one of them when another has been observed requires summing $2^{n-2}$ terms for the numerator and $2^{n-1}$ terms for the denominator. For small values of $n$ this is not too bad, but the number of steps grows exponentially with $n$.
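The brute-force calculation above is easy to sketch in code. Here’s a minimal NumPy version; the joint table is filled with random values just so we have concrete numbers to sum over:

```python
import numpy as np

rng = np.random.default_rng(0)

# A full joint distribution over n = 4 binary variables (A, B, C, D),
# stored as a 2x2x2x2 array; index 1 means "true", 0 means "false".
joint = rng.random((2, 2, 2, 2))
joint /= joint.sum()  # normalise so all 2^n entries sum to 1

# P(A | B): the numerator sums over C and D (2^(n-2) = 4 terms); the
# denominator additionally sums over A (2^(n-1) = 8 terms).
numerator = joint[1, 1, :, :].sum()    # P(A, B)
denominator = joint[:, 1, :, :].sum()  # P(B)
p_a_given_b = numerator / denominator

print(p_a_given_b)
```

Every query requires walking an exponentially large slice of the table; nothing about the structure of the problem is being exploited yet.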

Not to mention the exponentially many memory entries necessary to even have the table! As you might’ve noticed, we need to store $2^n - 1$ values (the $-1$ there is because they all must necessarily sum to $1$, so the last value can be deduced from the other ones) in the table. And usually, we’re dealing with much more than $4$ variables, and stuff gets complicated really fast when numbers start growing.

(No, really fast. Like, I’m fairly good at maths and the table of the time it takes to do breadth-first search in a $b=10$ tree as a function of its depth $d$ is still breathtaking:

Consider that we can have wayyyyy more than $14$ variables in an inference problem and you can surely understand why this can be fairly hard.

(Well, this is not a completely fair comparison; breadth-first search takes much more time than Bayesian Inference over binary variables. My point is just how fast exponential complexity grows, really, in practical terms.)

And even this analysis is not complete, since a full Bayesian treatment of real life would involve a Solomonoff prior over every possible reality, and that is literally uncomputable, amongst other complications. I’m just trying to give you a taste of what I mean.

We need some way of making this easier on us. One such way is a Bayesian Network.

Informally, a Bayesian Network is a graphical model that represents the dependencies (and lack thereof) among a set of variables. I’ll explain this with an example, the universal example for explaining Bayes Nets:

Suppose that there are two events which could cause grass to be wet: either the sprinkler is on or it’s raining. Also, suppose that the rain has a direct effect on the use of the sprinkler (namely that when it rains, the sprinkler is usually not turned on).

A Bayes Net treats the above problem with two tools: a graphical model and a few conditional probability tables. For the above example, one possible representation is given by:

So $P(R) = 0.2$, $P(S | R) = 0.01$, $P(W | \neg S, R) = 0.8$, and so on. Note that, since the first and second columns of each table above must sum to $1$ on each line, the tables really only hold $7 = 2^{3} - 1$ independent values. That’s exactly as many values as a full joint distribution would have, but only because this graph is fully connected: every node is linked to every other node.
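To make the tables concrete, here is a sketch of this network in code. Only $P(R) = 0.2$, $P(S|R) = 0.01$, and $P(W|\neg S, R) = 0.8$ come from the text above; the remaining entries are placeholder values I’ve made up for illustration:

```python
# CPTs for the Rain -> Sprinkler -> WetGrass network. Only P(R) = 0.2,
# P(S|R) = 0.01, and P(W|~S, R) = 0.8 are from the text; the rest are
# placeholder values.
p_r = 0.2                                # P(R)
p_s_given_r = {True: 0.01, False: 0.4}   # P(S | R)
p_w_given_sr = {                         # P(W | S, R)
    (True, True): 0.99,
    (True, False): 0.9,
    (False, True): 0.8,
    (False, False): 0.0,
}

def joint(s, r, w):
    """P(S=s, R=r, W=w) via the chain-rule factorisation of the net."""
    pr = p_r if r else 1 - p_r
    ps = p_s_given_r[r] if s else 1 - p_s_given_r[r]
    pw = p_w_given_sr[(s, r)] if w else 1 - p_w_given_sr[(s, r)]
    return pr * ps * pw

# The seven independent parameters determine all 2^3 joint values:
total = sum(joint(s, r, w) for s in (True, False)
            for r in (True, False) for w in (True, False))
print(total)
```

The point is that the full joint is never stored anywhere: every joint value is reconstructed on demand as a product of table lookups.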

Suppose the sprinkler’s rain detection system is busted, and it can no longer tell when it’s raining. The above would then change to:

Note that now the sprinkler’s conditional table has one entry fewer, since the sprinkler’s activation no longer depends on the rain. The biggest advantage of a Bayes Net lies exactly in the independencies it encodes. The gain becomes more pronounced the more variables we have, but in general inference is easier with a Bayes Net.

Suppose, for instance, that I know the grass is wet, and I want to know whether it’s raining and the sprinkler is off. In a fully general sense, with the joint probability table, I’d have to calculate:

\begin{aligned} P(\neg S, R | W) &= P(\neg S | R, W) P(R | W) \\ &= \frac {P(\neg S, R, W)} {P(W)} \end{aligned}

And that last term would have that annoying sum over all possible values of sprinkler and rain in the denominator. However, with the Bayes Net, I have the conditional probability tables, and I have the independency $P(S | R) = P(S)$, whence:

\begin{aligned} P(\neg S, R | W) &= \frac {P(W | \neg S, R) P(\neg S) P(R)} {P(W)} \end{aligned}

All the values in the numerator are stored directly in the tables, and the denominator can be calculated from the tables of the three variables.
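That query can be sketched directly. As before, $P(R) = 0.2$ and $P(W|\neg S, R) = 0.8$ come from the text, while $P(S) = 0.3$ and the other wet-grass entries are placeholder values:

```python
# Sprinkler is now independent of rain: P(S | R) = P(S). P(S) = 0.3 is
# a placeholder; P(R) = 0.2 and P(W|~S, R) = 0.8 are from the text.
p_r, p_s = 0.2, 0.3
p_w_given_sr = {(True, True): 0.99, (True, False): 0.9,
                (False, True): 0.8, (False, False): 0.0}

def pr(r): return p_r if r else 1 - p_r
def ps(s): return p_s if s else 1 - p_s

# Numerator P(W|~S, R) P(~S) P(R): every factor is a table lookup.
num = p_w_given_sr[(False, True)] * ps(False) * pr(True)

# Denominator P(W): a sum over the 2^2 settings of (S, R).
p_w = sum(p_w_given_sr[(s, r)] * ps(s) * pr(r)
          for s in (True, False) for r in (True, False))

print(num / p_w)  # P(~S, R | W)
```

Note how the independency $P(S|R) = P(S)$ lets the numerator factor into three stored values, with no summation at all.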

Now, this may seem like not much of a gain, since the denominator still has exponentially many terms to sum over. However, Bayesian Networks have another property, not visible in the tables alone: the Local Markov Property. Basically, they’re “local,” in the sense that to calculate the probabilities of a node you only need its own values and the values of the nodes directly linked to it.

For example, suppose my Bayes Net is actually like this:

(Before anything, note that this table has $9$ entries, where a fully general joint distribution over four binary variables would have $2^4 - 1 = 15$ entries.)

In this case, the sprinkler doesn’t detect the rain, but rather whether it’s cloudy. Or maybe it doesn’t detect that either, maybe there’s a gardener who sees that it’s cloudy and turns off the sprinkler when that happens. Anyway, what I want you to see is that the grass probability table is still the same size. The new variable, “cloudy,” affects it not at all, because sprinkler and rain are in the way.

So while before we’d have to sum, in the denominator, over all possible values of sprinkler, rain, and cloudy, with this Bayes Net we can safely ignore cloudy. Yes, Bayesian Inference is still exponential in the number of variables involved, but the independencies encoded in the Bayes Net ensure that “the variables involved” are only those directly linked to whatever we’re interested in. Stuff is strictly local, and there is significant gain in computation.
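This locality can be checked numerically. Here is a sketch of the four-node net (all numbers are placeholders, since the text gives no table for this version), verifying that once the grass’s parents are fixed, cloudy makes no difference:

```python
from itertools import product

# CPTs for the Cloudy -> {Sprinkler, Rain} -> WetGrass net.
# All numbers here are placeholder values.
p_c = 0.5
p_s_given_c = {True: 0.1, False: 0.5}
p_r_given_c = {True: 0.8, False: 0.2}
p_w_given_sr = {(True, True): 0.99, (True, False): 0.9,
                (False, True): 0.8, (False, False): 0.0}

def joint(c, s, r, w):
    f = lambda p, x: p if x else 1 - p
    return (f(p_c, c) * f(p_s_given_c[c], s)
            * f(p_r_given_c[c], r) * f(p_w_given_sr[(s, r)], w))

def p_w_given(s, r, c):
    """P(W | S=s, R=r, C=c), computed from the full joint."""
    num = joint(c, s, r, True)
    den = num + joint(c, s, r, False)
    return num / den

# Local Markov Property: conditioned on its parents S and R, the grass
# variable W is independent of its non-descendant C.
for s, r in product((True, False), repeat=2):
    assert abs(p_w_given(s, r, True) - p_w_given(s, r, False)) < 1e-12
```

The assertion passes for every setting of sprinkler and rain, which is exactly the independency that lets us drop cloudy from the query.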

Of course, there’s still the problem that, even if local, inference is still exponentially complex, and the power of the Bayes Net depends very strongly on how many independencies are encoded by it. But still, it helps quite a bit.

More formally, let $X$ be a set of random variables in the Bayesian sense, and $p(\cdot)$ a joint probability density (or mass) function over $X$. Let $G = (V, E)$ be a directed acyclic graph, where $V$ is the set of vertices of the graph (with the property $|V| = |X|$) and $E$ is its set of directed edges. For every $v \in V$, let $\text{pa} (v)$ be the set of all vertices with an edge pointing to $v$ (we call that set the set of parents of $v$). Then $(X, p)$ is a Bayesian Network with respect to $G$ if $p(x)$ can be written as

$p(x) = \prod\limits_{v \in V}p(x_v | x_{\text{pa} (v)})$

where $x_v \in X$ is the unique variable associated with $v$ and $x_{\text{pa} (v)} \subseteq X$ is the set of variables associated with the parents of $v$. We will refer to $v$ and $x_v$ as the same thing, for ease of notation.
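The factorisation translates almost verbatim into code. Here’s a minimal sketch for binary variables, reusing the cloudy/sprinkler/rain/grass structure with made-up numbers:

```python
from itertools import product

# p(x) = prod_v p(x_v | x_pa(v)). `parents` encodes the DAG; `cpt[v]`
# maps a tuple of parent values to P(x_v = True | parents).
# All numbers are placeholder values.
parents = {"C": (), "S": ("C",), "R": ("C",), "W": ("S", "R")}
cpt = {
    "C": {(): 0.5},
    "S": {(True,): 0.1, (False,): 0.5},
    "R": {(True,): 0.8, (False,): 0.2},
    "W": {(True, True): 0.99, (True, False): 0.9,
          (False, True): 0.8, (False, False): 0.0},
}

def p(assignment):
    """Joint probability of a full assignment {variable: bool}."""
    prob = 1.0
    for v, pa in parents.items():
        p_true = cpt[v][tuple(assignment[u] for u in pa)]
        prob *= p_true if assignment[v] else 1 - p_true
    return prob

# A valid factorisation defines a proper distribution: it sums to 1.
variables = list(parents)
total = sum(p(dict(zip(variables, values)))
            for values in product((True, False), repeat=len(variables)))
print(total)
```

Each factor touches only a node and its parents, which is the formal counterpart of the locality discussed above.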

The above definition covers discrete, continuous, and hybrid Bayesian Networks alike, and it entails the Local Markov Property, which states that every variable in a Bayesian Network is independent of its non-descendants conditional on its parents.

Bayesian Networks are sometimes used in Machine Learning for classification problems, and occasionally even for regression. However, relatively few people actually work with them: even though they are a significant improvement over fully general Bayesian Inference, exact inference in Bayes Nets is still intractable in general (in fact, it is NP-hard), and even approximate inference is NP-hard in the worst case.

Their greatest advantages, in addition to the technical ones, are ease of use and intuitiveness: their meaning tends to be clear at a glance. This is in fact how I mostly use them, as intuition pumps and demonstration tools. However, I will write more in the future about inference in Bayes Nets and its applications to Machine Learning.