## Artificial Neural Networks

Of the many, many different ideas put forward to explain, model, and experiment with artificial cognition, one of the most practically successful is the Artificial Neural Network. Not because we’ve managed to actually create cognition with one, but because of how useful it is for Machine Learning and Pattern Recognition. If you’ve ever heard of the latest buzz about “Deep Learning,” it has to do with that.

The inspiration, as you might have guessed, are biological neural networks, such as brains. The ANN can be seen as a collection of “neurons,” basic units of calculation, organised in many interconnected layers, which takes a vector of inputs and combines them many times, applying transformations to each such combination, until it reaches the output layer with the output of the computation.

In the above example, we have the inputs $x_1$, $x_2$, and $x_3$. Then, each hidden node $h_i$ combines them (each with its own weight) and applies a (generally nonlinear) transformation on them. After that, each output node $y_i$ does the same, with its own transformation applied, and giving us its result in the end.

The networks don’t have to be fully connected like in the above case, we could remove some arrows. They also don’t have to only have two layers (not counting the inputs), and can have as many layers as we like (though it can be easily shown that if all the transformations are linear then the entire network is equivalent to a single-layer network). And while that might look all interesting and such, at first glance it may not look particularly useful – at least, not without many many layers and thousands upon thousands of neurons, so we could approximate some part of a brain, for instance.

But there are some very interesting results about ANNs. One of the most astonishing is that, if the transformations applied by the nodes fall within a certain class of functions, then we need only two layers in order to be able to approximate any continuous function to arbitrary precision (though how many nodes we’d need for that is undefined), and only three layers to be able to approximate any mathematical function at all (same). And, like I said, ANNs are very useful for Machine Learning, which implies that they can learn these functions.

Let’s dig deeper.

More formally, an Artificial Neural Network is defined by three types of parametres:

1. The interconnection pattern between the different layers of neurons;
2. The learning process for updating the weights of the interconnections;
3. The activation function that converts a neuron’s weighted input to its output activation.

In full generality, let $z_i^{(n)}$ be the output of the $i^{th}$ node of the $n^{th}$ layer. Then it is given by the following recursive relations:

$z_0^{(n)} = 1$

$z_i^{(0)} = x_i$

$z_i^{(n+1)} = f_{i}^{(n+1)}\left(a_i^{(n+1)}\right)$

$a_i^{(n+1)} = \sum_jw_{ji}^{(n+1)}z_j^{(n)}$

where:

• $x_i$ is the $i^{th}$ input node;
• $f_i^{(n+1)}$ is called an activation function for the $i^{th}$ node of the $(n+1)^{th}$ layer of the network;
• $a_i^{(n+1)}$ is a linear combination of all the outputs of the $n^{th}$ layer and the input of the $i^{th}$ node of the $(n+1)^{th}$ layer;
• $w_{ji}^{(n+1)}$ is the weight of the connection between the $j^{th}$ node of the $n^{th}$ layer and the $i^{th}$ node of the $(n+1)^{th}$ layer, except for $w_{0i}^{(n+1)}$, which is the bias of that node, and is multiplied by the constant $z_0^{(n)} = 1$.

We will further define:

• $\textbf x_n$ as the $n^{th}$ input vector, so that $\textbf x_n = (x_{1n}, x_{2n}, ...)^T$ contains the values of each input node $x_i$ at that vector;
• $\textbf X$ as the matrix containing a set of inputs to the network (so the $n^{th}$ column is $\textbf x_n$);
• $\textbf w$ as the vector with all the weights $w_{ji}^{(n)}$ for all possible combinations of $n$, $i$, and $j$;
• $y_i(\textbf x_n, \textbf w)$ as the value of the $i^{th}$ output of the network when evaluated on the input vector $\textbf x_n$ with weight vector $\textbf w$;
• $\textbf y(\textbf x_n, \textbf w)$ as the output vector that contains the values of each output node of the network when evaluated on the input vector $\textbf x_n$ with weight vector $\textbf w$;
• $\textbf Y(\textbf X, \textbf w)$ as the output matrix that contains the values of each output vector, the $n^{th}$ column given by $\textbf y(\textbf x_n, \textbf w)$ and the $i^{th}$ line given by \textbf y_i(\textbf X, \textbf w) is an $n$-dimensional vector with the values of the $i^{th}$ output when each of the $\textbf x_n$ are given as input.

From the above definitions, it follows that, if our ANN has $M$ layers, then $y_{i}(\textbf x_{n}, \textbf w) = z_i^{(M)}[\textbf x_n]$ (where the brackets indicate that this is the value of that node of the network when its input vector is $\textbf x_n$).

The functions $f_i^{(n)}$ define the type-3 parametres of the network, its activation functions. Some of the $w_{ji}^{(n)}$ are constrained to be $0$, and those constraints define which connections are absent from the network, so that all others define connections that do exist; these, together with the activation functions (which can sometimes be the identity and serve as a “skip-layer” function), determine the interconnection pattern between the different layers. So we only need to define the learning process that will turn a fun abstract mathematical object into a powerful regression and classification algorithm.

We will use the convention that each of the activation functions of all nodes of the $n^{th}$ layer is either equal to the same function $f^{(n)}$ or equal to the identity $i(x) = x$ so as to serve as a “skip-layer” function.

In a future post I might explain in more depth what exactly Machine Learning algorithms are concerned with, but until I do, here’s a quick rundown.

Machine Learning is typically concerned with general algorithms that can look at a corpus of data, learn from it, and then be able to make predictions. It can be divided into a few categories, and two of them are what we call “classification” and “regression.”

• In a classification problem, a given datum can belong to one of a few classes, and after observing many data labelled with the correct classes, it’s the algorithm’s task to predict to which class new data will belong;
• Regression problems are similar, except instead of a datum belonging to one of a discrete set of classes, it’s an input to a function. Then, we have many data with the appropriate input-output information, and we want to approximate the function that would generate those pairs.

Neural Nets are awesome at this.

Let’s look at a regression problem, first. In addition to the definitions discussed above, let $\textbf t_n$ be the target vector for input $\textbf x_n$. This target vector is a value that only exists when we’re training our network, and we’re using input-output pairs to train it, so that we know what outputs we “want” a given input to give.

Furthermore, we’ll assume that this output is noisy; in other words, there is some function $\textbf g(\cdot)$ that we’re trying to approximate with our $\textbf y(\cdot, \cdot)$ such that $\textbf t_n = \textbf g(\textbf x_n) + \boldsymbol \varepsilon(\textbf x_n)$ where $\boldsymbol \varepsilon(\cdot)$ is some form of noise that we’re trying to filter out.

As a concrete example, suppose our input vectors $\textbf x_n$ consist in a single real number $x_n$, and the outputs associated with them (i.e. the “target values” $\textbf t_n$) are also single real numbers ($t_n$) distributed according to the little squares in the picture:

Then we can talk about a probability $p(t_n | x_n)$. Since $y(x_n, \textbf w)$ is our approximation to the true function, one possible model for the noisy parametre is:

$p(t_n|x_n, \textbf w, \beta) = \mathcal N(t_n|y(x_n, \textbf w), \beta ^{-1})$

where $\beta$ is the precision of the noise. So, since we know that there’s noise in the data, we’re not really trying to predict the values $t_n$ exactly, but rather somehow to minimise the error we’d be making by using our function $y$. In the more general, multivariate case, we could have

$p(\textbf t_n|\textbf x_n, \textbf w, \beta) = \mathcal N(\textbf t_n | \textbf y(\textbf x_n, \textbf w), \beta^{-1} \textbf I)$

where we’re assuming the noise of each target value for our input vector is independent of the noises of the other values.

We want our network to approximate the function that associates our $\textbf x_n$ with their desired outputs $\textbf t_n$, but we can’t touch the outputs and the inputs themselves; we can only change our weights $\textbf w$. One possible way, then, to minimise the error our network will be making is to take the likelihood function,

$p(\textbf t|\textbf X, \textbf w, \beta) = \prod\limits_{n=1}^N p(\textbf t_n | \textbf y(\textbf x_n, \textbf w), \beta^{-1} \textbf I)$

where $\textbf t$ is the matrix with the target values for each input in the training set, and try to maximise it. That happens to be equivalent to minimising the inverse of the above, or, even better, minimising the negative of the logarithm of the above. Then we have an error function:

$E(\textbf w) = \frac \beta 2 \sum\limits_{n=1}^N ||\textbf y(\textbf x_n, \textbf w) - \textbf t_n||^2 - \frac {NK} 2 \ln \beta + \frac {NK} 2 \ln (2\pi)$

where $K$ is the total number of target values. Since, like I said, the only part under our control are the weights $\textbf w$, what I really need to minimise is the sum-of-squares function:

$E(\textbf w) = \frac 1 2 \sum\limits_{n=1}^N ||\textbf y(\textbf x_n, \textbf w) - \textbf t_n||^2$

If we call the solution to the minimisation of the above $\textbf w_{ML}$ (since this is a Maximum Likelihood solution), we can find the value of $\beta$ for the distribution:

$\frac 1 {\beta_{ML}} = \frac 1 {NK} \sum\limits_{n=1}^N||\textbf y(\textbf x_n, \textbf w_{ML}) - \textbf t_n||^2$

Let’s look at the classification problem now. First of all, suppose I have only two classes, $\mathcal C_0$ and $\mathcal C_1$, to which a given input can belong. My training set consists of input-class pairs, so the target values are $t_n \in \{0, 1\}$. Then we can express the probability that a given input is in class $\mathcal C_1$ by:

\begin{aligned} P(\mathcal C_1|\textbf x_n) &= \frac {P(\textbf x_n | \mathcal C_1) P(\mathcal C_1)} { P(\textbf x_n | \mathcal C_0) P(\mathcal C_0) + P(\textbf x_n | \mathcal C_1) P(\mathcal C_1)} \\ &= \frac {1} {1 + \exp(-a)} \\ &= \sigma(a) \end{aligned}

where $a = \ln{\frac {p(\boldsymbol x|\mathcal C_1)p(\mathcal C_1)} {p(\boldsymbol x|\mathcal C_0)p(\mathcal C_0)}}$ (which is the total evidence for class $\mathcal C_1$) and $\sigma(x)$ is known as the sigmoid function:

Then if the activation function of the output node (singular, since a two-class classifier only needs a single output) is a sigmoid, $f^{(M)}(x) = \sigma(x)$, we can interpret the output of the network itself, $0 \leq y(\textbf x_n, \textbf w) \leq 1$, as the conditional probability that input vector $\textbf x_n$ is in class $\mathcal C_1$ (with the probability that it’s in class $C_0$ given by $1 - y(\textbf x_n, \textbf w)$), and:

$p(t_n | \textbf x_n, \textbf w) = y(\textbf x_n, \textbf w)^{t_n} (1 - y(\textbf x_n, \textbf w))^{1-t_n}$

The negative log likelihood for that, which we can use as our error function, is:

$E(\textbf w) = -\sum\limits_{n=1}^N \{ t_n \ln y(\textbf x_n, \textbf w) + (1-t_n) \ln (1 - y(\textbf x_n, \textbf w)) \}$

In the case of $K$ independent binary classifications, the error function becomes:

$E(\textbf w) = -\sum\limits_{n=1}^N\sum\limits_{k=1}^K \{ t_{nk} \ln y_k(\textbf x_n, \textbf w) + (1-t_{nk}) \ln (1 - y_k(\textbf x_n, \textbf w)) \}$

Now, in the case of $K \geq 2$ different classes the input can belong to, we can use something we call a $1\text{-of-}K$ coding scheme, in which we have $K$ target variables $t_k$, with $t_{k}[\textbf x_n] = 1$ if $\textbf x_n$ is in class $k$ and $0$ otherwise. In that case, the probability that $\textbf x_n$ is in class $k$ is:

\begin{aligned} P(\mathcal C_k | \textbf x_n) &= \frac{P(\textbf x_n|\mathcal C_k) P(\mathcal C_k)} {\sum_j P(\textbf x_n|\mathcal C_j) P(\mathcal C_j)} \\ &= \frac {\exp(a_k)} {\sum_j \exp(a_j)} \end{aligned}

Where $a_k = \ln{\left(P(\textbf x_n|\mathcal C_k) P(\mathcal C_k) \right)}$ and the function above is known as the softmax function. If we follow the same procedure as before, setting the activation function of the $K$ output nodes to the softmax, our error function becomes

$E(\textbf w) = -\sum\limits_{n=1}^N\sum\limits_{k=1}^K t_{nk} \ln y_k(\textbf x_n, \textbf w)$

where $t_{nk} = t_k[\textbf x_n]$.

So now we have a few possible ways to measure the error we’re making when we perform regression or try to classify an input. How do we then train the network parametres $\textbf w$ to minimise that error? In other words, how do we learn?

Well, thinkin’ about the maths of the thing, it’s a matter of optimisation: we need to find a minimum of the error function, a point $\textbf w$ where $\nabla E(\textbf w) = 0$ and $\nabla \nabla E(\textbf w)$ is positive-definite. Except that only finds a local minimum for the function, and in ANNs, there are lots of those.

Suppose, for instance, that the activation functions of a given layer are all even. In that case, if we invert the sign of all of that layer’s weights, we find a network that’s literally the exact same network as before. And in full generality, there are many such symmetries in an arbitrary ANN, which means that for any point $\textbf w$ in our search space there are many other points that are equivalent to it, and in particular there are many equivalent local and global minima. Any learning algorithm for an ANN has to deal with this.

But for now, we’ll suppose it’s been dealt with. Nonlinear optimisation is a fascinating field unto itself, let’s not steal their cake. However, there’s still more our ANN can do for us, to aid in our nonlinear optimisation.

In general nonlinear optimisation needs the derivatives of the function it’s trying to optimise, or at least its gradient, and to find the gradient of the error function $\nabla E(\textbf w)$, we need to find

$\frac {\partial E(\textbf w)} {\partial w_{ji}^{(n)}}$

for all weights $w_{ji}^{(n)}$ of the weight vector $\textbf w$ (since it’s the weights we’ll be tweaking). Applying the chain rule for partial derivatives gives us:

$\frac {\partial E(\textbf w)} {\partial w_{ji}^{(n)}} = \frac {\partial E(\textbf w)} {\partial a_i^{(n)}} \frac {\partial a_i^{(n)}} {\partial w_{ji}^{(n)}}$

Let’s go back to our definition of the input of the $i^{th}$ node of the $n^{th}$ layer of a network:

$a_i^{(n)} = \sum_jw_{ji}^{(n)}z_j^{(n-1)}$

Where the $z_j^{(n-1)}$ are the outputs of the previous layer. Then:

$\frac {\partial a_i^{(n)}} {\partial w_{ji}^{(n)}} = z_j^{(n-1)}$

As for the other term, let’s call it:

$\frac {\partial E(\textbf w)} {\partial a_i^{(n)}} = \delta_i^{(n)}$

Whence:

$\frac {\partial E(\textbf w)} {\partial w_{ji}^{(n)}} = \delta_i^{(n)} z_j^{(n-1)}$

We can easily calculate the $z_j^{(n-1)}$ just by evaluating an input on the network. Now how do we calculate the $\delta_i^{(n)}$?

At the last layer of the network, $M$, the output of the $i^{th}$ node is $\textbf y_i(\textbf X, \textbf w)$ (remember that $\textbf X$ is the matrix with all the training inputs given and $\textbf y_i$ is the $i^{th}$ line of the output matrix $\textbf Y$), and any good error function will have the form $g(\textbf Y(\textbf X, \textbf w), \textbf t)$, i.e. it will be a function of the outputs on a given input (or set of inputs) and the respective(s) target values. Furthermore, from the way the ANN is built, we have that:

$\textbf y_i = \textbf f_i^{(M)}\left( a_i^{(M)}\right)$

So, in general, in a network with $K$ outputs, we’ll have:

\begin{aligned} \delta_i^{(M)} &\equiv \frac {\partial E(\textbf w)} {\partial a_{i}^{(M)}} \\ &= \frac {\partial g\left(\textbf f_1^{(M)}\left( a_1^{(M)}\right) , ...,\textbf f_K^{(M)}\left( a_K^{(M)}\right) , t_1, ..., t_K \right) } {\partial a_i^{(M)}} \\ &= g'_i \textbf f'^{(M)}_i \end{aligned}

Where $g'_i$ is the partial derivative of $g$ with respect to its $i^{th}$ component, whereas $\textbf f'^{(M)}_i$ is the derivative of $\textbf f^{(M)}_i$.

What about the $\delta_i^{(n)}$ in layers that are not the output layer? Well, we can use the chain rule again:

\begin{aligned} \delta_i^{(n)} \equiv \frac {\partial E(\textbf w)} {\partial a_i^{(n)}} = \sum_j \frac {\partial E(\textbf w)} {\partial a_j^{(n + 1)}} \frac {\partial a_j^{(n + 1)}} {\partial a_i^{(n)}} \end{aligned}

where the sum ranges over all nodes $j$ of the $(n+1)^{th}$ layer of the network to which node $i$ of the $n^{th}$ layer sends connections. The first term in the sum of the right-hand side is just $\delta_j^{(n+1)}$. The second, though, will use our definition above:

\begin{aligned} a_j^{(n + 1)} &= \sum_iw_{ij}^{(n + 1)}z_i^{(n)} \\ &= \sum_i w_{ij}^{(n + 1)} f_i^{(n)}\left( a_i^{(n)} \right) \end{aligned}

Whence:

\begin{aligned} \delta_i^{(n)} &= \sum_j w_{ij}^{(n+1)} \delta_j^{(n+1)} f'^{(n)}_i \left( a_i^{(n)} \right) \\ &= f'^{(n)}_i \left( a_i^{(n)} \right) \sum_j w_{ij}^{(n+1)} \delta_j^{(n+1)} \end{aligned}

This suggests a very nifty way of calculating the derivatives of the error function with respect to the weights, so that we can calculate its gradient and learn, and is the basis of the error backpropagation algorithm. A very appropriate name, too. Intuitively, we calculate the $\delta$ of the output nodes, which can be calculated directly and easily; then, we use those to calculate the $\delta$ of the previous layer; and so on, until we have calculated all of them for the entire network and we have the value of whichever derivatives of the error function we might like.

Furthermore, in the cases of the error functions we’re interested in, they’re actually sums of the errors over all the inputs, i.e. $E(\textbf w) = \sum_n E_n(\textbf w)$ where $E_n$ is the error evaluated only on input $\textbf x_n$. This suggests then our algorithm:

Error Backpropagation Algorithm

1. Apply an input vector $\textbf x_n$ to the network and “forward propagate” through the network (i.e. calculate the outputs of all nodes of the network on that vector);
2. Evaluate the $\delta$ of the output nodes;
3. “Backpropagate” the $\delta$ through the network;
4. Use them to calculate the derivatives of the error function.

What you do, after you’ve calculated this value for all inputs, is more general, and there are several different algorithms invented to deal with that, and like I said we fall into the field of nonlinear optimisation, which is definitely not what this post is about. Still, it is quite interesting to note the way networks help with learning and optimisation and stuff.

When I understand a proof of the universal approximation theorem I’ll write a post explaining it, but until then, here’s a link.