Of the many, many different ideas put forward to explain, model, and experiment with artificial cognition, one of the most practically successful is the Artificial Neural Network. Not because we’ve managed to actually create cognition with one, but because of how useful it is for Machine Learning and Pattern Recognition. If you’ve ever heard of the latest buzz about “Deep Learning,” it has to do with that.

The inspiration, as you might have guessed, are biological neural networks, such as brains. The ANN can be seen as a collection of “neurons,” basic units of calculation, organised in many interconnected layers, which takes a vector of inputs and combines them many times, applying transformations to each such combination, until it reaches the output layer with the output of the computation.

In the above example, we have the inputs , , and . Then, each hidden node combines them (each with its own weight) and applies a (generally nonlinear) transformation on them. After that, each output node does the same, with its own transformation applied, and giving us its result in the end.

The networks don’t have to be fully connected like in the above case, we could remove some arrows. They also don’t have to only have two layers (not counting the inputs), and can have as many layers as we like (though it can be easily shown that if all the transformations are linear then the entire network is equivalent to a single-layer network). And while that might look all interesting and such, at first glance it may not look particularly useful – at least, not without many many layers and thousands upon thousands of neurons, so we could approximate some part of a brain, for instance.

But there are some *very interesting results* about ANNs. One of the most astonishing is that, if the transformations applied by the nodes fall within a certain class of functions, then we need only two layers in order to be able to approximate any continuous function to arbitrary precision (though how many nodes we’d need for that is undefined), and only three layers to be able to approximate any mathematical function at all (same). And, like I said, ANNs are very useful for Machine Learning, which implies that they can *learn* these functions.

Let’s dig deeper.

More formally, an Artificial Neural Network is defined by three types of parametres:

- The interconnection pattern between the different layers of neurons;
- The learning process for updating the weights of the interconnections;
- The activation function that converts a neuron’s weighted input to its output activation.

In full generality, let be the output of the node of the layer. Then it is given by the following recursive relations:

where:

- is the input node;
- is called an
*activation function*for the node of the layer of the network; - is a linear combination of all the outputs of the layer and the input of the node of the layer;
- is the
*weight*of the connection between the node of the layer and the node of the layer, except for , which is the*bias*of that node, and is multiplied by the constant .

We will further define:

- as the input
*vector*, so that contains the values of each input node at that vector; - as the matrix containing a set of inputs to the network (so the column is );
- as the vector with all the weights for all possible combinations of , , and ;
- as the value of the output of the network when evaluated on the input vector with weight vector ;
- as the output vector that contains the values of each output node of the network when evaluated on the input vector with weight vector ;
- as the output
*matrix*that contains the values of each output vector, the column given by and the line given by \textbf y_i(\textbf X, \textbf w) is an -dimensional vector with the values of the output when each of the are given as input.

From the above definitions, it follows that, if our ANN has layers, then (where the brackets indicate that this is the value of that node of the network when its input vector is ).

The functions define the type-3 parametres of the network, its activation functions. Some of the are constrained to be , and those constraints define which connections are *absent* from the network, so that all others define connections that do exist; these, together with the activation functions (which can sometimes be the identity and serve as a “skip-layer” function), determine the interconnection pattern between the different layers. So we only need to define the learning process that will turn a fun abstract mathematical object into a powerful regression and classification algorithm.

We will use the convention that each of the activation functions of all nodes of the layer is either equal to the same function or equal to the identity so as to serve as a “skip-layer” function.

In a future post I might explain in more depth what exactly Machine Learning algorithms are concerned with, but until I do, here’s a quick rundown.

Machine Learning is typically concerned with general algorithms that can look at a corpus of data, learn from it, and then be able to make predictions. It can be divided into a few categories, and two of them are what we call “classification” and “regression.”

- In a classification problem, a given datum can belong to one of a few classes, and after observing many data labelled with the correct classes, it’s the algorithm’s task to predict to which class new data will belong;
- Regression problems are similar, except instead of a datum belonging to one of a discrete set of classes, it’s an input to a function. Then, we have many data with the appropriate input-output information, and we want to approximate the function that would generate those pairs.

Neural Nets are *awesome* at this.

Let’s look at a regression problem, first. In addition to the definitions discussed above, let be the *target vector* for input . This target vector is a value that only exists when we’re training our network, and we’re using input-output pairs to train it, so that we know what outputs we “want” a given input to give.

Furthermore, we’ll assume that this output is noisy; in other words, there is some function that we’re trying to approximate with our such that where is some form of noise that we’re trying to filter out.

As a concrete example, suppose our input vectors consist in a single real number , and the outputs associated with them (i.e. the “target values” ) are also single real numbers () distributed according to the little squares in the picture:

Then we can talk about a probability . Since is our approximation to the true function, one possible model for the noisy parametre is:

where is the precision of the noise. So, since we know that there’s noise in the data, we’re not really trying to predict the values *exactly*, but rather somehow to minimise the *error* we’d be making by using our function . In the more general, multivariate case, we could have

where we’re assuming the noise of each target value for our input vector is independent of the noises of the other values.

We want our network to approximate the function that associates our with their desired outputs , but we can’t touch the outputs and the inputs themselves; we can only change our weights . One possible way, then, to minimise the error our network will be making is to take the likelihood function,

where is the matrix with the target values for each input in the training set, and try to maximise it. That happens to be equivalent to minimising the inverse of the above, or, even better, minimising the negative of the logarithm of the above. Then we have an error function:

where is the total number of target values. Since, like I said, the only part under our control are the weights , what I really need to minimise is the sum-of-squares function:

If we call the solution to the minimisation of the above (since this is a Maximum Likelihood solution), we can find the value of for the distribution:

Let’s look at the classification problem now. First of all, suppose I have only two classes, and , to which a given input can belong. My training set consists of input-class pairs, so the target values are . Then we can express the probability that a given input is in class by:

where (which is the total evidence for class ) and is known as the sigmoid function:

Then if the activation function of the output node (singular, since a two-class classifier only needs a single output) is a sigmoid, , we can interpret the output of the network itself, , as the conditional probability that input vector is in class (with the probability that it’s in class given by ), and:

The negative log likelihood for that, which we can use as our error function, is:

In the case of independent binary classifications, the error function becomes:

Now, in the case of different classes the input can belong to, we can use something we call a coding scheme, in which we have target variables , with if is in class and otherwise. In that case, the probability that is in class is:

Where and the function above is known as the *softmax* function. If we follow the same procedure as before, setting the activation function of the output nodes to the softmax, our error function becomes

where .

So now we have a few possible ways to measure the error we’re making when we perform regression or try to classify an input. How do we then train the network parametres to minimise that error? In other words, how do we *learn*?

Well, thinkin’ about the maths of the thing, it’s a matter of optimisation: we need to find a minimum of the error function, a point where and is positive-definite. Except that only finds a *local* minimum for the function, and in ANNs, there are *lots* of those.

Suppose, for instance, that the activation functions of a given layer are all even. In that case, if we invert the sign of *all* of that layer’s weights, we find a network that’s literally the exact same network as before. And in full generality, there are many such symmetries in an arbitrary ANN, which means that for any point in our search space there are many other points that are equivalent to it, and in particular there are many equivalent local *and global* minima. Any learning algorithm for an ANN has to deal with this.

But for now, we’ll suppose it’s been dealt with. Nonlinear optimisation is a fascinating field unto itself, let’s not steal their cake. However, there’s still more our ANN can do for us, to aid in our nonlinear optimisation.

In general nonlinear optimisation needs the derivatives of the function it’s trying to optimise, or at least its gradient, and to find the gradient of the error function , we need to find

for all weights of the weight vector (since it’s the weights we’ll be tweaking). Applying the chain rule for partial derivatives gives us:

Let’s go back to our definition of the input of the node of the layer of a network:

Where the are the outputs of the previous layer. Then:

As for the other term, let’s call it:

Whence:

We can easily calculate the just by evaluating an input on the network. Now how do we calculate the ?

At the last layer of the network, , the output of the node is (remember that is the matrix with all the training inputs given and is the line of the output matrix ), and any good error function will have the form , i.e. it will be a function of the outputs on a given input (or set of inputs) and the respective(s) target values. Furthermore, from the way the ANN is built, we have that:

So, in general, in a network with outputs, we’ll have:

Where is the partial derivative of with respect to its component, whereas is the derivative of .

What about the in layers that are *not* the output layer? Well, we can use the chain rule again:

where the sum ranges over all nodes of the layer of the network to which node of the layer sends connections. The first term in the sum of the right-hand side is just . The second, though, will use our definition above:

Whence:

This suggests a very nifty way of calculating the derivatives of the error function with respect to the weights, so that we can calculate its gradient and learn, and is the basis of the *error **backpropagation algorithm*. A very appropriate name, too. Intuitively, we calculate the of the output nodes, which can be calculated directly and easily; then, we use those to calculate the of the previous layer; and so on, until we have calculated all of them for the entire network and we have the value of whichever derivatives of the error function we might like.

Furthermore, in the cases of the error functions we’re interested in, they’re actually *sums* of the errors over all the inputs, i.e. where is the error evaluated only on input . This suggests then our algorithm:

Error Backpropagation Algorithm

- Apply an input vector to the network and “forward propagate” through the network (i.e. calculate the outputs of all nodes of the network on that vector);
- Evaluate the of the output nodes;
- “Backpropagate” the through the network;
- Use them to calculate the derivatives of the error function.

What you *do*, after you’ve calculated this value for all inputs, is more general, and there are several different algorithms invented to deal with that, and like I said we fall into the field of nonlinear optimisation, which is definitely *not* what this post is about. Still, it is quite interesting to note the way networks *help* with learning and optimisation and stuff.

When I understand a proof of the universal approximation theorem I’ll write a post explaining it, but until then, here’s a link.