I originally wrote this piece as a response to a question on Quora. It proved to be the most popular response in that thread and was featured in a Quora top picks email digest, so I decided to re-blog it here. Link to my original response on Quora.
An artificial neural network (NN for short) is a classifier. In supervised machine learning, classification is one of the most prominent problems. The aim is to sort objects into classes that are defined a priori (terminology not to be confused with object-oriented programming). Classification has a broad range of applications, for example:
- in image processing we may seek to distinguish images depicting different kinds (classes) of objects (e.g. cars, bikes, buildings etc.),
- in natural language processing (NLP) we may seek to classify texts into categories (e.g. distinguish texts that talk about politics, sports, culture etc.),
- in financial transactions processing we may seek to decide if a new transaction is legitimate or fraudulent.
The term “supervised” refers to the fact that the algorithm is previously trained with “tagged” examples for each category (i.e. examples whose classes are made known to the NN) so that it learns to classify new, unseen ones in the future. We will see how training a NN works in a bit.
In simple terms, a classifier accepts a number of inputs, which are called features and collectively describe an item to be classified (be it a picture, text, transaction or anything else as discussed previously), and outputs the class it believes the item belongs to. For example, in an image recognition task, the features may be the array of pixels and their colors. In an NLP problem, the features may be the words in a text. In finance, the features may be several properties of each transaction, such as the time of day, the cardholder’s name, the billing and shipping addresses, the amount etc.
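To make "features" a bit more tangible, here is a minimal sketch of how a text could be turned into a list of numbers that a classifier can consume. The tiny vocabulary, the function name `text_to_features` and the two tagged example texts are all made up for illustration; real systems use far larger vocabularies and more refined schemes.

```python
from collections import Counter

# Hypothetical vocabulary: the words we choose to count (an assumption).
VOCABULARY = ["election", "goal", "match", "parliament", "museum"]


def text_to_features(text):
    """Describe a text by how often each vocabulary word appears in it."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in VOCABULARY]


# "Tagged" examples: each text comes with the class it really belongs to.
tagged_examples = [
    ("the parliament debated the election", "politics"),
    ("a late goal decided the match", "sports"),
]

for text, label in tagged_examples:
    print(text_to_features(text), "->", label)
```

Each text ends up as one number per vocabulary word, and it is this list of numbers, not the raw text, that gets fed to the classifier.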
It is important to understand that we assume an underlying real relationship between the characteristics of an item and the class it belongs to. The goal of running a NN is: given a number of examples, try to come up with a function that resembles this real relationship. (Of course, you’ll say: you are geeks, you are better with functions than relationships!) This function is called the predictive model, or just the model, because it is a practical, simplified version of how items with certain features belong to certain classes in the real world. Get comfy with the word “function”, as it comes up quite often; it is a useful abstraction for the rest of the conversation (no maths involved). You might be interested to know that a big part of the work that Data Scientists do (the dudes that work on such problems) is to figure out exactly which features best describe the entities of the problem at hand, which is similar to asking which characteristics seem to distinguish items of one class from those of another. This process is called feature selection.
A NN is a structure used for classification. It consists of several components interconnected and organized in layers. These components are called artificial neurons (ANs) but we often refer to them as units. Each unit is itself a classifier, only a simpler one whose ability to classify is limited when used for complex problems. It turns out that we can overcome many of the limitations of simple classifiers by interconnecting a number of them to form powerful NNs. Think of it as an example of the principle Unite and Lead.
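A classic illustration of this "stronger together" idea is the XOR problem (output 1 when exactly one of two inputs is 1). No single threshold unit can compute it, but three of them wired together can. The sketch below uses hand-picked weights purely for illustration (nothing is learned here), and the names `threshold_unit` and `xor_network` are my own.

```python
def threshold_unit(inputs, weights, bias):
    """A very simple classifier: outputs 1 if the weighted sum exceeds 0."""
    total = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1 if total > 0 else 0


def xor_network(x1, x2):
    """Three simple units combined into a tiny two-layer network."""
    # Hand-picked, illustrative weights (an assumption, not learned values).
    h_or = threshold_unit([x1, x2], [1, 1], -0.5)   # 1 if at least one input is 1
    h_and = threshold_unit([x1, x2], [1, 1], -1.5)  # 1 only if both inputs are 1
    # Output unit: "OR but not AND", which is exactly XOR.
    return threshold_unit([h_or, h_and], [1, -1], -0.5)


for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", xor_network(a, b))
```

Each unit on its own only draws a single straight dividing line; it is the combination of the two hidden units by the output unit that produces the more complex decision.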
This structure, in which a combination of inputs goes through the artificial neuron, resembles the functionality of a physical neuron in the brain, hence the name. In the following picture the structures of a physical and an artificial neuron are compared. The AN is shown as two nodes to illustrate its internals: an AN combines the inputs and then applies what is called the activation function (depicted as an S-curve), but it is usually represented as one node, as above.
- The inputs of the AN correspond to the dendrites,
- the AN itself (sum + activation) to the body/nucleus and
- the output to the axon.
Moreover, neurons in the brain are connected in networks as well: the axon of one neuron connects via synapses to the dendrites of neighbouring neurons.
The analogy goes deeper, as neurons are known to provide the human brain with a “generic learning algorithm”: by re-wiring various types of sensory data to a brain region, the same region can learn to recognize different types of input. E.g. a brain region that normally handles one sense can learn to process input from another sense after the appropriate sensory re-wiring. This has been confirmed experimentally on ferrets, where visual input re-routed to the auditory cortex was processed by that region.
Similarly, ANs organized in NNs provide a generic algorithm that is in principle capable of learning to distinguish any classes. So, going back to the example applications at the beginning of this post, you can use the same NN principles to classify pictures, texts or transactions. For a better understanding, read on.
However, no matter how deep the analogies feel and how beautiful they are, bear in mind that NNs are just a bio-inspired algorithm. They don’t really model the brain, the functioning of which is extremely complicated and, to a high degree, unknown.
At this point you must be wondering what on earth an activation function is. In order to understand this we need to recall what a NN tries to compute: an output function (the model) that takes an example described by its features as an input and outputs the likelihood that the example falls into each one of the classes. What the activation function does is take as an input the weighted sum of a unit’s inputs (for the first layer, the feature values) and transform it into a form that can be used as a component of the output function. When multiple such components from all the ANs of the network are combined, the goal output function is constructed.
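Here is a minimal sketch of a single unit doing exactly that: combine the inputs into a weighted sum, then pass the sum through an activation function. I am assuming the sigmoid as the activation and adding a bias term, which is common practice but not something discussed above; the feature values and weights are arbitrary numbers chosen for illustration.

```python
import math


def sigmoid(z):
    """S-shaped activation: squashes any number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def artificial_neuron(features, weights, bias):
    """Step 1: weighted sum of the inputs. Step 2: apply the activation."""
    weighted_sum = sum(f * w for f, w in zip(features, weights)) + bias
    return sigmoid(weighted_sum)


# Arbitrary illustrative values (assumptions, not taken from a trained model).
features = [0.5, 1.0, -1.5]
weights = [0.8, -0.4, 0.3]
bias = 0.1

print(artificial_neuron(features, weights, bias))
```

The output is always a number between 0 and 1, which is what lets it be read as a likelihood or passed on as an input to the units of the next layer.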
Historically the S-curve (aka the sigmoid function) has been used as the activation function in NNs, in which case we are talking about Logistic Regression units (although better functions are now known). This choice relates to yet another biologically inspired analogy. Before explaining it, let’s see first how it looks (think of it as what happens when you can’t get the temperature in the shower right: first it’s too cold despite larger adjustment attempts and then it quickly turns too hot with smaller adjustment attempts):
Now the bio analogy: brain neurons activate more frequently as their electric input stimuli increase. The relationship between the activation frequency and the input voltage is an S-curve. However, the S-curve is more pervasive in nature than just that: it shows up in many kinds of transitions between two states.
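If you would like to see the S-shape for yourself, the following sketch tabulates the sigmoid at a few inputs and draws a crude text bar for each value. The sample points and the bar width of 40 characters are arbitrary choices of mine.

```python
import math


def sigmoid(z):
    """Logistic function, arranged so that exp() never overflows."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


# Flat near 0 on the left, a steep rise around z = 0, flat near 1 on the right.
for z in [-6, -4, -2, 0, 2, 4, 6]:
    value = sigmoid(z)
    bar = "#" * round(value * 40)
    print(f"{z:>3}  {value:.3f}  {bar}")
```

This is the shower effect in numbers: far from zero, even large changes in the input barely move the output, while around zero small changes move it a lot.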
As mentioned, a NN is organized in layers of interconnected units (in the following picture layers are depicted with different colors).
- Its input layer consists of a number of units that depends on the number of input features. Features are engineered to describe the class instances, be they images, texts, transactions etc., depending on the application. For example, in an image recognition task, the features may be the array of pixels and their colors.
- Its output layer often consists of a number of units equal to the number of classes in the problem. When given a new, unseen example, each unit of the output layer assigns a probability that this example belongs to that unit’s class, based on the network’s training.
- Between the input and output layers, there may be several hidden layers (for reasons briefly described next), but for many problems one or two hidden layers are enough.
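The layered structure described above can be sketched as a "forward pass": the features enter at the input layer, flow through a hidden layer, and come out of the output layer as one probability per class. The layer sizes (4 features, 3 hidden units, 2 classes) and the random weights are assumptions made for illustration, and I use a softmax step so that the output probabilities add up to 1, which is one common choice among several.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def softmax(scores):
    """Turn raw output scores into probabilities that sum to 1."""
    biggest = max(scores)  # subtracting the max keeps exp() well behaved
    exps = [math.exp(s - biggest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def random_layer(n_inputs, n_units, rng):
    """One row of weights per unit; the last weight in each row is a bias."""
    return [[rng.uniform(-1, 1) for _ in range(n_inputs + 1)]
            for _ in range(n_units)]


def layer_sums(layer, inputs):
    """Weighted sum of the inputs (plus bias) for every unit in a layer."""
    return [sum(w * x for w, x in zip(row, inputs + [1.0])) for row in layer]


def forward(features, hidden_layer, output_layer):
    hidden = [sigmoid(z) for z in layer_sums(hidden_layer, features)]
    return softmax(layer_sums(output_layer, hidden))


rng = random.Random(0)  # fixed seed so the sketch is repeatable
hidden_layer = random_layer(n_inputs=4, n_units=3, rng=rng)
output_layer = random_layer(n_inputs=3, n_units=2, rng=rng)

probabilities = forward([0.2, 0.7, 0.1, 0.9], hidden_layer, output_layer)
print(probabilities)  # one probability per class
```

With random weights these probabilities are meaningless, of course; finding weights that make them meaningful is what training is for.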
Training is often done with the Back Propagation algorithm. During BackProp, the NN is fed with examples of all classes. As mentioned, the training examples are said to be “tagged”, meaning that the NN is given both the example (as described by its features) and the class it really belongs to. Given many such training examples, the NN constructs, during training, what we know by now as the model, i.e. a probabilistic mapping of certain features (input) to classes (output). The model is reflected in the weights of the connections between units (see previous figure); BackProp’s job is to compute these weights. Based on the constructed model, the NN will classify new untagged examples (i.e. instances that it has not seen during training), i.e. it will predict the probability of a new example belonging to each class. Therefore there are fundamentally two distinct phases:
- During training, the NN is fed with several tagged examples from which it constructs its model.
- During testing, the NN classifies new, unknown instances into the known classes, based on the constructed model.
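The two phases can be sketched with a tiny network trained by back propagation. Everything specific here is an assumption chosen for illustration: the XOR-style toy data, one hidden layer of 3 sigmoid units, a single sigmoid output unit, cross-entropy as the error measure, and the learning rate, epoch count and seed. Treat it as a bare-bones sketch rather than how a production library does it.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, w_hidden, w_out):
    """Return the hidden activations and the output probability for x."""
    hidden = [sigmoid(sum(w * v for w, v in zip(row, x + [1.0])))
              for row in w_hidden]
    output = sigmoid(sum(w * v for w, v in zip(w_out, hidden + [1.0])))
    return hidden, output


def mean_loss(examples, w_hidden, w_out):
    """Average cross-entropy between predictions and the tagged classes."""
    total = 0.0
    for x, y in examples:
        _, p = forward(x, w_hidden, w_out)
        p = min(max(p, 1e-12), 1 - 1e-12)  # keep log() away from 0
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(examples)


def init_weights(n_inputs, n_hidden, seed=1):
    """Small random weights; the last weight of every unit is its bias."""
    rng = random.Random(seed)
    w_hidden = [[rng.uniform(-1, 1) for _ in range(n_inputs + 1)]
                for _ in range(n_hidden)]
    w_out = [rng.uniform(-1, 1) for _ in range(n_hidden + 1)]
    return w_hidden, w_out


def train(examples, w_hidden, w_out, epochs=5000, rate=0.5):
    """Back propagation with one weight update per tagged example."""
    for _ in range(epochs):
        for x, y in examples:
            hidden, p = forward(x, w_hidden, w_out)
            # Error at the output unit (for sigmoid plus cross-entropy).
            delta_out = p - y
            # Propagate the error back to every hidden unit.
            delta_hidden = [delta_out * w_out[j] * h * (1 - h)
                            for j, h in enumerate(hidden)]
            # Nudge each weight against its share of the error.
            for j, v in enumerate(hidden + [1.0]):
                w_out[j] -= rate * delta_out * v
            for j, row in enumerate(w_hidden):
                for i, v in enumerate(x + [1.0]):
                    row[i] -= rate * delta_hidden[j] * v


# Training phase: tagged examples (features, class).
tagged = [([0.0, 0.0], 0), ([0.0, 1.0], 1), ([1.0, 0.0], 1), ([1.0, 1.0], 0)]
w_hidden, w_out = init_weights(n_inputs=2, n_hidden=3)
loss_before = mean_loss(tagged, w_hidden, w_out)
train(tagged, w_hidden, w_out)
loss_after = mean_loss(tagged, w_hidden, w_out)
print("loss before:", loss_before, "loss after:", loss_after)

# Testing phase: untagged inputs the network has not seen during training.
for x in ([0.1, 0.9], [0.9, 0.8]):
    print(x, "-> probability of class 1:", forward(x, w_hidden, w_out)[1])
```

Note that the only thing training changes is the weights; the structure of the network stays the same, and classifying a new example is just one forward pass with the learned weights.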
NNs with multiple layers of perceptrons (deep neural networks) are powerful classifiers, in that we can use them to model very complex, non-linear classification patterns of instances that may be described by potentially several thousands of features. Depending on the application, such patterns may or may not be detectable by humans (e.g. the human brain is very good at image recognition but is not effective in tasks such as making predictions by generalizing historical data in complex, dynamic contexts).