CNNs, a Quick Guide for Newbies
1.0 Why CNN?
In the last decade, Convolutional Neural Networks (CNNs or ConvNets) have solved complex problems in vision (image classification, object detection, object segmentation, among others), often outperforming humans.
While these algorithms have traditionally been applied to spatially organized inputs (e.g. images), a recent study has demonstrated that they can also outperform Recurrent Neural Networks (RNNs) on sequential tasks, such as those that are involved in Natural Language Processing (NLP).
These results, together with the fact that CNNs are faster than RNNs (they benefit from parallelization), make these algorithms central to AI. Understanding how they work is therefore fundamental for anyone interested in the field. Let’s try to explore them in this short tutorial for newbies.
Disclaimer: The tutorial has been written quickly and it may contain errors and/or typos. Please feel free to report them or contact me for improvements. Thank you!
2.0 Looking Back
The name Convolutional Neural Networks refers to a class of deep learning algorithms that were initially developed in the 1980s, inspired by the discoveries of Hubel and Wiesel about the cats’ visual cortex.
Receptive Fields in Cats’ Visual Cortex
In 1962, the neurologists Hubel and Wiesel (who would later share a Nobel prize for this work) noticed that single neurons in the cat’s visual cortex fired when specific regions of the visual field were stimulated (see video). They called these regions receptive fields, and noticed that neighboring cells had similar or partially overlapping behaviors. In 1968, Hubel and Wiesel identified two types of visual cells, namely simple cells (which fire when specifically oriented edges appear in their receptive field) and complex cells (which have larger receptive fields). The neurologists combined these two types of cells in a cascading model for pattern recognition.
In 1980, Kunihiko Fukushima wrote “Neocognitron: A Self-organizing Neural Network Model for a Mechanism of Pattern Recognition Unaffected by Shift in Position”, in which he proposed a neural network that recognizes image patterns by relying on two types of layers: convolutional and downsampling layers. In his proposal, the convolutional layers use receptive fields (i.e. weighted tensors) to filter patches of the input, while the downsampling layers use receptive fields to calculate the average of the patches. In Fukushima’s idea, the downsampling layers generalize what is identified by the convolutional layers, allowing the recognition of complex patterns even when they are slightly different, shifted, rotated, etc.
In 1989, in “Backpropagation Applied to Handwritten Zip Code Recognition”, Yann LeCun improved the design of the Neocognitron to use backpropagation, so that the filters could be learned directly from annotated data. This opened the possibility of applying CNNs to a broad range of data and tasks.
In 1993, Weng and colleagues, in “Learning recognition and segmentation of 3-D objects from 2-D images”, suggested using the max rather than the average function in the downsampling layer, calling their approach max-pooling.
Further improvements in CNN design and training practices were introduced in a series of papers published between the end of the 1990s and the beginning of the 2000s [1, 2, 3]. However, these algorithms only became widely popular in 2012, when Alex Krizhevsky used a deep CNN-based system to win the ImageNet competition.
In 2013, Matt Zeiler and Rob Fergus published “Visualizing and Understanding Convolutional Networks” and showed that deeper layers have larger receptive fields. This means that deeper layers are able to consider more contextual information from the original input tensor, and are therefore able to recognize larger and more complex patterns.
3.0 Dissecting a Convolutional Neural Network
Given a 2- or 3-dimensional input tensor, CNNs perform three operations: 1) convolution, which consists of sliding filters over the input and saving the sum of the element-wise multiplication between the filter values and the input values; 2) non-linear projection, which passes the activations through a non-linear function to suppress noise and amplify the important signals; and 3) pooling, which downsamples the activation map by passing only the average, L2-norm or maximum value of every window to the next layer. The latter step is meant to reduce the number of parameters and to make the model generalize better.
CNNs take an n-dimensional tensor (e.g. a matrix) as input and return another m-dimensional tensor (e.g. a matrix) as output. The output of a CNN is generally passed either to another module (including another CNN) or to a dense layer, which can finally be fed into a SoftMax function to compute the probability distribution over the task labels.
In image classification tasks, inputs consist of either 2-dimensional (black and white pictures) or 3-dimensional (colored pictures) tensors. The 2-dimensional tensors can be imagined as matrices, whose values describe the grayscale intensity of each pixel (Figure 3). The 3-dimensional tensors do not substantially differ from the 2-dimensional ones. They only contain an extra dimension (i.e. depth or channels), whose size is generally 3 and which indexes three matrices, called R, G, and B (Red, Green and Blue), each of which contains values describing the respective color intensity of each pixel (Figure 4).
The first operation a CNN performs on the input tensor is called convolution, and it consists of sliding a filter (or kernel) with the same depth as the input (e.g. one or three channels) along the tensor. Every filter can be thought of as a receptive field that is sensitive to some specific pattern. In practice, every filter is an array of numbers (or weights or parameters) and, as it slides along the input tensor, it computes the sum of the element-wise multiplication between its values and those in the tensor, producing scalars that are then stored in the output tensor (or activation map or feature map). See Figure 5.
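To make the sliding-filter idea concrete, here is a minimal NumPy sketch of a single-channel convolution (an illustration, not a library function), using a made-up 3x3 vertical-edge filter:

```python
import numpy as np

def convolve2d(image, kernel, stride=1):
    """Slide `kernel` over `image` and return the activation map.

    At each position, the output scalar is the sum of the element-wise
    multiplication between the kernel and the patch it currently covers.
    """
    k = kernel.shape[0]
    out_size = (image.shape[0] - k) // stride + 1
    output = np.zeros((out_size, out_size))
    for i in range(out_size):
        for j in range(out_size):
            patch = image[i * stride:i * stride + k, j * stride:j * stride + k]
            output[i, j] = np.sum(patch * kernel)
    return output

# A hypothetical 5x5 image with a dark-to-bright vertical edge,
# and a vertical-edge filter (a classic hand-crafted receptive field):
image = np.array([[0, 0, 1, 1, 1]] * 5, dtype=float)
kernel = np.array([[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]], dtype=float)
print(convolve2d(image, kernel))
# every row reads [-3. -3.  0.]: the strongest responses sit on the edge
```

Each output scalar is the filter's response at one position: the filter "fires" exactly where the pattern it encodes (here, a vertical edge) appears in the input.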
CNNs are defined by a number of hyperparameters. The most important are described here to help the reader become familiar with convolutions. As can be noticed by looking at Figure 5, both the input tensor and the filter (or kernel) have a size, i.e. 32x32x1 (1 channel) and 5x5x1 (1 channel) respectively. While the input size is defined by the data, the user can choose the filter size. It does not need to be square, even though this is the most common case.
After calculating the sum of the element-wise multiplication, the filter slides (left-to-right, top-to-bottom) by a stride of m pixels, where m can be set by the user to 1 or a larger integer. The larger the stride, the less the output scalars will overlap with each other, and therefore the more information will be lost. The stride is generally set in a way that allows the filter to examine the full input tensor (if the stride is too large, the filter may not be able to reach the borders or the corners of the input tensor).
Because the filter cannot slide off the edge (i.e. it cannot lie partially outside the input tensor), the input tensor is often padded (see Figure 6). Padding consists of expanding one or more dimensions of the input tensor with some values (most often zeroes), so that the filter can also slide along the borders and the corners of the input tensor. A way to calculate the padding is by using the formula:
padding = (K-1)/2
where K is the filter size. This choice (often called ‘same’ padding) keeps the output the same height and width as the input when the stride is 1 and K is odd.
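As a quick sketch of the formula in action (using NumPy’s np.pad for illustration), a 5x5 filter calls for a padding of (5-1)/2 = 2, which lets a stride-1 convolution preserve a 32x32 input:

```python
import numpy as np

K = 5                      # filter size
padding = (K - 1) // 2     # = 2 for a 5x5 filter

image = np.ones((32, 32))  # a dummy 32x32 single-channel input
padded = np.pad(image, padding, mode="constant", constant_values=0)
print(padded.shape)        # (36, 36)

# With stride 1, a 5x5 filter over the padded input yields 36 - 5 + 1 = 32
# positions per dimension, so the output stays 32x32.
```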
The output size of a convolutional layer can be easily calculated with the following equation:

O = (W - K + 2P)/S + 1

where O is the output height/width, W is the input height/width, K is the filter size, P is the padding and S is the stride.
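The output-size formula O = (W - K + 2P)/S + 1 can be wrapped in a small helper to check the arithmetic (a toy function, not from any library):

```python
def conv_output_size(W, K, P, S):
    """O = (W - K + 2P) / S + 1, applied to one spatial dimension."""
    return (W - K + 2 * P) // S + 1

# A 32x32 input and a 5x5 filter, no padding, stride 1:
print(conv_output_size(W=32, K=5, P=0, S=1))  # 28

# The same setup with 'same' padding of 2 preserves the spatial size:
print(conv_output_size(W=32, K=5, P=2, S=1))  # 32
```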
As an exercise, the reader can use these equations to verify that the numbers reported in the caption of Figure 4 are correct.
There is no principled way to decide the values of the above-mentioned hyperparameters. The choice is generally data-driven and aimed at finding the configuration that best abstracts and represents the data.
3.3 Non-Linear Projection
After the convolution, a non-linear layer (or activation function) is generally applied to the activation map. The goal of this layer is to introduce non-linearity, removing noise and amplifying the important signals (without it, a stack of convolutions would collapse into a single linear operation). While tanh and sigmoid were used in the past, in recent years ReLU has been preferred for its computational efficiency (which increases training speed) and its ability to mitigate the vanishing gradient problem:
ReLU(x) = max(0, x)
ReLU turns all negative values to zero, while keeping all positive values as they are (see the paper).
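In code, ReLU is a one-liner; a NumPy sketch:

```python
import numpy as np

def relu(x):
    """Zero out negative activations; keep positive ones unchanged."""
    return np.maximum(0, x)

activations = np.array([-3.0, -0.5, 0.0, 1.2, 4.0])
print(relu(activations))  # negatives become 0; positives pass through
```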
The convolutions and the non-linear projections are generally followed by a downsampling or pooling layer. This layer works much like the convolutional layer, with a filter sliding over the input tensor. The major difference is that, instead of calculating the sum of the element-wise multiplications, it returns either the average, the L2-norm or the maximum value of the input tensor region currently covered by the filter.
The goal of this layer is to drastically reduce the spatial dimensions of the input (height and width shrink, but not the depth), at the same time reducing the computational cost and making the model generalize better (the system becomes less sensitive to small variations in the input data).
Similarly to the convolutional layers, the pooling layers have a set of hyperparameters to be defined, including filter size, stride, etc. Unlike convolutional layers, pooling layers also need to declare the function to be applied, namely average, L2-norm or maximum (the latter being the preferred one). Pooling layers generally have filters of size 2x2xChannels, which slide with a stride of 2. The calculation of the output size is identical to the one used for convolutional layers, except that there is no padding:

O = (W - K)/S + 1

where O is the output height/width, W is the input height/width, K is the filter size and S is the stride.
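A minimal NumPy sketch of max-pooling with the typical 2x2 filter and stride of 2 (illustrative only; frameworks provide this as a built-in layer):

```python
import numpy as np

def max_pool2d(x, size=2, stride=2):
    """Downsample by keeping only the maximum of each size x size window."""
    out = (x.shape[0] - size) // stride + 1   # O = (W - K)/S + 1
    pooled = np.zeros((out, out))
    for i in range(out):
        for j in range(out):
            window = x[i * stride:i * stride + size,
                       j * stride:j * stride + size]
            pooled[i, j] = window.max()
    return pooled

# A made-up 4x4 activation map shrinks to 2x2, keeping the strongest
# response of each non-overlapping 2x2 window:
activation_map = np.array([[1., 3., 2., 0.],
                           [4., 6., 1., 1.],
                           [0., 2., 8., 5.],
                           [1., 1., 3., 7.]])
print(max_pool2d(activation_map))  # [[6. 2.] [2. 8.]]
```

Note how a small shift of a strong activation inside its 2x2 window would leave the output unchanged: this is exactly the tolerance to small input variations mentioned above.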
Until now, we did not mention how the filters are initialized. We somehow assumed they were ready and perfectly able to identify patterns. This is not the case. As with any other supervised method, CNNs learn through backpropagation, which consists of four steps:
the forward pass, which consists of producing the output for a given input (information flows forward through the network)
the loss function, which calculates the distance (e.g. with Mean Squared Error) between the output and the ground truth (i.e. annotations)
loss = Sum(1/2 * (ground_truth-output)^2)
the backward pass, which computes the weight gradients using the derivative of the loss function, to finally minimize the loss.
the weight update step, which consists of updating the weights so that they move in the opposite direction of the gradient.
The system learns by iterating these four steps across all annotated samples in the training set. This can even happen multiple times (i.e. epochs). The steps are generally not performed for every single example but for a batch of examples, which makes training more efficient and the gradient estimates more stable. Three hyperparameters that need to be set by the user for training are the number of epochs, the batch size and the learning rate. The latter simply establishes how large the weight updates will be at each iteration.
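The four steps can be sketched on a toy problem: a single weight w trained so that w * x matches the annotations. Everything below is a hypothetical illustration; a real CNN applies the same loop to all filter weights at once.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)
ground_truth = 3.0 * x     # annotations generated by the "true" weight 3.0

w = 0.0                    # initial weight (normally initialized randomly)
learning_rate = 0.1

for epoch in range(50):
    output = w * x                                        # 1) forward pass
    loss = np.mean(0.5 * (ground_truth - output) ** 2)    # 2) loss function
    gradient = np.mean(-(ground_truth - output) * x)      # 3) backward pass
    w -= learning_rate * gradient                         # 4) weight update

print(w)  # close to 3.0 after training
```

The gradient is the derivative of the loss with respect to w; subtracting it (scaled by the learning rate) moves the weight in the direction that decreases the loss.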
Although pooling is very helpful to reduce overfitting, other methods also need to be applied to the CNN. Among them, one of the most effective is dropout, which consists of randomly turning off some neurons or activations during training, so that the parameters do not co-adapt too much and the network is forced to create new paths for the information to flow (see the paper).
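A NumPy sketch of the mechanism (the scaled, "inverted" variant used in practice; real frameworks ship this as a built-in layer):

```python
import numpy as np

def dropout(activations, drop_prob=0.5, training=True, rng=None):
    """Randomly zero activations during training ("inverted" dropout).

    Survivors are scaled by 1/(1 - drop_prob) so that the expected
    activation stays the same; at test time, dropout is a no-op.
    """
    if not training:
        return activations
    rng = rng if rng is not None else np.random.default_rng()
    mask = rng.random(activations.shape) >= drop_prob
    return activations * mask / (1.0 - drop_prob)

acts = np.ones((2, 4))
print(dropout(acts, drop_prob=0.5, rng=np.random.default_rng(42)))
# roughly half of the values are zeroed; the rest are scaled to 2.0
```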
In this tutorial we have quickly seen what CNNs are and how they work. Soon, we will see how they can be applied to text and we will design a model for text classification in Keras and/or PyTorch. Until then, enjoy your exploration.