An introduction to Convolutional Neural Networks for image upscaling using Sub-pixel CNNs

Guilherme Leite
Nov 29, 2020


Convolutional Neural Networks

Convolutional neural networks (CNNs) are Deep Learning algorithms commonly used in image recognition and natural language processing. Their architecture is inspired by the organization of neurons in the human visual cortex, which makes them very good at picking up patterns in input images.

CNNs are preferred for image processing because of the number of parameters they need. An input of 1920x1080 has over 2 million pixels, usually with three color channels each. In a fully connected deep neural network, every neuron in the first layer would need one weight per input value, which takes an enormous amount of processing. The output wouldn’t be as good either, since the data would have to be flattened into a one-dimensional array, which loses some of the spatial information in the original image.
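
To make the difference in parameter count concrete, here is a small back-of-the-envelope calculation. The layer sizes (a dense layer with 1000 units, a convolutional layer with 64 filters of 3x3) are assumptions chosen only for illustration; they are not taken from the model built later in this article.

```python
# Parameter-count comparison for a 1920x1080 RGB input.
# The layer sizes below are illustrative assumptions.

WIDTH, HEIGHT, CHANNELS = 1920, 1080, 3


def dense_params(num_inputs: int, num_units: int) -> int:
    """Weights plus one bias per unit in a fully connected layer."""
    return num_inputs * num_units + num_units


def conv_params(kernel_h: int, kernel_w: int, in_channels: int, num_filters: int) -> int:
    """Weights plus one bias per filter in a convolutional layer."""
    return kernel_h * kernel_w * in_channels * num_filters + num_filters


num_inputs = WIDTH * HEIGHT * CHANNELS          # 6,220,800 input values
dense_total = dense_params(num_inputs, 1000)    # 6,220,801,000 parameters
conv_total = conv_params(3, 3, CHANNELS, 64)    # 1,792 parameters

print(f"input values:          {num_inputs:,}")
print(f"dense layer (1000):    {dense_total:,}")
print(f"conv layer (64 x 3x3): {conv_total:,}")
```

The convolutional layer stays small because its kernels are reused at every position of the image, so its size does not depend on the image resolution.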

How it works

Convolutional networks essentially reduce the input image to a smaller matrix while keeping its important features, to guarantee a better final result. To achieve that, the input is fed through Convolutional layers and Pooling layers, and finally into a Classification layer, also known as a Fully Connected Layer. Unlike most other neural networks, the neurons that produce a given feature map in a CNN all share the same weights, and neurons are generally connected only to a small region of the previous layer rather than to all of it.

CNN layers Ref:

Convolutional Layer

To extract features from the input, filters called kernels are applied to the image. They generally have a size of 3x3 or 5x5 and extract features like edges, or apply transformations such as blur. With Same Padding the convolved feature keeps the same dimensions as the input, and with Valid Padding (no padding) it gets smaller. The kernel moves through the entire image in the pattern shown below:

Movement of the Kernel Ref:
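
The sliding movement of the kernel can be sketched in plain Python. This is a minimal illustration assuming a single-channel image stored as a list of rows, a stride of 1, and an odd-sized kernel; `conv2d` is a hypothetical helper name, not a Keras function. Like most deep learning libraries, it does not flip the kernel.

```python
def conv2d(image, kernel, padding="valid"):
    """Slide `kernel` over `image` (both lists of rows) with stride 1.

    padding="valid": no padding, so the output is smaller than the input.
    padding="same":  zero-pad the borders so the output keeps the input size.
    """
    kh, kw = len(kernel), len(kernel[0])
    if padding == "same":
        ph, pw = kh // 2, kw // 2
        width = len(image[0])
        padded = [[0.0] * (width + 2 * pw) for _ in range(ph)]
        padded += [[0.0] * pw + list(row) + [0.0] * pw for row in image]
        padded += [[0.0] * (width + 2 * pw) for _ in range(ph)]
        image = padded
    elif padding != "valid":
        raise ValueError("padding must be 'valid' or 'same'")

    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for i in range(out_h):          # move the kernel down, row by row
        row = []
        for j in range(out_w):      # move the kernel left to right
            acc = 0.0
            for di in range(kh):
                for dj in range(kw):
                    acc += image[i + di][j + dj] * kernel[di][dj]
            row.append(acc)
        output.append(row)
    return output


# A 5x5 image with a vertical edge: dark on the left, bright on the right.
image = [[0, 0, 1, 1, 1] for _ in range(5)]
# A simple vertical-edge kernel.
kernel = [[-1, 0, 1] for _ in range(3)]

valid = conv2d(image, kernel, padding="valid")  # 3x3 output
same = conv2d(image, kernel, padding="same")    # 5x5 output
print(valid[0])  # [3.0, 3.0, 0.0]
```

The output is large where the window straddles the edge and zero where the window covers a uniform region, which is how a kernel highlights a feature.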

Convolutional neural networks can use an array of filters to highlight features from the input image, which enables the categorization.

Application of the Kernel in the Convolutional layer Ref:

The first layers capture smaller and simpler features, like edges. As you combine multiple layers and filters, the network can start detecting more complex features.

Each kernel detects a specific feature from the input image. A feature map is created by neurons that use the same kernel, and it changes during training as the loss function is minimized. The result of each convolution is passed through a ReLU activation function, and with backpropagation the values in the filter matrices are updated.

Pooling Layer

The Pooling layer is used to reduce the dimensions of the Convolutional Layer output, decreasing the processing needed to interpret the data. It is also useful for extracting dominant features that are positionally invariant, which helps keep the training of the model effective. This layer takes the output of the previous layer and slides a window over it. Max Pooling returns the maximum value in each window, while Average Pooling returns the average of the values in each window. Keeping only a summary of each region also reduces the chance of overfitting.
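
Both pooling variants can be sketched in a few lines. This is a minimal illustration assuming non-overlapping windows (the stride equals the window size) over a single feature map stored as a list of rows; `pool2d` is a hypothetical helper name.

```python
def pool2d(feature_map, size=2, mode="max"):
    """Pool non-overlapping `size` x `size` windows of a 2D feature map."""
    if mode not in ("max", "average"):
        raise ValueError("mode must be 'max' or 'average'")
    output = []
    for i in range(0, len(feature_map) - size + 1, size):
        row = []
        for j in range(0, len(feature_map[0]) - size + 1, size):
            window = [
                feature_map[i + di][j + dj]
                for di in range(size)
                for dj in range(size)
            ]
            if mode == "max":
                row.append(max(window))
            else:
                row.append(sum(window) / len(window))
        output.append(row)
    return output


feature_map = [
    [1, 3, 2, 0],
    [4, 6, 1, 1],
    [0, 2, 9, 5],
    [1, 1, 3, 7],
]

print(pool2d(feature_map, mode="max"))      # [[6, 2], [2, 9]]
print(pool2d(feature_map, mode="average"))  # [[3.5, 1.0], [1.0, 6.0]]
```

A 4x4 feature map becomes 2x2, so the next layer only has a quarter of the values to process.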

The Convolutional Layer and the Pooling Layer can be daisy-chained, allowing the network to capture finer details in more complex images. Increasing the number of layers has drawbacks, however, since the network will need a lot more computational power to train and run.

Fully Connected Layer

Finally, we can flatten the pooling output and feed it to a classification neural network. This layer determines the probability of the detected features belonging to a particular class, using the Softmax or Sigmoid activation functions. Just like most regular neural networks, the model is trained iteratively over multiple epochs using backpropagation.
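
As a rough sketch of that last step, the two functions below turn raw scores into probabilities. The scores in the example are made up for illustration.

```python
import math


def softmax(scores):
    """Turn a list of raw class scores into probabilities that sum to 1."""
    highest = max(scores)  # subtracting the maximum avoids overflow in exp()
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(score):
    """Squash a single raw score into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-score))


# Hypothetical raw scores for three classes.
probs = softmax([2.0, 1.0, 0.1])
print([round(p, 3) for p in probs])
print(sigmoid(0.0))  # 0.5
```

Softmax is the usual choice when an input belongs to exactly one of several classes, while Sigmoid fits a yes/no decision.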

The Upscaling Problem


Image upscaling is a problem that many companies have to deal with, especially in a world where people need more space to store their files than ever before. If you think about all the pictures that are taken daily by smartphone users, and the demand for storage from those users, you can start picturing the importance of efficiently managing those files. An efficient way to deal with this is to store lower resolution images and upscale them whenever the user needs to access a file, making it much lighter to store multiple sets of images in the database.

The upscaling process consists of taking a small low resolution image and turning it into a large high resolution file. This means that the new image will have many more pixels than the previous image, and you will need an algorithm to predict what the color of the new pixels in the high resolution image will be. There are many types of algorithms out there, and the traditional ones use interpolation, such as bicubic interpolation, to estimate the color of each new pixel from the neighboring pixels of the original image.

Grid Interpolation Ref:
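
To show what interpolating on a grid means in practice, here is a minimal sketch of bilinear interpolation, a simpler relative of the bicubic method. It assumes a single-channel image stored as a list of rows and maps coordinates so that the corner pixels of both images coincide; `upscale_bilinear` is a hypothetical helper name.

```python
def upscale_bilinear(image, factor):
    """Upscale a 2D single-channel image by an integer factor."""
    in_h, in_w = len(image), len(image[0])
    out_h, out_w = in_h * factor, in_w * factor
    # Scale factors that map output coordinates back onto the input grid.
    scale_y = (in_h - 1) / (out_h - 1) if out_h > 1 else 0.0
    scale_x = (in_w - 1) / (out_w - 1) if out_w > 1 else 0.0

    output = []
    for y in range(out_h):
        src_y = y * scale_y
        y0 = int(src_y)                 # row above the sample point
        y1 = min(y0 + 1, in_h - 1)      # row below it (clamped at the border)
        wy = src_y - y0                 # vertical blend weight
        row = []
        for x in range(out_w):
            src_x = x * scale_x
            x0 = int(src_x)
            x1 = min(x0 + 1, in_w - 1)
            wx = src_x - x0
            # Blend the four surrounding pixels, first along x, then along y.
            top = image[y0][x0] * (1 - wx) + image[y0][x1] * wx
            bottom = image[y1][x0] * (1 - wx) + image[y1][x1] * wx
            row.append(top * (1 - wy) + bottom * wy)
        output.append(row)
    return output


small = [
    [0.0, 1.0],
    [1.0, 0.0],
]
large = upscale_bilinear(small, 2)  # 4x4 image
for row in large:
    print([round(value, 2) for value in row])
```

Every new pixel is just a weighted average of its neighbors, which is why interpolation tends to produce smooth but blurry results.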

Even though these algorithms may work in some cases, the image upscaling problem demands a lot more precision than what the traditional algorithms are able to deliver. Because of that, researchers started looking at Convolutional Neural Networks and deep learning to solve this problem in a much more precise way.


When we compare the output from bicubic interpolation with the output from the Convolutional Neural Network (CNN) algorithms, we can clearly see that the CNNs deliver a much higher precision than the bicubic approach.

Applying CNN to the Upscaling Problem

Now that we have covered how CNNs work and introduced the Upscaling Problem, let’s make a program that takes a low resolution image and upscales it using only Convolutional layers.

To start off, we will need to find and import a dataset of images to train our model. For this article, we will be using a small image dataset from Berkeley (2011) and the Keras API from TensorFlow.

After importing the images, we will have to separate the training and the validation datasets and also define what image output size and upscaling factor will be used.

After importing the image dataset, we can scale the pixel color channels to the range 0 to 1 instead of 0 to 255 and change the color space from RGB to YUV. This makes the training process easier, and the final result will be better perceived by humans, since the model only takes the Y (luminance) channel of each image during training and human vision is more sensitive to luminance than to color. After that we will separate the training and validation datasets and reduce the size of the images for training.
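
The pre-processing steps can be sketched in plain Python. This is only an illustration under stated assumptions: the BT.601 weights are assumed because the text does not name a YUV variant, the block-averaging resize is an assumed way of reducing the image, and `rgb_to_yuv`, `luminance_channel` and `downscale_box` are hypothetical helper names rather than functions from the project code.

```python
def rgb_to_yuv(r, g, b):
    """Convert one 8-bit RGB pixel to YUV, scaling channels from [0, 255] to [0, 1].

    Uses BT.601 weights (an assumption, see the text above).
    """
    r, g, b = r / 255.0, g / 255.0, b / 255.0
    y = 0.299 * r + 0.587 * g + 0.114 * b  # luminance
    u = 0.492 * (b - y)                    # blue color difference
    v = 0.877 * (r - y)                    # red color difference
    return y, u, v


def luminance_channel(rgb_image):
    """Keep only Y for every pixel of an image given as rows of (R, G, B) tuples."""
    return [[rgb_to_yuv(*pixel)[0] for pixel in row] for row in rgb_image]


def downscale_box(channel, factor):
    """Shrink a 2D channel by averaging each factor x factor block (assumed method)."""
    output = []
    for i in range(0, len(channel) - factor + 1, factor):
        row = []
        for j in range(0, len(channel[0]) - factor + 1, factor):
            block = [
                channel[i + di][j + dj]
                for di in range(factor)
                for dj in range(factor)
            ]
            row.append(sum(block) / len(block))
        output.append(row)
    return output


# A 6x6 test image: black on the left half, white on the right half.
rgb_image = [[(0, 0, 0)] * 3 + [(255, 255, 255)] * 3 for _ in range(6)]

target = luminance_channel(rgb_image)  # 6x6 Y channel, the training target
low_res = downscale_box(target, 3)     # 2x2 Y channel, the training input
```

With an upscaling factor of 3, each low resolution pixel summarizes a 3x3 block of the target, and the model has to learn how to recover those nine values.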

Here’s an example of a training and validation image after pre-processing. Note that for this example we are using an upscaling factor of 3.

Original Image in YUV and the Same Image with Reduced Resolution

Now that we have our images ready for training, we can write our model using Keras and start the training process. Our CNN will consist of 4 main layers and a final conversion layer. We will be using “tanh” as our activation function and MSE as our loss function. The maximum number of epochs will be set to 100, since the dataset is fairly small, and we will choose Adam as our optimization algorithm.

CNN layers:

  1. A convolutional layer with 64 filters of size 5x5, used for feature identification;
  2. A convolutional layer with 32 filters of size 3x3, used for finer feature identification;
  3. Another convolutional layer with 32 filters of size 3x3, used for finer feature identification;
  4. A sub-pixel convolution layer, where each new predicted pixel is placed according to the output image’s size;
  5. A conversion step that turns the resulting matrix into a 2-dimensional image matrix.
Network Layers Ref:
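
The last two steps of the list are the core of the sub-pixel approach, so here is a minimal sketch of the rearrangement in plain Python. It assumes a single output channel and a feature tensor stored as nested lists of shape height x width x factor²; `depth_to_space` here is a hypothetical pure-Python helper, although TensorFlow offers an operation with the same name, `tf.nn.depth_to_space`, that performs this kind of rearrangement on tensors.

```python
def depth_to_space(features, factor):
    """Rearrange an H x W x (factor**2) tensor into an (H*factor) x (W*factor) image.

    Each low resolution position holds factor**2 predicted values, and each of
    them becomes one pixel of the factor x factor block at that position.
    """
    height, width = len(features), len(features[0])
    for row in features:
        for values in row:
            if len(values) != factor ** 2:
                raise ValueError("each position needs factor**2 values")

    output = [[0.0] * (width * factor) for _ in range(height * factor)]
    for y in range(height):
        for x in range(width):
            for i in range(factor):
                for j in range(factor):
                    output[y * factor + i][x * factor + j] = features[y][x][i * factor + j]
    return output


# A 1x2 feature map with 4 values per position, for an upscaling factor of 2.
features = [[[1, 2, 3, 4], [5, 6, 7, 8]]]
image = depth_to_space(features, 2)
for row in image:
    print(row)
# [1, 2, 5, 6]
# [3, 4, 7, 8]
```

Because all the convolutions run at the low resolution and only this final rearrangement produces the large image, the network stays cheap compared with upscaling first and convolving afterwards.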

After training our model, we will have to apply it to the testing images and convert the output images back into the RGB colorspace before analysing the results. In our code, we defined a function called upscale_image that will do just that for us.
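
The body of upscale_image is not reproduced here. As a rough illustration of only the color conversion step, the sketch below turns one YUV pixel with channels in the range 0 to 1 back into 8-bit RGB. It assumes BT.601 weights, since the text does not name a YUV variant, and `yuv_to_rgb` is a hypothetical helper name.

```python
def yuv_to_rgb(y, u, v):
    """Convert one YUV pixel (BT.601 weights assumed) back to 8-bit RGB."""
    r = y + v / 0.877
    b = y + u / 0.492
    g = (y - 0.299 * r - 0.114 * b) / 0.587
    # The model output can fall slightly outside [0, 1], so clip before scaling.
    return tuple(round(255 * min(max(channel, 0.0), 1.0)) for channel in (r, g, b))


print(yuv_to_rgb(1.0, 0.0, 0.0))  # (255, 255, 255)
print(yuv_to_rgb(0.0, 0.0, 0.0))  # (0, 0, 0)
```

Since the model only predicts the Y channel, a common choice is to upscale the U and V channels with a traditional method such as bicubic interpolation before merging the three channels again; whether the project code does exactly that is not stated in the text.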

Finally, the program will output each upscaled image, and we will be able to compare the results with the bicubic upscaling and the original ground truth image.

Original Image, Lower Resolution and Upscaled Image

The complete code as well as more details can be found on our GitHub repository:

Final thoughts and conclusion

As we can see from the results above, the upscaling model does not output a perfect high resolution image like the original. This happens not only because we are using a very simple network, but especially because the number of images and epochs used was very low. Nevertheless, some images show more detailed features than the bicubic upscaling does. If we look at the fur in the bears’ pictures, we can clearly see more detailed features, especially where there is higher contrast and color differences. Apart from the bears, the airplane image also reveals the benefit of clearer and better defined colors when using CNNs for image upscaling.

Even though this is a simple and efficient model, it does not produce the most detailed reconstructions that a CNN model could produce for image upscaling. For those types of applications that need a more precise and powerful model, there are other algorithms out there that might be worth considering (Note that these algorithms use other types of layers beyond just Convolution).

Here is a list of more complex neural networks worth considering, if you plan on implementing a Super resolution neural network for image upscaling: