Friday, 5 August 2016

Image recognition tutorial in R using deep convolutional neural networks (MXNet package)

This is a detailed tutorial on image recognition in R using a deep convolutional neural network provided by the MXNet package. After a short post I wrote some times ago I received a lot of requests and emails for a much more detailed explanation, therefore I decided to write this tutorial. This post will show a reproducible example on how to get 97.5% accuracy score on a faces recognition task in R.



I feel some premises are necessary. I’m writing this tutorial with two main objectives.
The first one is to provide a fully reproducible example to everyone; the second objective is to provide a, hopefully, clear explanation to the most frequent questions I received via email on the old post. Please be aware that this is just how I decided to tackle this particular problem, this is certainly not the only way to solve it, and most certainly it is not the best. If you have any suggestion to improve the model or the process, or should you spot any mistake, please do leave a comment.


I’m going to use both Python 3.x (for getting the data and some preprocessing) and R (for the core of the tutorial), so make sure to have both installed. The package requirements for R are the following:

  1. MXNet. This package will provide you with the model we are going to be using in this tutorial, namely a deep convolutional neural network. You don’t need the GPU version, CPU will suffice for this task although it could be slower. If you feel like it, then go for the GPU version.
  2. EBImage. This package provides a wide variety of tools to work with images. It is a pleasure to work on images using this package, the documentation is crystal clear and pretty straightforward.

As for Python 3.x please make sure to install both Numpy and Scikit-learn. I suggest to install the Anaconda distribution which already has installed some commonly used packages for data science and machine learning.

Once you have all these things installed and running, you are set to go.

The dataset

I am going to use the Olivetti faces dataset1. This dataset is a collection of 64x64 pixel 0-256 greyscale images.

The dataset contains a total of 400 images of 40 subjects. With just 10 samples for each subject it is usually used for unsupervised or semi-supervised algorithms, but I’m going to try my best with the selected supervised method.

First of all, you need to scale the images in the 0-1 range. This is automatically done by the function we are going to use to download the dataset, so you do not need to worry about it that much, just be aware that it has already been done for you. Should you use your own images be sure to scale them down in the 0-1 range (or –1,1 range, although the first seems work better with neural networks in my experience). This below is the Python script you need to run in order to download the dataset. Just change the paths to your selected paths and then run it either from your IDE or the terminal.

What this piece of code does is basically download the data, reshape the X datapoints (the images) and then save the numpy arrays to a .csv file.

As you can see, the x array is a tensor (tensor is a fancy name for a multidimensional matrix) of shape (400, 64, 64): this means that the x array contains 400 samples of 64x64 matrices (read images). When in doubt about these things, just print out the first elements of the tensor and try to figure out the structure of the data with what you know. For instance, we know from the dataset description that we have 400 samples and that each image is 64x64 pixel. We flatten the x tensor into a matrix of shape  400x4096. That is, each 64x64 matrix (image) is now converted (flattened) to a row vector of length 4096.

As for y, y is already a simple column vector of size 400. No need to reshape it.

Take a look at the generated .csv and check that everything makes sense to you.

Some more preprocessing with R

Now we are going to use EBImage to resize each image to 28x28 pixel and generate the training and testing datasets. You may ask why I am resizing the images, well, for some reason my machine does not like 64x64 pixel images and my PC crash whenever I run the model with the data. Not good. But that’s ok since we can get good results with smaller images too (you can try running the model with the 64x64 size images if your PC does not crash though). Let’s go on now:

This part of the tutorial should be pretty self explanatory, if you are not sure on what is the output, I suggest taking a look at the rs_df dataframe. It should be a 400x785 dataframe as follows:

label, pixel1, pixel2, … , pixel784
0, 0.2, 0.3, … ,0.1

Building the model

Now for the fun part, let’s build the model. Below you can find the script I used to train and test the model. After the script you can find my comments and explanations on the code.

After having loaded the training and testing dataset, I convert each dataframe to a numeric matrix using the data.matrix function. Remember that the first column of the data represents the labels associated with each image. Be sure to remove the labels from train_array and test_array. Then, after having separated the predictors from the labels, you need to tell MXNet the shape of the data. This is what I did at line 19 with the following piece of code “dim(train_array) <- c(28, 28, 1, ncol(train_x))” for the training array and at line 24 for the test arrray. By doing this, we are essentially telling the model that the training data is made of ncol(train_x) samples (360 images) of shape 28x28. The number 1 signals that the images are greyscale, ie they have only 1 channel. If the images were RGB then the 1 should have been replaced by a 3 since in that case each image would have 3 channels.

As far as the model structure is concerned, this is a  variation of the LeNet model using hyperbolic tangent instead of “Relu” as activation function, 2 convolutional layers, 2 pooling layers, 2 fully connected layer, and a typical softmax output.

Each convolutional layer is using a 5x5 kernel and applying a fixed number of filters, check this awesome video for some qualitative insight into convolutional layers. The pooling layers apply a classical “max pooling” approach.

During my tests tanh seemd to work far better than sigmoid and Relu but you can try and use the other activation functions if you want.

As for the model hyperparameters, the learning rate is a little higher than usual but it turns out that it works fine, while the selected number of epochs is 480. Batch size of 40 seems to work fine. These hyperparameters were found out from many trials and error. As far as I know I could have performed a grid search at best, but I did not want to over complicate the code and therefore I just used the classical trial and error approach.

At the end you should get 0.975 accuracy.


All in all, this model was fairly easy to set up and run. When running on CPU it takes about 4-5 minutes to train which is a bit long if you want to make some experiments but it is reasonable to work with.

Considering the fact that we did not do any feature engineering and just made some simple and ordinary preprocessing steps, I believe the result achieved is quite good. Of course, in case we would like to get a better grasp on the “true” accuracy,  we would need to perform some cross validation procedures (that will inevitably eat up a lot of computing time Smile).

Thank you for reading this far, I hope this tutorial helped you understanding how to set up and run this particular model.

Dataset source and credits:

1Olivetti faces dataset was made by AT&T Laboratories Cambridge.


  1. Hi, thanks for the detailed post.
    The dataset inputted in the model is 40 faces, with 10 variations each. I am trying, however, to apply the algorithm on biological populations of cells.
    Supposing we have a good training dataset of biological cells of various population (let says, 10 cell-line, with 100 pictures for each population). Does the algorithm described above could perform a discrimination of 2 or more population when cells mixed in an image? I am not sure if the presented algorithms would allows to do so. Thank you for your reply,


    1. Hi Arno, thanks for reading the post and for your comment.

      In general, the model will be able to discriminate only if you've told it how to do that. For instance if you have two faces in the same picture, in the example above it will "chose" to output as an answer either one or the other face, since it has not been trained to do otherwise.

      If you have more labels in the same picture, you can either try to separate them and then pass each one to a classifier, use a different model, or even use some more sophisticated tools such as openCV with Python which can detect objects in a picture very accurately. Ultimately the choice will depend on the details of the specific problem you are trying to solve.

  2. Hi, thanks for post. Hope you make more posts on neural networks using R.
    Why not use Olivetti faces from RnavGraphImageData package and stay fully in R?

    1. Hi Lenchik! Thanks for reading the post! I didn't know it was already available in R, thanks for letting me know.

  3. Hi,

    First of all thanks for this post. Very informative.
    I am struggling to install "mxnet" package in my R.

    Could you please guide what needs to be done in order to use it.

  4. Hi, I was able to fix "Mxnet" issue mentioned in my previous comment.

    It ran perfectly and gave similar results. However, when I changed the seed from 100 to some other number (Just before shuffling itno training and testing), I get an accuracy of 0.325.

    Isn't it supposed to be somewhat similar of model is working fine ?
    Am I missing on something ?

    1. Hi Dimri, thanks for reading the post! Sorry for my late reply. I had a bit of a struggle too when installing Mxnet.

      Yes changing the seed should (generally) give you roughly similar accuracy results. But it may also not: it can also happen that a particular combination of test sample happened to contain particularly critical tests for the model, for instance you trained on a training set of 5 different pictures and the classifier always struggles with picture 1 (let's say it always gets 20% accuracy on it). By chance and by splitting randomly the testing set you got a testing set of 90% picture 1 samples. Now it may easily look that your classifier has an accuracy of around 20% while instead that result is simply driven by a biased testing set.

      In order to partly account for this problem and to get a better estimate of how good your model is performing, a cross validation may help testing that scenario.