This is a detailed tutorial on image recognition in R using a deep convolutional neural network provided by the **MXNet package**. After a short post I wrote some times ago I received a lot of requests and emails for a much more detailed explanation, therefore I decided to write this tutorial. This post will show a reproducible example on how to get 97.5% accuracy score on a faces recognition task in R.

Premises

I feel some premises are necessary. I’m writing this **tutorial** with two main objectives.

The first one is to provide a fully reproducible example to everyone; the second objective is to provide a, hopefully, clear explanation to the most frequent questions I received via email on the old post. Please be aware that this is just how I decided to tackle this particular problem, this is certainly not the only way to solve it, and most certainly it is not the best. If you have any suggestion to improve the model or the process, or should you spot any mistake, please do leave a comment.

## Requirements

I’m going to use both **Python 3.x** (for getting the data and some preprocessing) and **R** (for the core of the tutorial), so make sure to have both installed. The package requirements for R are the following:

- MXNet. This package will provide you with the model we are going to be using in this tutorial, namely a deep convolutional neural network. You don’t need the GPU version, CPU will suffice for this task although it could be slower. If you feel like it, then go for the GPU version.
- EBImage. This package provides a wide variety of tools to work with images. It is a pleasure to work on images using this package, the documentation is crystal clear and pretty straightforward.

As for Python 3.x please make sure to install both** ****Numpy** and** ****Scikit-learn**. I suggest to install the **Anaconda** distribution which already has installed some commonly used packages for **data science** and **machine learning**.

Once you have all these things installed and running, you are set to go.

## The dataset

I am going to use the Olivetti faces dataset^{1}. This dataset is a collection of 64x64 pixel 0-256 greyscale images.

The dataset contains a total of 400 images of 40 subjects. With just 10 samples for each subject it is usually used for **unsupervised **or **semi-supervised** algorithms, but I’m going to try my best with the selected **supervised **method.

First of all, you need to scale the images in the 0-1 range. This is automatically done by the function we are going to use to download the dataset, so you do not need to worry about it that much, just be aware that it has already been done for you. Should you use your own images be sure to **scale** them down in the 0-1 range (or –1,1 range, although the first seems work better with neural networks in my experience). This below is the Python script you need to run in order to download the dataset. Just change the paths to your selected paths and then run it either from your IDE or the terminal.

What this piece of code does is basically download the data, reshape the X datapoints (the images) and then save the numpy arrays to a .csv file.

As you can see, the x array is a tensor (tensor is a fancy name for a multidimensional matrix) of shape (400, 64, 64): this means that the x array contains 400 samples of 64x64 matrices (read images). When in doubt about these things, just print out the first elements of the tensor and try to figure out the structure of the data with what you know. For instance, we know from the dataset description that we have 400 samples and that each image is 64x64 pixel. We flatten the x tensor into a matrix of shape 400x4096. That is, each 64x64 matrix (image) is now converted (flattened) to a row vector of length 4096.

As for y, y is already a simple column vector of size 400. No need to reshape it.

Take a look at the generated .csv and check that everything makes sense to you.

## Some more preprocessing with R

Now we are going to use **EBImage** to resize each image to 28x28 pixel and generate the training and testing datasets. You may ask why I am resizing the images, well, for some reason my machine does not like 64x64 pixel images and my PC crash whenever I run the model with the data. Not good. But that’s ok since we can get good results with smaller images too (you can try running the model with the 64x64 size images if your PC does not crash though). Let’s go on now:

This part of the tutorial should be pretty self explanatory, if you are not sure on what is the output, I suggest taking a look at the **rs_df **dataframe. It should be a 400x785 dataframe as follows:

label, pixel1, pixel2, … , pixel784

0, 0.2, 0.3, … ,0.1

## Building the model

Now for the fun part, let’s build the model. Below you can find the script I used to train and test the model. After the script you can find my comments and explanations on the code.

After having loaded the training and testing dataset, I convert each dataframe to a numeric matrix using the *data.matrix* function. Remember that the first column of the data represents the labels associated with each image. Be sure to remove the labels from *train_array* and *test_array*. Then, after having separated the predictors from the labels, you need to tell MXNet the shape of the data. This is what I did at line 19 with the following piece of code “dim(train_array) <- c(28, 28, 1, ncol(train_x))” for the training array and at line 24 for the test arrray. By doing this, we are essentially telling the model that the training data is made of ncol(train_x) samples (360 images) of shape 28x28. The number 1 signals that the images are greyscale, ie they have only 1 channel. If the images were RGB then the 1 should have been replaced by a 3 since in that case each image would have 3 channels.

As far as the model structure is concerned, this is a variation of the LeNet model using hyperbolic tangent instead of “Relu” as activation function, 2 convolutional layers, 2 pooling layers, 2 fully connected layer, and a typical softmax output.

Each **convolutional layer** is using a 5x5 kernel and applying a fixed number of filters, check this awesome video for some qualitative insight into convolutional layers. The **pooling layers** apply a classical “max pooling” approach.

During my tests **tanh** seemd to work far better than sigmoid and Relu but you can try and use the other activation functions if you want.

As for the model **hyperparameters**, the learning rate is a little higher than usual but it turns out that it works fine, while the selected number of epochs is 480. Batch size of 40 seems to work fine. These hyperparameters were found out from many trials and error. As far as I know I could have performed a grid search at best, but I did not want to over complicate the code and therefore I just used the classical trial and error approach.

At the end you should get 0.975 accuracy.

## Conclusions

All in all, this model was fairly easy to set up and run. When running on CPU it takes about 4-5 minutes to train which is a bit long if you want to make some experiments but it is reasonable to work with.

Considering the fact that we did not do any feature engineering and just made some simple and ordinary preprocessing steps, I believe the result achieved is quite good. Of course, in case we would like to get a better grasp on the “true” accuracy, we would need to perform some cross validation procedures (that will inevitably eat up a lot of computing time ).

Thank you for reading this far, I hope this tutorial helped you understanding how to set up and run this particular model.

Dataset source and credits:

^{1}Olivetti faces dataset was made by AT&T Laboratories Cambridge.

Hi, thanks for the detailed post.

ReplyDeleteThe dataset inputted in the model is 40 faces, with 10 variations each. I am trying, however, to apply the algorithm on biological populations of cells.

Supposing we have a good training dataset of biological cells of various population (let says, 10 cell-line, with 100 pictures for each population). Does the algorithm described above could perform a discrimination of 2 or more population when cells mixed in an image? I am not sure if the presented algorithms would allows to do so. Thank you for your reply,

Arno

Hi Arno, thanks for reading the post and for your comment.

DeleteIn general, the model will be able to discriminate only if you've told it how to do that. For instance if you have two faces in the same picture, in the example above it will "chose" to output as an answer either one or the other face, since it has not been trained to do otherwise.

If you have more labels in the same picture, you can either try to separate them and then pass each one to a classifier, use a different model, or even use some more sophisticated tools such as openCV with Python which can detect objects in a picture very accurately. Ultimately the choice will depend on the details of the specific problem you are trying to solve.

Hi, thanks for post. Hope you make more posts on neural networks using R.

ReplyDeleteWhy not use Olivetti faces from RnavGraphImageData package and stay fully in R?

Hi Lenchik! Thanks for reading the post! I didn't know it was already available in R, thanks for letting me know.

DeleteHi,

ReplyDeleteFirst of all thanks for this post. Very informative.

I am struggling to install "mxnet" package in my R.

Could you please guide what needs to be done in order to use it.

Hi, I was able to fix "Mxnet" issue mentioned in my previous comment.

ReplyDeleteIt ran perfectly and gave similar results. However, when I changed the seed from 100 to some other number (Just before shuffling itno training and testing), I get an accuracy of 0.325.

Isn't it supposed to be somewhat similar of model is working fine ?

Am I missing on something ?

Hi Dimri, thanks for reading the post! Sorry for my late reply. I had a bit of a struggle too when installing Mxnet.

DeleteYes changing the seed should (generally) give you roughly similar accuracy results. But it may also not: it can also happen that a particular combination of test sample happened to contain particularly critical tests for the model, for instance you trained on a training set of 5 different pictures and the classifier always struggles with picture 1 (let's say it always gets 20% accuracy on it). By chance and by splitting randomly the testing set you got a testing set of 90% picture 1 samples. Now it may easily look that your classifier has an accuracy of around 20% while instead that result is simply driven by a biased testing set.

In order to partly account for this problem and to get a better estimate of how good your model is performing, a cross validation may help testing that scenario.

I cannot see there the dependent variable is stored? Somehow the information needs to be transmitted which picture blongs to which person. Where is that done?

ReplyDeleteHi, the labels are stored into the tran_y and test_y variables.

DeleteI had issues with the EBImage library. Hence I did the step "Some more preprocessing with R" using this. Attention to Changes: (1) No downscaling to 28x28. I keep the full resolution of 64 x 64 and at the end. (2) At the end a RData file instead of .CSV is created.

ReplyDelete#

# This code is from https://firsttimeprogrammer.blogspot.de/2016/08/image-recognition-tutorial-in-r-using.html

#

# This script is used to resize images from 64x64 to 28x28 pixels

# Clear workspace

rm(list=ls())

setwd("/gpudev/projects/E610046/sandbox/cnn_example1")

# Load data

X <- read.csv("data/olivetti_X.csv", header = F)

labels <- read.csv("data/olivetti_y.csv", header = F)

# Dataframe of resized images

rs_df <- data.frame()

# Main loop: for each image, resize and set it to greyscale

rs_df <- data.frame(labels,X)

rs_df[1:10,1:10]

colnames(rs_df) <- c("labels",paste0("pixel",1:ncol(X)))

# Train-test split

#-------------------------------------------------------------------------------

# Simple train-test split. No crossvalidation is done in this tutorial.

# Set seed for reproducibility purposes

set.seed(100)

# Shuffled df

shuffled <- rs_df[sample(1:nrow(rs_df)),]

# Train-test split

train_64 <- shuffled[1:360, ]

test_64 <- shuffled[361:400, ]

# Save train-test datasets

save(test_64, file="data/test_64.RData")

save(train_64, file="data/train_64.RData")

# Done!

print("Done!")

"Building the model" with batch normalization: After 6-8 iterations, test accuracy is same as above (97.5%). Uses an own implementation of early stopping (see comment below!)

ReplyDelete#'

#' with batch normalization

#' with early stopping

#'

# Clean workspace

rm(list=ls())

# Load MXNet

require(mxnet)

# load custom loss function (early stopping)

setwd("/gpudev/projects/E610046/sandbox/cnn_example1/fun/early_stopping.R")

# Loading data and set up

#-------------------------------------------------------------------------------

# Load train and test datasets

load("data/test_64.RData")

load("data/train_64.RData")

train <- train_64; rm(train_64)

test <- test_64; rm(test_64)

# Set up train and test datasets

train <- data.matrix(train)

train_x <- t(train[, -1])

train_y <- train[, 1]

train_array <- train_x

dim(train_array) <- c(64, 64, 1, ncol(train_x))

test_x <- t(test[, -1])

test_y <- test[, 1]

test_array <- test_x

dim(test_array) <- c(64, 64, 1, ncol(test_x))

# Set up the symbolic model

#-------------------------------------------------------------------------------

data <- mx.symbol.Variable('data')

# 1st convolutional layer

conv_1 <- mx.symbol.Convolution(data = data, kernel = c(5, 5), num_filter = 1)

tanh_1 <- mx.symbol.Activation(data = conv_1, act_type = "tanh")

pool_1 <- mx.symbol.Pooling(data = tanh_1, pool_type = "max", kernel = c(2, 2), stride = c(2, 2))

# 2nd convolutional layer

bn_1 <- mx.symbol.BatchNorm(pool_1)

conv_2 <- mx.symbol.Convolution(data = bn_1, kernel = c(5, 5), num_filter = 50)

tanh_2 <- mx.symbol.Activation(data = conv_2, act_type = "tanh")

pool_2 <- mx.symbol.Pooling(data=tanh_2, pool_type = "max", kernel = c(2, 2), stride = c(2, 2))

# 1st fully connected layer

bn_2 <- mx.symbol.BatchNorm(pool_2)

flatten <- mx.symbol.Flatten(data = bn_2)

fc_1 <- mx.symbol.FullyConnected(data = flatten, num_hidden = 500)

tanh_3 <- mx.symbol.Activation(data = fc_1, act_type = "tanh")

# 2nd fully connected layer

bn_3 <- mx.symbol.Flatten(data = tanh_3)

fc_2 <- mx.symbol.FullyConnected(data = bn_3, num_hidden = 40)

# Output. Softmax output since we'd like to get some probabilities.

NN_model <- mx.symbol.SoftmaxOutput(data = fc_2)

# Pre-training set up

#-------------------------------------------------------------------------------

# Set seed for reproducibility

mx.set.seed(100)

# Device used. CPU in my case.

devices <- mx.cpu()

# Training

#-------------------------------------------------------------------------------

logger <- mx.metric.logger$new()

# iterator for eval error

valIter <-

mx.io.arrayiter(

data = test_array,

label = test_y,

batch.size = 40,

shuffle = F

)

# Train the model

model <- mx.model.FeedForward.create(

NN_model,

X = train_array,

eval.data = valIter,

y = train_y,

ctx = devices,

num.round = 130,

array.batch.size = 10,

learning.rate = 0.01,

momentum = 0.9,

eval.metric = mx.metric.accuracy,

epoch.end.callback = mx.callback.custom_early.stop_log.train.metric(

period = 40,

bad.steps = 3,

logger,

verbose = T,

maximize = T

)

)

# Testing

#-------------------------------------------------------------------------------

# Predict labels

predicted <- predict(model, test_array)

# Assign labels

predicted_labels <- max.col(t(predicted)) - 1

# Get accuracy

sum(diag(table(test_y, predicted_labels)))/40

################################################################################

# OUTPUT

################################################################################

#

# 0.975

#

# Author: TG

Here's my custom callback function I used above to track validation error and to use early stopping.

Deletemx.callback.custom_early.stop_log.train.metric <-

function (period,

logger = NULL,

train.metric = NULL,

eval.metric = NULL,

bad.steps = NULL,

maximize = FALSE,

verbose = FALSE)

{

function(iteration, nbatch, env, verbose = verbose) {

if (!is.null(env$metric)) {

if (!is.null(train.metric)) {

result <- env$metric$get(env$train.metric)

if ((maximize == F & result$value < train.metric) |

(maximize == TRUE & result$value > train.metric)) {

return(FALSE)

}

}

if (!is.null(eval.metric)) {

if (!is.null(env$eval.metric)) {

result <- env$metric$get(env$eval.metric)

if ((maximize == F & result$value < eval.metric) |

(maximize == TRUE & result$value > eval.metric)) {

return(FALSE)

}

}

}

}

## @TG log train + eval error

## start insertion

if (nbatch %% period == 0 && !is.null(env$metric)) {

result <- env$metric$get(env$train.metric)

if (nbatch != 0 & verbose)

message(paste0(

"Batch [",

nbatch,

"] Train-",

result$name,

"=",

result$value

))

if (!is.null(logger)) {

if (class(logger) != "mx.metric.logger") {

stop("Invalid mx.metric.logger.")

}

logger$train <- c(logger$train, result$value)

if (!is.null(env$eval.metric)) {

result <- env$metric$get(env$eval.metric)

if (nbatch != 0 & verbose)

message(paste0(

"Batch [",

nbatch,

"] Validation-",

result$name,

"=",

result$value

))

logger$eval <- c(logger$eval, result$value)

}

}

}

## end insertion

if (!is.null(bad.steps)) {

if (iteration == 1) {

mx.best.iter <<- 1

if (maximize) {

mx.best.score <<- 0

}

else {

mx.best.score <<- Inf

}

}

if (!is.null(env$eval.metric)) {

result <- env$metric$get(env$eval.metric)

# @TG stop if validation error is NaN

if(is.na(result$value)) {

return(FALSE)

}

#

if ((maximize == F & result$value > mx.best.score) |

(maximize == TRUE & result$value <= mx.best.score)) {

# (maximize == TRUE & result$value < mx.best.score)) { # TG, 19.04.2018

if (mx.best.iter == bad.steps) {

if (verbose) {

message(

paste0(

"Best score=",

mx.best.score,

", iteration [",

iteration - bad.steps,

"]"

)

)

}

return(FALSE)

}

else {

mx.best.iter <<- mx.best.iter + 1

}

}

else {

mx.best.score <<- result$value

mx.best.iter <<- 1

}

}

}

return(TRUE)

}

}