Sunday, 11 September 2016

Building a (reusable?) deep neural network model using Tensorflow

I’ve been experimenting with Tensorflow for more than two months, and while I find it a bit more “low level” than other machine learning libraries, I like it and hopefully I am getting better at using it. During the learning process I ran into some minor “obstacles”, so I decided to write a short tutorial on how to use this amazing deep learning library.

At the beginning, Tensorflow was a bit of a headache for me, since I wasn’t really used to the concept of a computational graph. Most of all, though, I could not find tutorials using data other than the MNIST dataset or some other built-in dataset. Because built-in functions return the training and testing data in a “ready to be fed” format, those tutorials leave out how to proceed when your dataset is not a series of pixel readings like MNIST but, for instance, a mix of categorical and real-valued variables.

The model I’m going to walk you through seems to be quite flexible and, at least to me, reusable, provided that your data is in a “standard format” (see the code below for more information on this).

Things to keep in mind when approaching Tensorflow from scratch:

1. Knowing how to use Pandas, Scikit-learn and Numpy is very useful and can definitely improve your experience using Tensorflow. I think Numpy is the only mandatory requirement, but if you want a simpler life during the preprocessing stages, then I would strongly suggest you get familiar with the other two libraries as well.
2. There aren’t “prepackaged” models with Tensorflow, e.g. Scikit-learn style fit-and-predict models. Instead, you are given the building blocks for building your own model from basic operations, so you are basically on your own! While this path may require a bit more practice and patience, it lets you build more customizable models and, at least in my experience, gives you better hands-on experience. You can then build your own collection of “ready to go” models, which is way cooler! By the way, there are things like Skflow that are inspired by Scikit-learn, but if you plan on using neural networks I personally believe learning Tensorflow is more valuable.
3. Is it really necessary to use Tensorflow/a deep neural network? Not always. Sometimes using Tensorflow seems like, to quote an interesting post I read lately, bringing a tank to a knife fight. Simpler models are usually preferable: sometimes they are more than enough, and sometimes they are even better than complex models. But neural networks are cool, so no arguments here.

The (unbalanced) dataset

I’m going to use a dataset that is publicly available at the UCI Machine Learning Repository.

The dataset I’m going to use is the Bank Marketing Data Set, which contains data on the marketing campaign of a Portuguese banking institution. The dataset has 20 mixed-type predictors (real-valued, binary categorical and categorical variables with more than two levels), and the variable to be predicted is a binary response on whether the client has subscribed to a deposit or not. This is a nicely unbalanced dataset: the binary response is split 10-90%.

The number of instances is 45211. Nice sidenote: a smaller dataset (10% of the original one) is also provided, which is awesome if you need to run some quick tests!

The preprocessing steps

As far as preprocessing is concerned, the first thing I had to do was convert the categorical variables into a format that could be fed to Tensorflow; otherwise, be sure of this, it will scream at you and throw every possible exception you could imagine :) The get_dummies() function available in pandas was really useful in achieving this.
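To give an idea of what that encoding step looks like, here is a minimal sketch on a toy frame. The column names (age, marital, y) and the values are made up for illustration and are only assumed to resemble the real file; the exact calls in my script may differ.

```python
import pandas as pd

# Toy frame with hypothetical column names; the real dataset has many more columns.
df = pd.DataFrame({
    "age": [30, 45, 52],
    "marital": ["married", "single", "married"],
    "y": ["no", "yes", "no"],
})

# Map the binary response to 0/1 so it is numeric as well.
df["y"] = (df["y"] == "yes").astype(int)

# One-hot encode the listed categorical column; the other columns pass through unchanged.
encoded = pd.get_dummies(df, columns=["marital"], dtype=float)

print(encoded.columns.tolist())
print(encoded)
```

Each category becomes its own 0/1 column, so the whole frame ends up numeric and can be turned into a Numpy array with encoded.values.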

Then I normalized the data using sklearn.
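One common way to do this is standardization with sklearn’s StandardScaler. This is a sketch under that assumption (a different scaler, such as MinMaxScaler, would work along the same lines), and the numbers below are made up.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy predictor matrix: two real-valued columns on very different scales.
X = np.array([[30.0, 1200.0],
              [45.0, 300.0],
              [52.0, 5000.0]])

scaler = StandardScaler()
# Each column is shifted to mean 0 and rescaled to unit variance.
X_scaled = scaler.fit_transform(X)

print(X_scaled.mean(axis=0))
print(X_scaled.std(axis=0))
```

If you keep a separate test set, it is usually safer to call fit_transform on the training data only and then transform on the test data, so that no information leaks from the test set into the scaling.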

Strictly speaking, sklearn and pandas are not needed, but they save a lot of time and coding.

Fun fact: if you try to run the model on the full dataset at this stage, you’ll get 90% accuracy. How much is this result worth though? Very little, I would argue, and here’s why: the dataset is heavily unbalanced, so if you just pick one answer and stick to it every time, you’ll end up with 90% accuracy in one case and 10% accuracy in the other, on average. It would be wise to check that your classifier is not using this (perhaps attractive) strategy, since in that case it wouldn’t be doing much of a good job, especially if our objective is to predict whether a certain customer will or will not buy a deposit. If it did use that strategy, the classifier would always say no. It would be right 90% of the time, of course, but of what use would such a silly classifier be? None, I would argue.
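The arithmetic behind this is easy to check. The sketch below uses synthetic labels with the same 10-90 split (not the real data) and scores the two “always the same answer” strategies.

```python
import numpy as np

# Synthetic labels with a 10-90 split: 1 = subscribed, 0 = did not subscribe.
y = np.array([1] * 10 + [0] * 90)

always_no = np.zeros_like(y)
always_yes = np.ones_like(y)

# Accuracy is the fraction of predictions that match the labels.
acc_no = (always_no == y).mean()
acc_yes = (always_yes == y).mean()

# Fraction of actual subscribers that the "always no" strategy finds.
recall_no = (always_no[y == 1] == 1).mean()

print(acc_no, acc_yes, recall_no)
```

The “always no” strategy scores high on accuracy while finding none of the clients who actually subscribed, which is why accuracy alone says little on unbalanced data.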

For the reason stated above, I decided to build a new 50%-50% dataset that, although it uses less data, could provide a better solution to the problem. This route is quick but wastes precious data. Another possibility would have been to train the classifier while making the less frequent class appear more often in the training data than the other one. There are different methods, but since this was a quick exercise I chose the balanced-subset route, which produced an acceptable result.
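A balanced subset like this can be built by random undersampling of the larger class. The helper below, balance_by_undersampling, is a hypothetical name of my own and only a sketch of the idea, shown on toy data; it is not necessarily how the original script does it.

```python
import pandas as pd

def balance_by_undersampling(df, label, seed=0):
    """Return a shuffled frame in which every class has as many rows as the rarest one."""
    n_min = df[label].value_counts().min()
    # Draw n_min rows at random from each class.
    parts = [group.sample(n=n_min, random_state=seed) for _, group in df.groupby(label)]
    # Concatenate the per-class samples and shuffle the rows.
    return pd.concat(parts).sample(frac=1, random_state=seed).reset_index(drop=True)

# Toy data with a 10-90 split between the two classes.
toy = pd.DataFrame({"x": range(100), "y": [1] * 10 + [0] * 90})
balanced = balance_by_undersampling(toy, "y")

print(balanced["y"].value_counts().to_dict())
```

Most rows of the frequent class are thrown away here, which is exactly the “wastes precious data” trade-off mentioned above.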

The model

The model is a simple deep multilayer perceptron. I made a little effort to make the model a bit customizable. For instance, as long as your data is in a .csv format as specified, the model will be able to deal with it and you just need to input the path to the file. The activation functions within the hidden layers and in the output layer can be customized as well. The model assumes that all the variables, apart from the Y_LABEL, are predictors to be used in the process.
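To make the structure concrete, here is a rough sketch of the forward pass of such a perceptron with pluggable activation functions. Note that it is written in plain Numpy rather than Tensorflow, it does no training, and the names (init_mlp, forward) and layer sizes are my own assumptions for illustration, not the code of the actual model.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_mlp(layer_sizes, seed=0):
    """Create one (weights, bias) pair per layer, e.g. layer_sizes = [n_inputs, 16, 8, 1]."""
    rng = np.random.default_rng(seed)
    return [(rng.normal(0.0, 0.1, size=(n_in, n_out)), np.zeros(n_out))
            for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]

def forward(X, params, hidden_activation=relu, output_activation=sigmoid):
    """Forward pass only; the two activation arguments are the customizable parts."""
    a = X
    for W, b in params[:-1]:
        a = hidden_activation(a @ W + b)
    W, b = params[-1]
    return output_activation(a @ W + b)

# Five made-up samples with four predictors each.
X = np.random.default_rng(1).normal(size=(5, 4))
params = init_mlp([4, 16, 8, 1])

probs = forward(X, params)                                 # relu hidden layers
probs_tanh = forward(X, params, hidden_activation=np.tanh) # swap the activation

print(probs.shape)
```

Swapping an activation is just a matter of passing a different function, which is the kind of flexibility I was after.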

The model runs without problems. You can play around with the parameters, but I suggest using your GPU if you are not planning on spending the entire day running models.

Conclusions

The model seems to perform quite well, with many different configurations yielding a consistent 86% accuracy score. This 86% score is lower than the 90% we’d have got by sticking to the most frequent class but, at least in my opinion, it is more useful. Bear in mind that this score has been obtained using a balanced version of the original dataset. Getting back to the “real world”, if the aim of the bank (or the company in general) is to focus its efforts on the clients that are more likely to buy its products, then this model would be much more useful, since it can give an “educated guess” on each client and seems to be around 80% accurate. Of course, some more fine tuning and a cross-validation process are necessary and would significantly improve the reliability of the model before putting it to actual use.

Also, as a sidenote, running these tests on a CPU makes you realize how badly you need a GPU ;)

The customizable version of the model used is available here.


The dataset has been downloaded from the following source(s):

1 [Moro et al., 2014] S. Moro, P. Cortez and P. Rita. A Data-Driven Approach to Predict the Success of Bank Telemarketing. Decision Support Systems, Elsevier, 62:22-31, June 2014
