
Sunday, 11 September 2016

Building a (reusable?) deep neural network model using Tensorflow

I’ve been experimenting with Tensorflow for more than two months, and while I find it a bit more “low level” than other machine learning libraries, I like it and hopefully I am getting better at using it. During the learning process I ran into some minor “obstacles”, so I decided to write a short tutorial on how to use this amazing deep learning library.

At the beginning, Tensorflow was a bit of a headache for me, since I wasn’t really used to the concept of a computational graph, but most of all because I could not find tutorials using data other than MNIST or some other built-in dataset. Built-in functions that return the training and testing data in a “ready to be fed” format leave out the explanation of how to proceed when your dataset is not a series of pixel readings like MNIST but, for instance, a mix of categorical and real-valued variables.

The model I’m going to walk you through seems to be quite flexible and, at least to me, reusable, provided that your data is in a “standard format” (see the code below for more information on this).

Things to keep in mind when approaching Tensorflow from scratch:

1. Knowing how to use Pandas, Scikit-learn and Numpy is very useful and can definitely improve your experience using Tensorflow. I think Numpy is the only mandatory requirement, but if you want a simpler life during the preprocessing stages, I would strongly suggest you get familiar with the other two libraries as well.
2. There aren’t “prepackaged” models in the Scikit-learn fit-and-predict style. With Tensorflow you are given the building blocks to build your own model from basic operations, so you are basically on your own! While this path may require a bit more practice and patience, it gives you more customizable models and, at least in my experience, a better hands-on experience. You can then build your own collection of “ready to go” models, which is way cooler! By the way, there are things like Skflow that are inspired by Scikit-learn, but if you plan on using neural networks I personally believe learning Tensorflow is more valuable.
3. Is it really necessary to use Tensorflow/a deep neural network? Not always. Sometimes using Tensorflow seems like, to quote an interesting post I read lately, bringing a tank to a knife fight. Simpler models are usually preferable: sometimes they are more than enough and sometimes they are even better than complex models. But neural networks are cool, so no arguments here.

The (unbalanced) dataset

I’m going to use a dataset publicly available at the UCI Machine Learning Repository.

The dataset is the Bank Marketing Data Set, which contains data on the marketing campaigns of a Portuguese banking institution. It has 20 mixed-type predictors (real-valued, binary categorical and multi-level categorical variables) and the variable to be predicted is a binary response on whether the client has subscribed to a deposit or not. This is a nicely unbalanced dataset: the binary response is split roughly 10-90%.

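The number of instances is 45211 in the original bank-full.csv file; the bank-additional-full.csv file loaded in the code below (the version with 20 predictors) has 41188. Nice sidenote: a smaller dataset (10% of the original one) is provided, which is awesome if you need to run some quick tests!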

The preprocessing steps

As far as preprocessing is concerned, the first thing I had to do was convert the categorical variables into a format that could be fed to Tensorflow; otherwise, be sure of this, it will scream at you and throw every possible exception you could imagine :) The get_dummies() function available in pandas was really useful in achieving this.

Then I scaled the quantitative variables to the [0, 1] range using sklearn's MinMaxScaler.

import pandas as pd
import numpy as np
from sklearn import preprocessing
# Load data
data = pd.read_csv('bank-additional-full.csv', sep = ";")
# Variables names
var_names = data.columns.tolist()
# Categorical vars
categs = ['job','marital','education','default','housing','loan','contact','month','day_of_week','poutcome','y']
# Quantitative vars
quantit = [i for i in var_names if i not in categs]
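# Get dummy variables for categorical vars.
# The prefix keeps column names unique: several variables share the
# levels 'yes', 'no' and 'unknown'.
job = pd.get_dummies(data['job'], prefix='job')
marital = pd.get_dummies(data['marital'], prefix='marital')
education = pd.get_dummies(data['education'], prefix='education')
default = pd.get_dummies(data['default'], prefix='default')
housing = pd.get_dummies(data['housing'], prefix='housing')
loan = pd.get_dummies(data['loan'], prefix='loan')
contact = pd.get_dummies(data['contact'], prefix='contact')
month = pd.get_dummies(data['month'], prefix='month')
day = pd.get_dummies(data['day_of_week'], prefix='day')
poutcome = pd.get_dummies(data['poutcome'], prefix='poutcome')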
# Map variable to predict
dict_map = dict()
y_map = {'yes':1,'no':0}
dict_map['y'] = y_map
data = data.replace(dict_map)
label = data['y']
df1 = data[quantit]
df1_names = df1.keys().tolist()
# Scale quantitative variables
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(df1)
df1 = pd.DataFrame(x_scaled)
df1.columns = df1_names
# Get final df
final_df = pd.concat([df1,
job,
marital,
education,
default,
housing,
loan,
contact,
month,
day,
poutcome,
label], axis=1)
# Quick check
print(final_df.head())
# Save df
final_df.to_csv('bank_normalized.csv', index = False)

As you can see, sklearn and pandas are not strictly needed, but they save a lot of time and coding.
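To see what the two preprocessing steps do, here is a minimal sketch on a made-up three-row frame. The column names mimic the bank data but the values are invented. It passes all categorical columns to get_dummies in one call through its columns argument, and it does the min-max scaling by hand with the same formula MinMaxScaler uses by default. Note that the hand-written version would divide by zero on a constant column.

```python
import pandas as pd

# Toy frame with made-up values; column names mimic the bank dataset.
toy = pd.DataFrame({
    "age": [30, 45, 60],
    "marital": ["single", "married", "single"],
    "housing": ["yes", "no", "unknown"],
    "y": ["no", "yes", "no"],
})

categorical = ["marital", "housing"]
numeric = ["age"]

# One-hot encode every categorical column in one call. The column name is
# used as a prefix, so 'housing_no' cannot clash with another column's 'no'.
encoded = pd.get_dummies(toy, columns=categorical, dtype=int)

# Min-max scaling to [0, 1]: (x - min) / (max - min), column by column.
col_min = encoded[numeric].min()
col_max = encoded[numeric].max()
encoded[numeric] = (encoded[numeric] - col_min) / (col_max - col_min)

# Map the binary response to 0/1.
encoded["y"] = encoded["y"].map({"no": 0, "yes": 1})

print(encoded)
```

Every column of the resulting frame is numeric, which is the only thing the model further down really asks for.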

Fun fact: if you try to run the model on the full dataset at this stage, you’ll get 90% accuracy. How much is this result worth though? Very little, I would argue, and here’s why: the dataset is heavily unbalanced. If you just pick one answer and stick to it every time, you’ll end up in one case with 90% accuracy and in the other with 10% accuracy, on average. It would be wise to check that your classifier is not using this (perhaps attractive) strategy, since in that case it wouldn’t be doing much of a good job, especially if our objective is to predict whether a certain customer will or will not buy a deposit. If it did use that strategy, the classifier would always say no. It would be right 90% of the time, of course, but of what use would such a silly classifier be? None, I would argue.
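The argument above is easy to check numerically. The sketch below uses synthetic labels with roughly a 10/90 split (not the bank data) and compares the accuracy of an "always no" classifier with its recall on the "yes" class, which is the number the bank would actually care about.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic labels with roughly the 10/90 split described above (1 = "yes").
labels = (rng.random(10_000) < 0.10).astype(int)

# A "classifier" that ignores its input and always answers "no".
always_no = np.zeros_like(labels)

accuracy = np.mean(always_no == labels)
# Recall on the "yes" class: fraction of actual subscribers that were found.
recall_yes = np.sum((always_no == 1) & (labels == 1)) / np.sum(labels == 1)

print("accuracy:   %.3f" % accuracy)
print("recall yes: %.3f" % recall_yes)
```

Accuracy lands near 0.9 while recall on "yes" is exactly zero: the classifier never finds a single subscriber.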

For the reason stated above, I decided to build a new 50% - 50% dataset that, although using less data, could provide a better solution to the problem. This route is quick but wastes precious data. Another possibility would have been to train the classifier by making the less frequent class appear more often in the training data than the other one. There are different methods, but since this was a quick exercise I chose this route, which gave an acceptable result.
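The post does not show how the balanced file was produced, so the following is only one possible way to do it, not necessarily the one used here: random undersampling of the more frequent class with pandas. The function name undersample and the toy frame are hypothetical.

```python
import pandas as pd

def undersample(df, label_col="y", random_state=100):
    """Return a frame in which every class has as many rows as the rarest one."""
    n_min = df[label_col].value_counts().min()
    parts = [
        group.sample(n=n_min, random_state=random_state)
        for _, group in df.groupby(label_col)
    ]
    # Shuffle so the classes are not stored in contiguous blocks.
    balanced = pd.concat(parts).sample(frac=1, random_state=random_state)
    return balanced.reset_index(drop=True)

# Toy data: 90 "no" rows (0) and 10 "yes" rows (1).
toy = pd.DataFrame({"x": range(100), "y": [0] * 90 + [1] * 10})
balanced = undersample(toy)
print(balanced["y"].value_counts())
```

On the toy frame this keeps all 10 "yes" rows and a random 10 of the 90 "no" rows, which makes the cost of the approach visible: most of the majority class is thrown away.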

The model

The model is a simple deep multi-layer perceptron. I made a little effort to make the model a bit customizable. For instance, as long as your data is in a .csv format as specified, the model will be able to deal with it and you just need to input the path to the file. The activation functions within layers and in the output layer can be customized as well. The model assumes that all the variables, except for the Y_LABEL, are predictors to be used in the process.
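In practice the "format as specified" boils down to numeric predictor columns plus one label column holding integer class ids starting at 0. A small sketch with a made-up frame shows how such a table splits into a feature matrix and one-hot labels. The frame and the variable names are illustrative, and np.eye is used here as a compact alternative to filling a zero matrix by index.

```python
import numpy as np
import pandas as pd

# Made-up frame in the expected layout: numeric predictors plus one
# integer-coded label column.
toy = pd.DataFrame({
    "age": [0.1, 0.5, 0.9, 0.3],
    "housing_yes": [1, 0, 1, 0],
    "y": [0, 1, 1, 0],
})

y_label = "y"
predictors = [c for c in toy.columns if c != y_label]

X = toy[predictors].to_numpy(dtype=np.float32)
# Integer class ids; NumPy refuses float arrays as indices.
labels = toy[y_label].to_numpy().astype(int)

# One-hot encode: row i of the identity matrix is the encoding of class i.
n_classes = len(np.unique(labels))
y_one_hot = np.eye(n_classes, dtype=np.float32)[labels]

print(X.shape, y_one_hot.shape)
print(y_one_hot)
```

The explicit cast to int matters: if the label column is read as float or string, indexing with it raises an IndexError.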

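import numpy as np
import pandas as pd
import tensorflow as tf # This script uses the graph/session API (tf.placeholder, tf.Session)
from sklearn.model_selection import train_test_split # Was sklearn.cross_validation in old releases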
FILE_PATH = '~/Desktop/bank-add/bank_equalized.csv' # Path to .csv dataset
raw_data = pd.read_csv(FILE_PATH) # Open raw .csv
print("Raw data loaded successfully...\n")
#------------------------------------------------------------------------------
# Variables
Y_LABEL = 'y' # Name of the variable to be predicted
KEYS = [i for i in raw_data.keys().tolist() if i != Y_LABEL] # Name of predictors
N_INSTANCES = raw_data.shape[0] # Number of instances
N_INPUT = raw_data.shape[1] - 1 # Input size
N_CLASSES = raw_data[Y_LABEL].unique().shape[0] # Number of classes (output size)
TEST_SIZE = 0.1 # Test set size (% of dataset)
TRAIN_SIZE = int(N_INSTANCES * (1 - TEST_SIZE)) # Train size
LEARNING_RATE = 0.001 # Learning rate
TRAINING_EPOCHS = 400 # Number of epochs
BATCH_SIZE = 100 # Batch size
DISPLAY_STEP = 20 # Display progress each x epochs
HIDDEN_SIZE = 200 # Number of neurons in each hidden layer
ACTIVATION_FUNCTION_OUT = tf.nn.tanh # Last layer act fct
STDDEV = 0.1 # Standard deviation (for weights random init)
RANDOM_STATE = 100 # Random state for train_test_split
print("Variables loaded successfully...\n")
print("Number of predictors \t%s" %(N_INPUT))
print("Number of classes \t%s" %(N_CLASSES))
print("Number of instances \t%s" %(N_INSTANCES))
print("\n")
print("Metrics displayed:\tAccuracy\n")
#------------------------------------------------------------------------------
# Loading data
data = raw_data[KEYS].values # X data
labels = raw_data[Y_LABEL].values.astype(int) # y data, cast to int so it can be used as an index
# One hot encoding for labels
labels_ = np.zeros((N_INSTANCES, N_CLASSES))
labels_[np.arange(N_INSTANCES), labels] = 1
# Train-test split
data_train, data_test, labels_train, labels_test = train_test_split(data,
                                                                    labels_,
                                                                    test_size = TEST_SIZE,
                                                                    random_state = RANDOM_STATE)
print("Data loaded and split successfully...\n")
#------------------------------------------------------------------------------
# Neural net construction
# Net params
n_input = N_INPUT # input size (number of predictors)
n_hidden_1 = HIDDEN_SIZE # 1st layer
n_hidden_2 = HIDDEN_SIZE # 2nd layer
n_hidden_3 = HIDDEN_SIZE # 3rd layer
n_hidden_4 = HIDDEN_SIZE # 4th layer
n_classes = N_CLASSES # output m classes
# Tf placeholders
X = tf.placeholder(tf.float32, [None, n_input])
y = tf.placeholder(tf.float32, [None, n_classes])
dropout_keep_prob = tf.placeholder(tf.float32)
def mlp(_X, _weights, _biases, dropout_keep_prob):
    # Four tanh hidden layers, each followed by dropout
    layer1 = tf.nn.dropout(tf.nn.tanh(tf.add(tf.matmul(_X, _weights['h1']), _biases['b1'])), dropout_keep_prob)
    layer2 = tf.nn.dropout(tf.nn.tanh(tf.add(tf.matmul(layer1, _weights['h2']), _biases['b2'])), dropout_keep_prob)
    layer3 = tf.nn.dropout(tf.nn.tanh(tf.add(tf.matmul(layer2, _weights['h3']), _biases['b3'])), dropout_keep_prob)
    layer4 = tf.nn.dropout(tf.nn.tanh(tf.add(tf.matmul(layer3, _weights['h4']), _biases['b4'])), dropout_keep_prob)
    # Output layer (no dropout)
    out = ACTIVATION_FUNCTION_OUT(tf.add(tf.matmul(layer4, _weights['out']), _biases['out']))
    return out
weights = {
'h1': tf.Variable(tf.random_normal([n_input, n_hidden_1],stddev=STDDEV)),
'h2': tf.Variable(tf.random_normal([n_hidden_1, n_hidden_2],stddev=STDDEV)),
'h3': tf.Variable(tf.random_normal([n_hidden_2, n_hidden_3],stddev=STDDEV)),
'h4': tf.Variable(tf.random_normal([n_hidden_3, n_hidden_4],stddev=STDDEV)),
'out': tf.Variable(tf.random_normal([n_hidden_4, n_classes],stddev=STDDEV)),
}
biases = {
'b1': tf.Variable(tf.random_normal([n_hidden_1])),
'b2': tf.Variable(tf.random_normal([n_hidden_2])),
'b3': tf.Variable(tf.random_normal([n_hidden_3])),
'b4': tf.Variable(tf.random_normal([n_hidden_4])),
'out': tf.Variable(tf.random_normal([n_classes]))
}
# Build model
pred = mlp(X, weights, biases, dropout_keep_prob)
# Loss and optimizer
cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=pred, labels=y)) # softmax loss (newer releases require named arguments)
optimizer = tf.train.AdamOptimizer(learning_rate = LEARNING_RATE).minimize(cost)
# Accuracy
correct_prediction = tf.equal(tf.argmax(pred, 1), tf.argmax(y, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
print("Net built successfully...\n")
print("Starting training...\n")
#------------------------------------------------------------------------------
# Training
# Initialize variables
init_all = tf.global_variables_initializer() # Replaces the deprecated tf.initialize_all_variables()
# Launch session
sess = tf.Session()
sess.run(init_all)
# Training loop
for epoch in range(TRAINING_EPOCHS):
    avg_cost = 0.
    total_batch = int(data_train.shape[0] / BATCH_SIZE)
    # Loop over all batches
    for i in range(total_batch):
        # Sample a random mini-batch (with replacement) from the training set
        randidx = np.random.randint(data_train.shape[0], size = BATCH_SIZE)
        batch_xs = data_train[randidx, :]
        batch_ys = labels_train[randidx, :]
        # Fit using batched data
        sess.run(optimizer, feed_dict={X: batch_xs, y: batch_ys, dropout_keep_prob: 0.9})
        # Calculate average cost
        avg_cost += sess.run(cost, feed_dict={X: batch_xs, y: batch_ys, dropout_keep_prob:1.})/total_batch
    # Display progress
    if epoch % DISPLAY_STEP == 0:
        print("Epoch: %03d/%03d cost: %.9f" % (epoch, TRAINING_EPOCHS, avg_cost))
        # Accuracy on the last mini-batch of the epoch only
        train_acc = sess.run(accuracy, feed_dict={X: batch_xs, y: batch_ys, dropout_keep_prob:1.})
        print("Training accuracy: %.3f" % (train_acc))
print ("End of training.\n")
print("Testing...\n")
#------------------------------------------------------------------------------
# Testing
test_acc = sess.run(accuracy, feed_dict={X: data_test, y: labels_test, dropout_keep_prob:1.})
print ("Test accuracy: %.3f" % (test_acc))
sess.close()
print("Session closed!")

The model runs fine without problems. You can play around with the parameters, but I suggest using your GPU if you are not planning on spending the entire day running models.
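One parameter worth understanding before playing with it is dropout_keep_prob, which is fed as 0.9 during training and 1 during evaluation. tf.nn.dropout keeps each element with probability keep_prob and scales the survivors by 1/keep_prob. The NumPy sketch below imitates that behaviour to show why the two settings are compatible. The dropout function here is a hypothetical stand-in, not Tensorflow code.

```python
import numpy as np

def dropout(x, keep_prob, rng):
    """Keep each element with probability keep_prob and rescale the survivors."""
    mask = rng.random(x.shape) < keep_prob
    return np.where(mask, x / keep_prob, 0.0)

rng = np.random.default_rng(0)
activations = np.ones((1000, 200))

dropped = dropout(activations, keep_prob=0.9, rng=rng)
print("fraction zeroed: %.3f" % np.mean(dropped == 0))
print("mean activation: %.3f" % dropped.mean())

# keep_prob = 1.0 leaves the input untouched, which is what evaluation uses.
unchanged = dropout(activations, 1.0, rng)
print("identity at keep_prob=1:", np.array_equal(unchanged, activations))
```

Because of the rescaling, the average activation stays close to its original value, so the next layer sees inputs on the same scale in training and in testing.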

Conclusions

The model seems to perform quite well, with many different configurations yielding a consistent 86% accuracy score. This 86% score is lower than the 90% we’d have got by sticking to the most frequent class, but, at least in my opinion, it is more useful. Bear in mind that this score has been obtained using a balanced version of the original dataset. Getting back to the “real world”, if the aim of the bank (or the company in general) is to focus its efforts on the clients that are more likely to buy its products, then this model would be much more useful, since it can give an “educated guess” on each client and seems to be around 80% accurate. Of course, some more fine tuning and a cross validation process are necessary and would significantly improve the reliability of the model before putting it to actual use.
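Since the score comes from a balanced dataset, it is worth sketching what it could mean on the original 10/90 mix. The calculation below is a back-of-the-envelope illustration, not a measured result: it assumes the 86% accuracy splits evenly between the two classes (86% of "yes" clients and 86% of "no" clients classified correctly) and applies Bayes' rule to get the precision of a "yes" prediction.

```python
def precision_at_prevalence(sensitivity, specificity, prevalence):
    """Share of predicted 'yes' that are truly 'yes', via Bayes' rule."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)

# Assumption for illustration only: the model is right on 86% of the "yes"
# clients and on 86% of the "no" clients.
sens = spec = 0.86

for prevalence in (0.5, 0.1):
    p = precision_at_prevalence(sens, spec, prevalence)
    print("prevalence %.0f%% -> precision %.3f" % (prevalence * 100, p))
```

Under that assumption, precision is 0.86 on balanced data but drops to roughly 0.41 when only 10% of clients subscribe. That is still about four times better than contacting clients at random, which supports the "educated guess" argument, but it is a reminder to report more than accuracy.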

Also, as a sidenote, running these tests on a CPU makes you realize how badly you need a GPU ;)

The customizable version of the model used is available here.


The dataset has been downloaded from the following source(s):

1 [Moro et al., 2014] S. Moro, P. Cortez and P. Rita. A Data-Driven Approach to Predict the Success of Bank Telemarketing. Decision Support Systems, Elsevier, 62:22-31, June 2014

Comments:

  1. IndexError: arrays used as indices must be of integer (or boolean) type. Getting this error. Please help me @abhisekroy1994@gmail.com

  2. Solved that. Help me with this-
    ValueError: Only call `softmax_cross_entropy_with_logits` with named arguments (labels=..., logits=..., ...)

    Replies
    1. dude, try to use an old release of Tensorflow (like 0.12)! Seems like some functions from this code are deprecated in the latest versions of the library... OR, update the code! cheers

  3. can i know the preprocessing commands you had used for the banks dataset

    Replies
    1. Hi, essentially I just used pandas.get_dummies to convert categorical variable into dummy/indicator variables and then converted the binary outcome into a number (0/1).
