The Beginner Programmer: Building a (reusable?) deep neural network model using Tensorflow

I’ve been experimenting with Tensorflow for more than two months, and while I find it a bit more “low level” than other machine learning libraries, I like it and hopefully I am getting better at using it. During the learning process I ran into some minor “obstacles”, so I decided to write a short tutorial on how to use this amazing deep learning library.

At the beginning, Tensorflow was a bit of a headache for me, since I wasn’t really used to the concept of a computational graph, but most of all because I could not find tutorials using data other than MNIST or some other built-in dataset. Built-in functions that return the training and testing data in a “ready to be fed” format leave out the explanation of how to proceed if your dataset is not a series of pixel readings like MNIST but, for instance, contains a mix of categorical and real-valued variables.

The model I’m going to walk you through seems to be quite flexible and, at least to me, reusable, provided that your data is in a “standard format” (see the code below for more information on this).

Things to keep in mind when approaching Tensorflow from scratch:

1. Knowing how to use Pandas, Scikit-learn and Numpy is very useful and can definitely improve your experience using Tensorflow. I think Numpy is the only mandatory requirement, but if you want a simpler life during the preprocessing stages, I would strongly suggest you get acquainted with the other two libraries as well.
2. There aren’t “prepackaged” models in the Scikit-learn style, with fit and predict methods. With Tensorflow you are given the building blocks to build your own model from basic operations, so you are basically on your own! While this path may require a bit more practice and patience, it gives you more customizable models and, at least in my experience, a better hands-on experience. You can then build your own collection of “ready to go” models, which is way cooler! By the way, there are things like Skflow that are inspired by Scikit-learn, but if you plan on using neural networks I personally believe learning Tensorflow is more valuable.
3. Is it really necessary to use Tensorflow/a deep neural network? Not always. Sometimes using Tensorflow seems like, to quote an interesting post I read lately, bringing a tank to a knife fight. Simpler models are usually preferable: sometimes they are more than enough, and sometimes they are even better than complex models. But neural networks are cool, so no arguments here.

The (unbalanced) dataset

I’m going to use a dataset publicly available at the UCI Machine Learning Repository.

The dataset is the Bank Marketing Data Set, which contains data on the marketing campaigns of a Portuguese banking institution. It has 20 mixed-type predictors (real-valued, binary categorical and multi-level categorical variables), and the variable to be predicted is a binary response indicating whether or not the client subscribed to a deposit. This is a nicely unbalanced dataset: the binary response is split roughly 10-90%.

The number of instances is 45211. Nice sidenote: a smaller dataset (10% of the original one) is also provided, which is awesome if you need to run some quick tests!

The preprocessing steps

As far as preprocessing is concerned, the first thing I had to do was convert the categorical variables into a format that could be fed to Tensorflow; otherwise, be sure of this, it will scream at you and throw every possible exception you could imagine :) The get_dummies() function available in pandas was really useful in achieving this.

Then I rescaled the quantitative variables to the [0, 1] range using the MinMaxScaler from sklearn.

	import pandas as pd
	import numpy as np
	from sklearn import preprocessing

	# Load data
	data = pd.read_csv('bank-additional-full.csv', sep = ";")
	# Variables names
	var_names = data.columns.tolist()

	# Categorical vars
	categs = ['job','marital','education','default','housing','loan','contact','month','day_of_week','poutcome','y']
	# Quantitative vars
	quantit = [i for i in var_names if i not in categs]

	# Get dummy variables for categorical vars
	job = pd.get_dummies(data['job'])
	marital = pd.get_dummies(data['marital'])
	education = pd.get_dummies(data['education'])
	default = pd.get_dummies(data['default'])
	housing = pd.get_dummies(data['housing'])
	loan = pd.get_dummies(data['loan'])
	contact = pd.get_dummies(data['contact'])
	month = pd.get_dummies(data['month'])
	day = pd.get_dummies(data['day_of_week'])
	poutcome = pd.get_dummies(data['poutcome'])

	# Map variable to predict
	dict_map = dict()
	y_map = {'yes':1,'no':0}
	dict_map['y'] = y_map
	data = data.replace(dict_map)
	label = data['y']

	df1 = data[quantit]
	df1_names = df1.keys().tolist()

	# Scale quantitative variables
	min_max_scaler = preprocessing.MinMaxScaler()
	x_scaled = min_max_scaler.fit_transform(df1)
	df1 = pd.DataFrame(x_scaled)
	df1.columns = df1_names

	# Get final df
	final_df = pd.concat([df1,
	job,
	marital,
	education,
	default,
	housing,
	loan,
	contact,
	month,
	day,
	poutcome,
	label], axis=1)

	# Quick check
	print(final_df.head())

	# Save df
	final_df.to_csv('bank_normalized.csv', index = False)


As you can see, sklearn and pandas are not strictly needed, but they save a lot of time and coding.
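As a side note, here is a minimal sketch of the same two steps on a toy frame (the values are made up for illustration, only the column names mirror the bank dataset). Passing `columns=` to `get_dummies()` encodes several categorical columns in one call and prefixes each dummy with the name of its source column, which keeps levels such as `yes`, `no` and `unknown` apart when they occur in more than one variable. The min-max scaling is written by hand just to show what `MinMaxScaler` computes with its default range.

```python
import pandas as pd

# Toy data: made-up values, only the column names mirror the bank dataset
toy = pd.DataFrame({
    "age": [30, 45, 60],
    "housing": ["yes", "no", "unknown"],
    "loan": ["no", "no", "yes"],
})

# One-hot encode both categorical columns in one call; each dummy column
# is named <source column>_<level>, so 'housing_no' and 'loan_no' stay distinct
encoded = pd.get_dummies(toy, columns=["housing", "loan"], dtype=int)

# Min-max scaling by hand: (x - min) / (max - min) maps the column onto [0, 1]
age = encoded["age"]
encoded["age"] = (age - age.min()) / (age.max() - age.min())

print(encoded)
```

This is not the script I used for the post, just a more compact variant of the same idea.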

Fun fact: if you try to run the model on the full dataset, at this stage, you’ll get 90% accuracy. How much is this result worth though? Very little, I would argue, and here’s why: the dataset is heavily unbalanced, so if you just pick one answer and stick to it every time, you’ll end up with 90% accuracy in one case and 10% in the other, on average. It would be wise to check that your classifier is not using this (perhaps attractive) strategy, since in that case it wouldn’t be doing much of a good job, especially if our objective is to predict whether a certain customer will or will not buy a deposit. If indeed it used that strategy, the classifier would always say no. It would be right 90% of the time, of course, but of what use would such a silly classifier be? None, I would argue.
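To make the point concrete, here is a small sketch with synthetic labels that follow the same 10-90% split (these are not the real bank data). It scores a "classifier" that always answers no: the accuracy looks great, while the recall on the "yes" class shows that not a single subscriber is found.

```python
import numpy as np

# Synthetic labels with a 10-90% split (1 = "yes", 0 = "no")
labels = np.array([1] * 10 + [0] * 90)

# A "classifier" that ignores its input and always answers "no"
always_no = np.zeros_like(labels)

# Fraction of answers that match the true label
accuracy = np.mean(always_no == labels)

# Recall on the "yes" class: fraction of actual subscribers that were found
true_positives = np.sum((always_no == 1) & (labels == 1))
recall_yes = true_positives / np.sum(labels == 1)

print("Accuracy: %.2f" % accuracy)
print("Recall on 'yes': %.2f" % recall_yes)
```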

For the reason stated above, I decided to build a new 50%-50% dataset that, although it uses less data, could provide a better solution to the problem. This route is quick but wastes precious data. Another possibility would have been to train the classifier while making the less frequent class appear more often in the training data than it does in the original dataset. There are different methods, but since this was a quick exercise I chose the first route, which gave an acceptable result.
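The balancing script itself is not shown in this post, so take the following as one possible way of doing it rather than the exact code I ran. It is a sketch that randomly undersamples every class down to the size of the rarest one; the helper name `undersample` and the toy frame are hypothetical.

```python
import pandas as pd

def undersample(df, label_col, random_state=0):
    """Return a shuffled frame where every class has as many rows as the rarest one."""
    n_min = df[label_col].value_counts().min()
    # Draw n_min rows without replacement from each class
    parts = [group.sample(n=n_min, random_state=random_state)
             for _, group in df.groupby(label_col)]
    # Shuffle so that the classes are not stored in contiguous blocks
    return pd.concat(parts).sample(frac=1, random_state=random_state).reset_index(drop=True)

# Toy frame with a 10-90% split (made-up values)
toy = pd.DataFrame({"x": range(100), "y": [1] * 10 + [0] * 90})
balanced = undersample(toy, "y")
print(balanced["y"].value_counts())
```

On the real data you would call it on the normalized frame and save the result with `to_csv()`; the model script below loads a file named `bank_equalized.csv`.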

The model

The model is a simple deep multilayer perceptron with four hidden layers. I made a little effort to make the model a bit customizable. For instance, as long as your data is in a .csv format as specified, the model will be able to deal with it and you just need to input the path to the file. The activation function of the output layer can be customized as well. The model assumes that all the variables, except Y_LABEL, are predictors to be used in the process.

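	# Uses the TensorFlow graph API (tf.placeholder, tf.Session),
	# which TensorFlow 2.x only exposes under tf.compat.v1
	import numpy as np
	import pandas as pd
	import tensorflow as tf
	# train_test_split moved from sklearn.cross_validation to sklearn.model_selection
	from sklearn.model_selection import train_test_split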

	FILE_PATH = '~/Desktop/bank-add/bank_equalized.csv' # Path to .csv dataset
	raw_data = pd.read_csv(FILE_PATH) # Open raw .csv

	print("Raw data loaded successfully...\n")
	#------------------------------------------------------------------------------
	# Variables

	Y_LABEL = 'y' # Name of the variable to be predicted
	KEYS = [i for i in raw_data.keys().tolist() if i != Y_LABEL] # Name of predictors
	N_INSTANCES = raw_data.shape[0] # Number of instances
	N_INPUT = raw_data.shape[1] - 1 # Input size
	N_CLASSES = raw_data[Y_LABEL].unique().shape[0] # Number of classes (output size)
	TEST_SIZE = 0.1 # Test set size (% of dataset)
	TRAIN_SIZE = int(N_INSTANCES * (1 - TEST_SIZE)) # Train size
	LEARNING_RATE = 0.001 # Learning rate
	TRAINING_EPOCHS = 400 # Number of epochs
	BATCH_SIZE = 100 # Batch size
	DISPLAY_STEP = 20 # Display progress each x epochs
	HIDDEN_SIZE = 200 # Number of neurons in each hidden layer
	ACTIVATION_FUNCTION_OUT = tf.nn.tanh # Last layer act fct
	STDDEV = 0.1 # Standard deviation (for weights random init)
	RANDOM_STATE = 100 # Random state for train_test_split

	print("Variables loaded successfully...\n")
	print("Number of predictors \t%s" %(N_INPUT))
	print("Number of classes \t%s" %(N_CLASSES))
	print("Number of instances \t%s" %(N_INSTANCES))
	print("\n")
	print("Metrics displayed:\tAccuracy\n")
	#------------------------------------------------------------------------------
	# Loading data

	# Load data
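	data = raw_data[KEYS].values.astype(np.float32) # X data (get_values() was removed from pandas)
	labels = raw_data[Y_LABEL].values.astype(int) # y data, cast to int so it can be used as an index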

	# One hot encoding for labels
	labels_ = np.zeros((N_INSTANCES, N_CLASSES))
	labels_[np.arange(N_INSTANCES), labels] = 1

	# Train-test split
	data_train, data_test, labels_train, labels_test = train_test_split(data,
	labels_,
	test_size = TEST_SIZE,
	random_state = RANDOM_STATE)

	print("Data loaded and splitted successfully...\n")
	#------------------------------------------------------------------------------
	# Neural net construction

	# Net params
	n_input = N_INPUT # input n labels
	n_hidden_1 = HIDDEN_SIZE # 1st layer
	n_hidden_2 = HIDDEN_SIZE # 2nd layer
	n_hidden_3 = HIDDEN_SIZE # 3rd layer
	n_hidden_4 = HIDDEN_SIZE # 4th layer
	n_classes = N_CLASSES # output m classes

	# Tf placeholders
	X = tf.placeholder(tf.float32, [None, n_input])
	y = tf.placeholder(tf.float32, [None, n_classes])
	dropout_keep_prob = tf.placeholder(tf.float32)


	def mlp(_X, _weights, _biases, dropout_keep_prob):
	    # Four tanh hidden layers, each followed by dropout
	    layer1 = tf.nn.dropout(tf.nn.tanh(tf.add(tf.matmul(_X, _weights['h1']), _biases['b1'])), dropout_keep_prob)
	    layer2 = tf.nn.dropout(tf.nn.tanh(tf.add(tf.matmul(layer1, _weights['h2']), _biases['b2'])), dropout_keep_prob)
	    layer3 = tf.nn.dropout(tf.nn.tanh(tf.add(tf.matmul(layer2, _weights['h3']), _biases['b3'])), dropout_keep_prob)
	    layer4 = tf.nn.dropout(tf.nn.tanh(tf.add(tf.matmul(layer3, _weights['h4']), _biases['b4'])), dropout_keep_prob)
	    # Output layer, no dropout
	    out = ACTIVATION_FUNCTION_OUT(tf.add(tf.matmul(layer4, _weights['out']), _biases['out']))
	    return out

	weights = {
	'h1': tf.Variable(tf.random_normal([n_input, n_hidden_1],stddev=STDDEV)),
	'h2': tf.Variable(tf.random_normal([n_hidden_1, n_hidden_2],stddev=STDDEV)),
	'h3': tf.Variable(tf.random_normal([n_hidden_2, n_hidden_3],stddev=STDDEV)),
	'h4': tf.Variable(tf.random_normal([n_hidden_3, n_hidden_4],stddev=STDDEV)),
	'out': tf.Variable(tf.random_normal([n_hidden_4, n_classes],stddev=STDDEV)),
	}

	biases = {
	'b1': tf.Variable(tf.random_normal([n_hidden_1])),
	'b2': tf.Variable(tf.random_normal([n_hidden_2])),
	'b3': tf.Variable(tf.random_normal([n_hidden_3])),
	'b4': tf.Variable(tf.random_normal([n_hidden_4])),
	'out': tf.Variable(tf.random_normal([n_classes]))
	}

	# Build model
	pred = mlp(X, weights, biases, dropout_keep_prob)

	# Loss and optimizer
	cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=y, logits=pred)) # softmax loss (named arguments are required)
	optimizer = tf.train.AdamOptimizer(learning_rate = LEARNING_RATE).minimize(cost)

	# Accuracy
	correct_prediction = tf.equal(tf.argmax(pred, 1), tf.argmax(y, 1))
	accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

	print("Net built successfully...\n")
	print("Starting training...\n")
	#------------------------------------------------------------------------------
	# Training

	# Initialize variables
	init_all = tf.global_variables_initializer() # replaces the deprecated initialize_all_variables()

	# Launch session
	sess = tf.Session()
	sess.run(init_all)

	# Training loop
	for epoch in range(TRAINING_EPOCHS):
	    avg_cost = 0.
	    total_batch = int(data_train.shape[0] / BATCH_SIZE)
	    # Loop over all batches
	    for i in range(total_batch):
	        # Sample a random batch (with replacement) from the training set
	        randidx = np.random.randint(data_train.shape[0], size = BATCH_SIZE)
	        batch_xs = data_train[randidx, :]
	        batch_ys = labels_train[randidx, :]
	        # Fit using batched data
	        sess.run(optimizer, feed_dict={X: batch_xs, y: batch_ys, dropout_keep_prob: 0.9})
	        # Calculate average cost
	        avg_cost += sess.run(cost, feed_dict={X: batch_xs, y: batch_ys, dropout_keep_prob:1.})/total_batch
	    # Display progress
	    if epoch % DISPLAY_STEP == 0:
	        print ("Epoch: %03d/%03d cost: %.9f" % (epoch, TRAINING_EPOCHS, avg_cost))
	        # Accuracy on the last training batch of this epoch
	        train_acc = sess.run(accuracy, feed_dict={X: batch_xs, y: batch_ys, dropout_keep_prob:1.})
	        print ("Training accuracy: %.3f" % (train_acc))


	print ("End of training.\n")
	print("Testing...\n")
	#------------------------------------------------------------------------------
	# Testing

	test_acc = sess.run(accuracy, feed_dict={X: data_test, y: labels_test, dropout_keep_prob:1.})
	print ("Test accuracy: %.3f" % (test_acc))

	sess.close()
	print("Session closed!")


The model runs without problems. You can play around with the parameters, but I suggest using your GPU unless you are planning on spending the entire day running models.
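One step of the script that may look a bit cryptic is the one-hot encoding of the labels, which relies on NumPy integer array indexing. The toy example below (made-up labels, two classes) shows what that line does. It also shows that the label array must have an integer type: if pandas hands you the labels as floats, NumPy raises an IndexError, and casting with `astype(int)` avoids it.

```python
import numpy as np

labels = np.array([0, 1, 1, 0])  # integer class ids, one per instance
n_instances, n_classes = labels.shape[0], 2

one_hot = np.zeros((n_instances, n_classes))
# Row i gets a 1 in column labels[i]; both index arrays must be integers
one_hot[np.arange(n_instances), labels] = 1
print(one_hot)

# The same assignment with float labels is rejected by NumPy
try:
    one_hot[np.arange(n_instances), labels.astype(float)] = 1
except IndexError as err:
    print("Float labels fail:", err)
```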

Conclusions

The model seems to perform quite well, with many different configurations yielding a consistent 86% accuracy score. This 86% score is lower than the 90% we would have got by sticking to the most frequent class but, at least in my opinion, it is more useful. Bear in mind that this score was obtained using a balanced version of the original dataset. Getting back to the “real world”, if the aim of the bank (or of a company in general) is to focus its efforts on the clients that are more likely to buy its products, then this model would be much more useful, since it can give an “educated guess” on each client and seems to be around 80% accurate. Of course, some more fine tuning and a cross validation process are necessary, and would significantly improve the reliability of the model before putting it to actual use.

Also, as a sidenote, running these tests on a CPU makes you realize how badly you need a GPU ;)

The customizable version of the model used is available here.

The dataset has been downloaded from the following source(s):

¹ [Moro et al., 2014] S. Moro, P. Cortez and P. Rita. A Data-Driven Approach to Predict the Success of Bank Telemarketing. Decision Support Systems, Elsevier, 62:22-31, June 2014

6 comments:

Unknown, 27 December 2017 at 12:27
IndexError: arrays used as indices must be of integer (or boolean) type. Getting this error. Please help me @abhisekroy1994@gmail.com
Unknown, 27 December 2017 at 17:58
Solved that. Help me with this-
ValueError: Only call `softmax_cross_entropy_with_logits` with named arguments (labels=..., logits=..., ...)
jesho's blog, 4 April 2018 at 11:53
can i know the preprocessing commands you had used for the banks dataset

Sunday, 11 September 2016