Monday 24 August 2015

RandomForestClassifier on the cars dataset

Since the beginning of this summer I have been practicing a lot with Scikit-learn to improve my knowledge of Machine Learning, both in theory and in practice. Last week I also tried to tackle some of the Kaggle competitions, although they are really tough if you wish to get into the top 50 scores. Perhaps I will post something about that experience in the future.

Scikit-learn is an (almost) ready-to-use package for Machine Learning in Python. It is very user friendly, and in some cases not much code is needed to achieve interesting results, as with the cars dataset.

The cars dataset, from the UCI Machine Learning Repository, is a collection of about 1700 car entries, each with 6 features whose meaning can easily be guessed from the name (buying, maint, doors, persons, lug_boot, safety). Check the dataset description for more detailed information. The feature to be predicted is “class” and the possible values are unacc, acc, good and vgood. Most of the features are categorical, therefore they need to be encoded into numbers. Pandas is great for quick feature encoding.
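A quick way to do the encoding (a minimal sketch, assuming the raw car.data file from the UCI repository has been saved locally) is to let pandas map each categorical column to integer codes:

import pandas as pd

# car.data has no header row, so the column names are supplied by hand
cols = ["buying", "maint", "doors", "persons", "lug_boot", "safety", "class"]
df = pd.read_csv("car.data", names=cols)

# turn every categorical column into integer codes (0, 1, 2, ...)
for col in cols:
    df[col] = df[col].astype("category").cat.codes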


As I said, categorical features need to be encoded into numbers, and then you need to make sure that those values are indeed numbers (check the column types; both float and int are fine).
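A quick check on the df DataFrame from the sketch above is enough:

# every column should now report an integer (or float) dtype
print(df.dtypes)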

It is usually better, before running most ML algorithms, to scale the data, either normalizing it between 0 and 1 or rescaling each feature to have mean equal to 0 and standard deviation equal to 1. This step is very important for neural networks in R: sometimes the algorithm does not even converge if you skip the scaling process! Other weird consequences include skewed predictions and long computing times.
In some cases, such as this one, scaling is not that crucial since the numbers are already small, and the scaling process can be viewed as a further step in the optimization of your script.
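If you do want to scale, scikit-learn offers both options out of the box (a quick sketch, again assuming the encoded DataFrame df from above):

from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = df.drop("class", axis=1).values

X_minmax = MinMaxScaler().fit_transform(X)      # each feature rescaled to [0, 1]
X_standard = StandardScaler().fit_transform(X)  # each feature with mean 0 and standard deviation 1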

For this example I am using a random forest classifier. Random forests are an ensemble learning method that is usually very good at classification tasks. It basically works by building many decision trees and aggregating their predictions.

Here is my basic implementation:
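(The snippet below is a minimal sketch of such a script, assuming a 70/30 train/test split and 100 trees; details of the original may differ.)

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split  # sklearn.cross_validation in older versions

# load the data and encode the categorical features as integer codes
cols = ["buying", "maint", "doors", "persons", "lug_boot", "safety", "class"]
df = pd.read_csv("car.data", names=cols)
for col in cols:
    df[col] = df[col].astype("category").cat.codes

X = df.drop("class", axis=1).values
y = df["class"].values

# hold out 30% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# fit a random forest with 100 trees and score it on the held-out set
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))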

By running the script we obtain a 98% score on the testing set. Not bad for such a simple script! Bear in mind that when you start to work on other datasets things are not always that smooth, and you might need to change the model and tune its parameters using grid search (GridSearchCV) or similar methods.

As you start to dive deeper into ML you will immediately notice how much time is spent on feature selection, data cleaning, preprocessing and parameter tuning. In particular, as a novice I noticed how time consuming improving a model is (at least initially). For instance, to improve the prediction above, we might want to scale the data, do some cross validation and look for the “best” parameters. We might also want to test our model on fresh new data in order to be sure that it is not overfitting: 100% accuracy is desirable, but with no new data and no additional information we cannot tell whether our model is good or simply overfitting the training data.
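For example, a cross-validated grid search over a few forest parameters (a sketch, reusing the X_train/X_test split from the script above; the parameter grid is only illustrative) could look like this:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# a small, purely illustrative parameter grid
param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [None, 5, 10],
}

# 5-fold cross validation over every parameter combination
grid = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5)
grid.fit(X_train, y_train)

print(grid.best_params_)
print(grid.best_score_)            # mean cross-validated accuracy of the best combination
print(grid.score(X_test, y_test))  # the refit best model, scored on the held-out set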

Soon I’ll be talking more about ML and my experiments.

The dataset has been downloaded from the following source:

Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
