The aim of this article is to give a foretaste of the potential of machine learning algorithms in R, following step by step a standard procedure that, once you are familiar with it, can be a good starting point for designing customized models.
The idea behind each model is, indeed, the same. In a nutshell, it consists of finding an algorithm which fits the data well, training it on part of our dataset (called the training set), evaluating it on the remaining portion (the test set) and then letting it chew new data to make predictions. The trained model, once fed with new data, applies what it learned during training without any further human intervention; it improves only if it is retrained on more data.
Needless to say, this definition fails to do justice to the phenomenal intersection between Statistics and Computer Science that is Machine Learning. However, this notion will be sufficient to let you grasp the structure of the experiment and, maybe, to make you curious enough to dive deeper into the topic.
The code is provided at the end of this article, with some related comments.
Here are the steps we are going to follow to set up our experiment:
- Loading some data available in the R environment
- Setting the task we want to solve
- Splitting labels and predictors into train and test sets
- Picking the most suitable algorithm for the chosen task
- Training the model and testing it on the test set
- Evaluating its performance
Let’s begin. The dataset I’m going to use is the well-known Iris dataset. It contains measurements of Iris flowers from three related species. Let’s have a look at it (at the first 10 lines) and at some related stats:
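A minimal sketch of this first look, using only base R. The original output is not reproduced here, so the choice of functions is an assumption about how it may have been produced.

```r
# iris ships with base R (package 'datasets'), so nothing has to be downloaded
data(iris)

# First 10 observations
head(iris, 10)

# Per-column summary: quartiles for the numeric features, counts for Species
summary(iris)

# Dimensions and column types
str(iris)
```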
The dataset contains 150 observations of flowers, each having four features (or independent variables) and one label, the species (or dependent variable). Hence, we can already conclude that the kind of training we will use is ‘Supervised Learning’, since the data are already labelled and the aim is to decide which group an observation belongs to.
We can also visualize our data, first considering only two features and then the whole set of independent variables:
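One possible way to draw these two views with base graphics is sketched below. The pair of features shown in the first plot (sepal length against petal length) and the plotting functions are assumptions; the original figures may have been produced differently.

```r
data(iris)

# Integer colour codes (1, 2, 3), one per species
species_col <- as.integer(iris$Species)

# Two features only: sepal length vs. petal length, coloured by species
plot(iris$Sepal.Length, iris$Petal.Length,
     col = species_col, pch = 19,
     xlab = "Sepal length (cm)", ylab = "Petal length (cm)")
legend("topleft", legend = levels(iris$Species),
       col = seq_along(levels(iris$Species)), pch = 19)

# All four features: scatter-plot matrix, coloured by species
pairs(iris[, 1:4], col = species_col, pch = 19)
```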
There are plenty of analyses that can be performed on data to start looking for possible correlations, and only a few of them will be discussed here. However, it is stunning how much information can be gathered just by looking at some plots and stats, before even starting to train any algorithm.
First, I want to know whether my data are balanced. Indeed, imbalanced data call for some further interventions and rebalancing procedures before modelling, in order to avoid meaningless results. Furthermore, it’s always a good idea to check the probability distribution of our variables, in case we want to run some tests afterwards. Finally, a first visualization of possible correlations is a good way to lay the basis of our analysis.
Surprisingly, we can visualize all of this in just one plot, using the ggpairs function from the GGally package:
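A hedged sketch of such a call is given here. Colouring by species is an assumption about the original figure, and the GGally and ggplot2 packages must be installed.

```r
library(ggplot2)  # provides aes()
library(GGally)   # provides ggpairs()

data(iris)

# Class balance as plain counts: iris has 50 observations per species
class_counts <- table(iris$Species)
class_counts

# Pairwise scatter plots, per-species distributions and correlations in one figure
p <- ggpairs(iris, mapping = aes(colour = Species))
print(p)
```

The `table()` call gives the same balance information as the bar chart panel, just in numeric form.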
Nice, isn’t it? With just a few lines of code, we derived very meaningful information, namely the fact that the data are perfectly balanced (look at the lower right graph).
Okay, now that we are familiar with our dataset, let’s split it. The idea is to create a validation set, on which the model will be tested and evaluated, and a training set, on which it will be fitted. However, since the main goal of this analysis is a model that generalizes to new, unlabelled data, we want to make sure that its evaluation does not depend on one particular, possibly biased, selection of training data. Hence, the next step is splitting the training set itself into K folds: through an iterative rotation, the algorithm is trained on K-1 folds and tested on the remaining one, K times. The error estimate is averaged over all K trials, so that “lucky” rounds are compensated by “unlucky” ones. This approach is called Cross Validation.
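The split and the cross-validation setup could look roughly like this with the caret package. The 80/20 ratio, K = 10 and the seed are all assumptions made for illustration; the article does not state the values it used.

```r
library(caret)

data(iris)
set.seed(123)  # assumed seed, only to make the split reproducible

# Stratified split: 80% of each species goes to the training set (assumed ratio)
train_idx  <- createDataPartition(iris$Species, p = 0.8, list = FALSE)
training   <- iris[train_idx, ]
validation <- iris[-train_idx, ]

# K-fold cross-validation on the training set, with K = 10 (assumed)
ctrl <- trainControl(method = "cv", number = 10)

# Both sets stay balanced because the split is stratified by species
table(training$Species)
table(validation$Species)
```

`createDataPartition()` samples within each class, which is why the balance observed earlier is preserved in both subsets.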
The most important step of the whole process is formulating the right question and building the model accordingly. Here, my aim might be, when facing an unlabelled, unknown flower, to have a set of rules which could tell me “look, since the flower has these features, it is 99% a Setosa”. This set of rules, or decision boundary, is nothing but the output of the model trained on those data.
Once the task is set, the next step is picking the most suitable algorithm. As we are facing a classification problem, the algorithm I’m going to employ is the Support Vector Machine (SVM), but be aware that this is neither the only solution nor necessarily the most accurate: R libraries offer a variety of algorithm recipes and, once you have targeted the family of algorithms you are interested in (in this case, classification), your decision will depend on the kind of task, data and dimensions you are facing (as well as on your personal tastes).
I opted for SVM since it is one of the most popular classifiers, easy to visualize and built on a very intuitive idea. This idea, in a nutshell, is finding a decision boundary, called a hyperplane, which is able to separate the data in the most accurate way. The optimal hyperplane is the one which guarantees the largest “area of freedom” for future observations, an area where they can afford to deviate from their pattern without undermining the model. This area, which represents the largest separation between classes, is called the margin. Thus, we choose the hyperplane so that the distance from it to the nearest data point on each side is maximized (yet under some constraints I’m not going to dwell on here).
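For readers who would like to see the idea in symbols anyway, here is the standard textbook formulation for the simplest case: two linearly separable classes, with no misclassification allowed. The notation (labels $y_i \in \{-1, +1\}$, feature vectors $x_i$, weights $w$ and offset $b$) is introduced here for illustration and does not come from the article.

```latex
% Hyperplane: w^T x + b = 0
% Maximizing the margin 2 / ||w|| is equivalent to minimizing ||w||^2 / 2
\begin{aligned}
\min_{w,\,b}\quad & \tfrac{1}{2}\,\lVert w \rVert^{2} \\
\text{subject to}\quad & y_i\,\bigl(w^{\top} x_i + b\bigr) \ge 1, \qquad i = 1, \dots, n
\end{aligned}
```

The nearest points on each side satisfy the constraint with equality, so each lies at distance $1/\lVert w \rVert$ from the hyperplane and the total margin is $2/\lVert w \rVert$.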
Lots of words. Let’s visualize it with a simple graph:
Now it looks far more straightforward.
To make it even clearer, we can implement it on our data using only two features, Sepal Length and Petal Length, and only two labels, Setosa and Versicolor.
As you can see, the two clusters are clearly linearly separable. The SVM algorithm will build the hyperplane which separates them in the way that generalizes best:
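A sketch of how such a two-class, two-feature model can be fitted and drawn is shown below. It assumes the e1071 package and a linear kernel; the article does not say which package or kernel produced the original figure.

```r
library(e1071)

data(iris)

# Keep two species and two features; droplevels() removes the unused third level
two <- droplevels(subset(iris,
                         Species %in% c("setosa", "versicolor"),
                         select = c(Sepal.Length, Petal.Length, Species)))

# Linear kernel: the decision boundary is a straight line in this 2-D space
model <- svm(Species ~ Sepal.Length + Petal.Length, data = two,
             kernel = "linear")

# Decision regions; by default support vectors are drawn as "x", other points as "o"
plot(model, two, Petal.Length ~ Sepal.Length)

# Share of training points that fall on the correct side of the boundary
train_acc <- mean(predict(model, two) == two$Species)
train_acc
```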
As the previous figure displays, the space is chopped into two pieces, separating observations labelled as “Setosa” from those labelled as “Versicolor”. Now, observing this plot, we notice that some data points are displayed as “x”. Does this have a special meaning? It does. Those points are the so-called Support Vectors (SVs) and they are fundamental for our algorithm: actually, SVs are the only points which matter for the algorithm. It means that all the other observations could be moved from their current positions without affecting the model (as long as they do not enter the margin), since, again, it is determined only by the SVs.
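The support vectors can also be inspected directly on the fitted object. The sketch below assumes an e1071 model with a linear kernel on the same two features and two species; the setup lines are repeated so that the block runs on its own.

```r
library(e1071)

data(iris)
two <- droplevels(subset(iris,
                         Species %in% c("setosa", "versicolor"),
                         select = c(Sepal.Length, Petal.Length, Species)))
model <- svm(Species ~ Sepal.Length + Petal.Length, data = two,
             kernel = "linear")

model$tot.nSV            # total number of support vectors
model$nSV                # number of support vectors per class
sv_rows <- two[model$index, ]
sv_rows                  # the original observations acting as support vectors
```

Only a small subset of the 100 observations appears in `sv_rows`; the remaining points do not enter the fitted decision function.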
Now let’s implement the same procedure for the whole training dataset and let’s display some related stats:
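A hedged sketch of this training step with caret follows. The parameters C and Sigma mentioned in the output suggest a radial-kernel SVM (`method = "svmRadial"`, which relies on the kernlab package), but that choice, the 80/20 split, the seed and K = 10 are assumptions.

```r
library(caret)  # method = "svmRadial" also needs the kernlab package installed

data(iris)
set.seed(123)   # assumed seed

# Same kind of stratified split as before (assumed 80/20)
train_idx <- createDataPartition(iris$Species, p = 0.8, list = FALSE)
training  <- iris[train_idx, ]

# 10-fold cross-validation (assumed K)
ctrl <- trainControl(method = "cv", number = 10)

# Radial-kernel SVM on all four features, tuned by cross-validated accuracy
fit <- train(Species ~ ., data = training,
             method = "svmRadial",
             trControl = ctrl,
             metric = "Accuracy")

# Cross-validated Accuracy and Kappa for each candidate value of C
print(fit)
```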
Besides accuracy (number of well-classified observations / total number of observations), some further quantities are displayed, namely C, Kappa and Sigma. However, here we are only interested in accuracy, since our aim is evaluating the model’s performance in terms of well-predicted observations.
We are now at the final step of our experiment: making predictions with the trained model on our validation set, of which, remember, we already know the labels. Hence, we can immediately evaluate the performance of the SVM.
Again, many metrics are displayed, but for now let’s consider only the confusion matrix: all the well-categorized observations lie on the main diagonal, and we can immediately say that our model made pretty accurate predictions (indeed, the overall accuracy is 94.4%).
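The prediction and evaluation step might be sketched as follows, again with caret. The split ratio, seed and kernel are assumptions, so the numbers this produces will not necessarily match the 94.4% reported above.

```r
library(caret)  # method = "svmRadial" also needs the kernlab package installed

data(iris)
set.seed(123)   # assumed seed
train_idx  <- createDataPartition(iris$Species, p = 0.8, list = FALSE)
training   <- iris[train_idx, ]
validation <- iris[-train_idx, ]

fit <- train(Species ~ ., data = training, method = "svmRadial",
             trControl = trainControl(method = "cv", number = 10))

# Predict the held-out flowers and compare with their known labels
pred <- predict(fit, newdata = validation)
cm   <- confusionMatrix(pred, validation$Species)

cm$table                # rows: predicted species, columns: true species
cm$overall["Accuracy"]  # overall accuracy on the validation set

# The same accuracy by hand: diagonal (correct) counts over all counts
manual_acc <- sum(diag(cm$table)) / sum(cm$table)
manual_acc
```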
Could we say this solution is satisfying? Are we done now that we are happy with our algorithm? Well, it would be far too negligent to train only one algorithm without comparing it to potential competitors. Furthermore, many refinements could still be made to improve the performance of the chosen algorithm.
Nevertheless, the model we trained is far from useless: it did its job with dignity and returned good results. Even perfectionists couldn’t deny that, at least, these results might be a good starting point for further analyses and tests.
Conclusions? With a basic knowledge of the families of algorithms, and keeping in mind the task we want to solve, building a machine learning model can be very straightforward and fast while still achieving high-quality performance.
Here is the whole experiment coded in R: