Machine Learning is not an easy subject. It is a set of tools (mainly algorithms and optimization procedures) that cannot be fully understood without a solid grounding in mathematics and statistics.
Nevertheless, applying an ML model to a real scenario might be easier than expected. Once you are familiar with the theoretical concepts, you can rely on the pre-built packages and utilities available in Python. In other words, to build a basic model, you don’t have to be a Python ninja: the most important thing is understanding the underlying problem and developing a strategy to solve it. Python will then do the heavy lifting for you.
In this article, I’m going to show you how to build an ML pipeline step by step, and then how to wrap all the steps in 3 lines of code. For this purpose, I’m going to use the Red Wine Quality dataset, available on Kaggle.
So let’s visualize it:
    import pandas as pd

    df = pd.read_csv('winequality-red.csv')
    df.head()
The variables can be interpreted as follows:
Input variables (based on physicochemical tests):
1 - fixed acidity
2 - volatile acidity
3 - citric acid
4 - residual sugar
5 - chlorides
6 - free sulfur dioxide
7 - total sulfur dioxide
8 - density
9 - pH
10 - sulphates
11 - alcohol
Output variable (based on sensory data):
12 - quality (score between 0 and 10)
We are dealing with a supervised classification task: the target, quality, is an integer score that we treat as a categorical label rather than as a continuous value. As the classification algorithm, I’m going to use the Support Vector Machine (SVM), which I introduced in my previous article here.
So let’s start.
From our dataset, we can see that there are 11 features and 1 label. There are two things that come to mind looking at it:
- The eleven features take values on different scales: for instance, total sulfur dioxide takes two-digit values, while chlorides takes values less than 1. We can easily see this with a boxplot:
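    import matplotlib.pyplot as plt
    import seaborn as sns

    # reshape to long format: one row per (variable, value) pair
    data_m = pd.melt(df)
    # sns.boxplot returns a matplotlib Axes, so we call it ax
    ax = sns.boxplot(x='variable', y='value', data=data_m)
    # rotate the variable names so they do not overlap
    plt.xticks(rotation=45)
    plt.show()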
Features on larger scales tend to dominate the distance computations the SVM relies on, which pulls the decision boundary away from the best separating hyperplane. Hence, we might want to scale our variables:
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler

    # let's separate features from labels
    X = df.drop('quality', axis=1)
    y = df['quality']

    # let's create train and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=123)

    # let's scale our variables (the scaler is fitted on the train set only)
    scaler = StandardScaler()
    scaler.fit(X_train)
    X_train_scaled = scaler.transform(X_train)
    X_test_scaled = scaler.transform(X_test)
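To see what the scaling step actually does, here is a minimal sketch. It assumes scikit-learn’s bundled `load_wine` dataset as a stand-in for the Kaggle CSV (a different wine dataset, used only so the snippet runs on its own), and all the `*_demo` and `X_tr`/`X_te` names are illustrative:

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Stand-in data (assumption): scikit-learn's bundled wine dataset, not the Kaggle CSV
X_demo, y_demo = load_wine(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=123
)

# Mean and standard deviation are estimated on the train set only
scaler_demo = StandardScaler().fit(X_tr)
X_tr_scaled = scaler_demo.transform(X_tr)
X_te_scaled = scaler_demo.transform(X_te)

# After scaling, every train column has mean ~0 and standard deviation ~1
train_means = X_tr_scaled.mean(axis=0)
train_stds = X_tr_scaled.std(axis=0)
print(np.round(train_means, 6))
print(np.round(train_stds, 6))

# The test set reuses the train statistics, so its column means
# are close to 0 but not exactly 0
test_means = X_te_scaled.mean(axis=0)
print(np.round(test_means, 2))
```

Fitting the scaler on the train set and only transforming the test set keeps information about the test data out of the preprocessing step.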
- The second thing to note is the relatively large number of explanatory variables, eleven. Because the final goal of an ML model is to predict well on new, unseen data, a model with too many inputs runs a higher risk of overfitting. Hence, we might want to reduce the number of features without losing relevant information. A very powerful technique for dimensionality reduction is Principal Component Analysis (PCA), and it is what we are going to apply here:
    from sklearn.decomposition import PCA

    pca = PCA(n_components=2)  # reducing dimensionality from 11 to 2
    principalComponents = pca.fit_transform(X_train_scaled)
    first_component, second_component = principalComponents[:, 0], principalComponents[:, 1]
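A quick way to check how much information two components retain is to look at the explained variance ratio of the fitted PCA. The sketch below again assumes scikit-learn’s bundled `load_wine` data as a stand-in for the Kaggle CSV, and the 90% threshold is an arbitrary illustrative choice:

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Stand-in data (assumption): scikit-learn's bundled wine dataset, not the Kaggle CSV
X_demo, _ = load_wine(return_X_y=True)
X_demo_scaled = StandardScaler().fit_transform(X_demo)

# Share of the total variance captured by each of the two components
pca_demo = PCA(n_components=2).fit(X_demo_scaled)
ratios = pca_demo.explained_variance_ratio_
print(ratios, ratios.sum())

# A float between 0 and 1 asks PCA to keep as many components
# as are needed to reach that share of the variance
pca_90 = PCA(n_components=0.90).fit(X_demo_scaled)
print(pca_90.n_components_)
```

If the two components explain only a modest share of the variance, part of the signal has been discarded, which is worth keeping in mind when reading the accuracy of a model trained on them.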
Nice, we are now ready to train our model:
    from sklearn import svm

    clf = svm.SVC()
    clf.fit(principalComponents, y_train)
    clf.score(principalComponents, y_train)

    Output: 0.548704200178731
The final output is the accuracy of our model, that is, the fraction of correctly classified samples within the train set.
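Accuracy measured on the train set tends to be optimistic, so it is usually worth checking the held-out test set too. The sketch below shows one way to do it, assuming scikit-learn’s bundled `load_wine` data as a stand-in for the Kaggle CSV (all variable names are illustrative). The key point is that the scaler and the PCA are fitted on the train set and only applied to the test set:

```python
from sklearn import svm
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Stand-in data (assumption): scikit-learn's bundled wine dataset, not the Kaggle CSV
X_demo, y_demo = load_wine(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=123
)

# Fit the preprocessing steps on the train set only
scaler_demo = StandardScaler().fit(X_tr)
pca_demo = PCA(n_components=2).fit(scaler_demo.transform(X_tr))

# Apply the already-fitted transformations to both sets
Z_tr = pca_demo.transform(scaler_demo.transform(X_tr))
Z_te = pca_demo.transform(scaler_demo.transform(X_te))

clf_demo = svm.SVC().fit(Z_tr, y_tr)
train_acc = clf_demo.score(Z_tr, y_tr)
test_acc = clf_demo.score(Z_te, y_te)
print(train_acc, test_acc)
```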
The very last thing to address is the way we split the data. Indeed, we don’t know how our SVM would have performed had it been trained on a different train set. How can we obtain a score that is representative of more than one possible split? This is what Cross Validation is for: the train set is split into K folds, the model is trained on K-1 folds and tested on the remaining one, and the procedure is repeated K times so that each fold is used for testing exactly once. The final score is the average of the K scores. In that way, ‘unlucky’ splits are compensated by ‘lucky’ ones.
Let’s see how to implement it during the training phase:
    import numpy as np
    from sklearn.model_selection import cross_val_score

    score_scaled = cross_val_score(svm.SVC(), X_train_scaled, y_train, cv=5)
    np.mean(score_scaled)

    Output: 0.619408015250555
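As you can see, the accuracy is now greater. Keep in mind, though, that the two numbers are not directly comparable: this run uses all eleven scaled features, while the previous score came from a model trained on only two principal components and was measured on the very data it was trained on.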
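For intuition, the loop that cross_val_score runs can be written by hand. The sketch below assumes scikit-learn’s bundled `load_wine` data as a stand-in for the Kaggle CSV, and relies on the fact that an integer `cv` with a classifier resolves to a `StratifiedKFold` split without shuffling:

```python
import numpy as np
from sklearn import svm
from sklearn.base import clone
from sklearn.datasets import load_wine
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.preprocessing import StandardScaler

# Stand-in data (assumption): scikit-learn's bundled wine dataset, not the Kaggle CSV
X_demo, y_demo = load_wine(return_X_y=True)
# Scaled up front, mirroring the step-by-step version in the article
X_demo = StandardScaler().fit_transform(X_demo)

model = svm.SVC()
folds = StratifiedKFold(n_splits=5)

manual_scores = []
for train_idx, val_idx in folds.split(X_demo, y_demo):
    # Train a fresh copy of the model on K-1 folds...
    fold_model = clone(model)
    fold_model.fit(X_demo[train_idx], y_demo[train_idx])
    # ...and score it on the fold that was left out
    manual_scores.append(fold_model.score(X_demo[val_idx], y_demo[val_idx]))

manual_scores = np.array(manual_scores)
auto_scores = cross_val_score(model, X_demo, y_demo, cv=5)
print(manual_scores.mean(), auto_scores.mean())
```

Both approaches produce the same five scores; cross_val_score simply hides the loop.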
Nice, we have built our model step by step, but that took more than the promised 3 lines of code. So let’s see how to compact everything into a short and neat procedure.
Building a Pipeline
Scikit-learn, among its useful utilities, offers the possibility to wrap all the steps we have been talking about (and any other transformation you might want to apply to your data) in a pipeline, which can be imported and used as follows:
    from sklearn.pipeline import make_pipeline

    pipe = make_pipeline(StandardScaler(), PCA(n_components=2), svm.SVC())
    scores_pipe = cross_val_score(pipe, X_train, y_train, cv=5)
And that’s it. Because all our transformations now live inside the pipeline, we pass X_train rather than X_train_scaled when cross-validating: with ‘pipe’ as the first argument, scaling, PCA and the SVM fit are applied automatically, and the scaler and the PCA are re-fitted on the training folds of each iteration.
Using Scikit-learn Pipelines is a smart shortcut that can save you a lot of time. Nevertheless, you need to have a clear idea of which steps you want to implement, so that you can replicate them in your pipe.
For more information about Pipelines and Scikit-learn packages, you can read the official documentation here.
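To check that the pipeline really performs the same steps as the manual version, the sketch below fits both and compares their predictions on a held-out test set. It assumes scikit-learn’s bundled `load_wine` data as a stand-in for the Kaggle CSV, and all the `*_demo` names are illustrative:

```python
import numpy as np
from sklearn import svm
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data (assumption): scikit-learn's bundled wine dataset, not the Kaggle CSV
X_demo, y_demo = load_wine(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=123
)

# Pipeline version: a single fit call runs scaler -> PCA -> SVC on raw features
pipe_demo = make_pipeline(StandardScaler(), PCA(n_components=2), svm.SVC())
pipe_demo.fit(X_tr, y_tr)
pipe_pred = pipe_demo.predict(X_te)

# Manual version: the same three steps, chained by hand
scaler_demo = StandardScaler().fit(X_tr)
pca_demo = PCA(n_components=2).fit(scaler_demo.transform(X_tr))
clf_demo = svm.SVC().fit(pca_demo.transform(scaler_demo.transform(X_tr)), y_tr)
manual_pred = clf_demo.predict(pca_demo.transform(scaler_demo.transform(X_te)))

# Both versions should agree on every test sample
print(np.array_equal(pipe_pred, manual_pred))

# make_pipeline names each step after its lowercased class name
print(list(pipe_demo.named_steps))
```

The fitted pipeline can then be used like any other estimator, for example `pipe_demo.score(X_te, y_te)` for test accuracy.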
References for the dataset:
P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009.