Python is a general-purpose language and, as such, it offers a great number of extensions, ranging from scientific programming to data visualization and from statistical tools to machine learning.
It is almost impossible to know every available extension; however, a few of them are pivotal if your task consists of analyzing data and building machine learning models on them.
Hence, in this article I’m focusing on 5 main packages which will make your Python extremely versatile:
- Numpy
- Pandas
- Matplotlib
- Scikit-learn
- Seaborn
So let’s start!
Numpy
Numpy is a package for scientific computing. It allows you to perform a wide range of mathematical and statistical operations. In particular (and this is the reason why it is fundamental in Machine Learning), it allows you to perform N-dimensional computations very quickly and easily. Any time you need to manipulate vectors and matrices, Numpy is the tool to reach for.
Let’s now see some examples.
Numpy’s main object is the homogeneous multidimensional array, which can represent a vector (one dimension, shape (n,)) or a matrix (two dimensions, shape (n, m)). Let’s create a first array containing 1, 2, 3:
import numpy as np
a=np.array([1,2,3])
a
Output: array([1, 2, 3])
We can check some properties of this array with the following attributes (and with the built-in type function):
a.shape
Output: (3,)
a.dtype
Output: dtype('int32')
type(a)
Output: numpy.ndarray
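The integer dtype reported above can differ between platforms and Numpy versions (32-bit or 64-bit). As a minimal sketch, assuming you want the same result everywhere, the dtype can be requested explicitly; the names a64 and af are only illustrative:

```python
import numpy as np

# Request explicit dtypes instead of relying on the platform default.
a64 = np.array([1, 2, 3], dtype=np.int64)
af = np.array([1, 2, 3], dtype=np.float64)

print(a64.dtype)            # int64
print(af.dtype)             # float64
print(a64.ndim, a64.size)   # 1 3 (number of dimensions, number of elements)
```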
We can also initialize arrays by specifying the number of components and the shape. For example, to create a 3×4 matrix with the numbers from 0 to 11, I will write:
b=np.arange(12).reshape(3,4)
b
Output: array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])
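Once the data sits in a matrix like this, N-dimensional computations become one-liners. The following sketch (the variable names are illustrative) shows reductions along each axis and a matrix product on the same 3×4 array:

```python
import numpy as np

b = np.arange(12).reshape(3, 4)

col_sums = b.sum(axis=0)    # collapse the rows: one sum per column
row_means = b.mean(axis=1)  # collapse the columns: one mean per row
gram = b @ b.T              # matrix product of a (3, 4) and a (4, 3) array

print(col_sums)    # [12 15 18 21]
print(row_means)   # [1.5 5.5 9.5]
print(gram.shape)  # (3, 3)
```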
You can also convert other objects, such as lists, into arrays. This is very useful whenever you have to compute with the items of a list. Let’s say that you want to subtract these two lists:
list_1=[1,2,3]
list_2=[4,5,6]
list_2-list_1
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-11-4c6e19f7626c> in <module>
1 list_1=[1,2,3]
2 list_2=[4,5,6]
----> 3 list_2-list_1
TypeError: unsupported operand type(s) for -: 'list' and 'list'
As you can see, Python raises an error because lists do not support subtraction. But if you convert them to arrays first:
np.asarray(list_2)-np.asarray(list_1)
Output: array([3, 3, 3])
The problem is bypassed!
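To make clear what the conversion buys you, here is a small sketch comparing the pure-Python way of subtracting the two lists with the array version, plus one example of combining an array with a scalar. The names diff_py, diff_np and scaled are illustrative:

```python
import numpy as np

list_1 = [1, 2, 3]
list_2 = [4, 5, 6]

# Pure-Python equivalent: pair the items up and subtract them one by one.
diff_py = [y - x for x, y in zip(list_1, list_2)]

# Numpy equivalent: the subtraction is applied to every element at once.
diff_np = np.asarray(list_2) - np.asarray(list_1)

# Broadcasting: a scalar is combined with every element of the array.
scaled = np.asarray(list_1) * 10

print(diff_py)  # [3, 3, 3]
print(diff_np)  # [3 3 3]
print(scaled)   # [10 20 30]
```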
Pandas
Pandas provides the data structures and data analysis tools you need to clean your data and prepare it for machine learning tasks.
The main objects in pandas are DataFrames, which are structured, tabular datasets that can be easily modified and accessed. You can either create your dataframe or import it (from the web, csv files, text files…).
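As a sketch of the import route, pd.read_csv accepts a file path, a URL or any file-like object. The CSV content below is hypothetical and kept in memory so that the example is self-contained:

```python
import io
import pandas as pd

# Hypothetical CSV content; in practice this would be a file path or URL.
csv_text = "Name,Age\nalex,10\ntom,15\njim,14\n"

df_csv = pd.read_csv(io.StringIO(csv_text))
print(df_csv.shape)           # (3, 2)
print(list(df_csv.columns))   # ['Name', 'Age']
```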
Let’s create one from scratch:
import pandas as pd
data = [['alex', 10], ['tom', 15], ['jim', 14]]
df = pd.DataFrame(data, columns = ['Name', 'Age'])
df

We can access the elements of this df as if it were a matrix:
df[:1] #showing only the first row

df.iloc[:,1]
Output:
0 10
1 15
2 14
Name: Age, dtype: int64
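Besides positional access with iloc, pandas also offers label-based access with loc and filtering with boolean masks. A minimal sketch on the same data (the variable names are illustrative):

```python
import pandas as pd

data = [['alex', 10], ['tom', 15], ['jim', 14]]
df = pd.DataFrame(data, columns=['Name', 'Age'])

first_row = df.iloc[0]       # position-based: first row as a Series
ages = df.loc[:, 'Age']      # label-based: the 'Age' column
older = df[df['Age'] > 12]   # boolean mask: rows where Age exceeds 12

print(first_row['Name'])        # alex
print(ages.tolist())            # [10, 15, 14]
print(older['Name'].tolist())   # ['tom', 'jim']
```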
All the columns of a pandas dataframe are Series objects:
type(df['Age'])
Output: pandas.core.series.Series
We can also append new columns to our dataset and set one of its columns as the index:
g=['M','M','M']
df['Gender']=g
df.set_index('Name',inplace=True)
df

Pandas is fundamental whenever you deal with large amounts of data, since it can also summarize relevant information (such as the presence of missing values, outliers, means, frequencies and so forth).
df.isnull().sum() #for missing values
Output:
Age 0
Gender 0
dtype: int64
df.describe()
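The small dataframe above has no missing values, so here is a sketch on hypothetical data with one missing age, showing how such values can be counted and then either filled or dropped. Filling with the mean is just one possible choice:

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing age.
df_na = pd.DataFrame({'Name': ['alex', 'tom', 'jim'],
                      'Age': [10, np.nan, 14]})

missing = df_na.isnull().sum()           # missing values per column
mean_age = df_na['Age'].mean()           # NaN is skipped by default
filled = df_na['Age'].fillna(mean_age)   # replace NaN with the mean
dropped = df_na.dropna()                 # or drop the incomplete rows

print(missing['Age'])    # 1
print(mean_age)          # 12.0
print(filled.tolist())   # [10.0, 12.0, 14.0]
print(len(dropped))      # 2
```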

Matplotlib
Matplotlib offers different tools for data visualization. It is not the only visualization package available in Python, but it is arguably the most intuitive to use, and it produces very nice results.
Let’s see how to plot different graphs:
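import matplotlib.pyplot as plt
import numpy as np
labels = ['G1', 'G2', 'G3', 'G4', 'G5'] # placeholder group names (assumed, not given in the original)
men_means = [20, 34, 30, 35, 27]
width = 0.35 # bar width (assumed value)
x = np.arange(len(labels)) # one x position per group
fig, ax = plt.subplots()
ax.bar(x, men_means, width, label='Men')
ax.set_xticks(x)
ax.set_xticklabels(labels)
ax.set_title('Men Means')
plt.show()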

We can also show multiple bars in the same graph:
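import matplotlib.pyplot as plt
import numpy as np
labels = ['G1', 'G2', 'G3', 'G4', 'G5'] # placeholder group names (assumed, not given in the original)
men_means = [20, 34, 30, 35, 27]
women_means = [25, 32, 34, 20, 25]
width = 0.35 # bar width (assumed value)
x = np.arange(len(labels)) # one x position per group
fig, ax = plt.subplots()
rects1 = ax.bar(x - width/2, men_means, width, label='Men') # shifted left of the tick
rects2 = ax.bar(x + width/2, women_means, width, label='Women') # shifted right of the tick
ax.set_xticks(x)
ax.set_xticklabels(labels)
ax.set_title('Men and Women Means')
ax.legend()
plt.show()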

Now let’s draw a sample from a normal random variable, plot its histogram and overlay the theoretical normal density:
import matplotlib.pyplot as plt
import numpy as np
mu, sigma = 0, 0.1 # mean and standard deviation
s = np.random.normal(mu, sigma, 1000)
# density=True normalizes the histogram; recent Matplotlib versions no longer accept the older normed argument
count, bins, ignored = plt.hist(s, 30, density=True)
plt.plot(bins, 1/(sigma * np.sqrt(2 * np.pi)) * np.exp(-(bins - mu)**2 / (2 * sigma**2)), linewidth=2, color='r')
plt.show()
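The red curve and the bars are only comparable because the histogram is normalized to unit area. The sketch below checks this numerically with np.histogram, which computes the same bins without drawing anything. The seeded generator and the helper normal_pdf are assumptions made for this illustration:

```python
import numpy as np

mu, sigma = 0, 0.1
rng = np.random.default_rng(0)   # seeded generator, for reproducibility
s = rng.normal(mu, sigma, 1000)

# density=True rescales the counts so that the bars integrate to 1.
density, bins = np.histogram(s, bins=30, density=True)
area = np.sum(density * np.diff(bins))

def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

print(round(area, 6))              # 1.0
print(normal_pdf(mu, mu, sigma))   # peak height, 1 / (sigma * sqrt(2 * pi))
```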

Now imagine we want to plot the results of a survey where people had to name their favorite Italian pasta. The results of the survey are:
import matplotlib.pyplot as plt
labels = 'Gnocchi', 'Tortellini', 'Spaghetti', 'Penne'
sizes = [15, 30, 45, 10]
explode=(0,0,0,0)
fig1, ax1 = plt.subplots()
ax1.pie(sizes, explode=explode, labels=labels, autopct='%1.1f%%',
shadow=True, startangle=90)
ax1.axis('equal')
plt.show()

You can also emphasize the most popular answer with the option explode:
explode=(0,0,0.1,0)
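Rather than typing the explode tuple by hand, it can be derived from the data so that it always follows the most popular answer. This is a sketch under the assumption that an offset of 0.1 is wanted for the largest slice:

```python
import matplotlib.pyplot as plt

labels = ['Gnocchi', 'Tortellini', 'Spaghetti', 'Penne']
sizes = [15, 30, 45, 10]

# Offset only the slice(s) holding the maximum value; 0.1 is an arbitrary offset.
largest = max(sizes)
explode = tuple(0.1 if size == largest else 0 for size in sizes)
print(explode)  # (0, 0, 0.1, 0)

fig1, ax1 = plt.subplots()
ax1.pie(sizes, explode=explode, labels=labels, autopct='%1.1f%%', startangle=90)
ax1.axis('equal')  # equal aspect ratio so the pie is drawn as a circle
plt.show()
```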

As a data scientist, you will find data visualization pivotal, since you will always have to present your results in an intuitive and powerful way. Furthermore, relevant graphs often help to identify patterns in the data even before you start building models, so they might suggest which kind of analysis to run.
Scikit-Learn
This is probably the most important package for machine learning, since it provides a wide range of algorithms, from supervised to unsupervised learning and from classification to regression. Plus, it includes evaluation metrics such as ROC AUC, MSE, R squared and so forth, which you can compute with a single function call once your algorithm is trained.
Let’s see a very simple example of an ML task, using the Boston House Price dataset and modeling the price with respect to just one variable, so that we can visualize it. Since this is a regression task (the target variable ‘price’ is continuous), we will use a Simple Linear Regression:
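import pandas as pd
# note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2, so this snippet needs an older release
from sklearn.datasets import load_boston
dataset = load_boston()
df = pd.DataFrame(dataset.data, columns=dataset.feature_names)
df['target'] = dataset.target # house prices, the variable we want to predict
df.head()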

Since we want to build a simple linear regression (only one feature), we need to reduce the dimensionality from 13 to 1. To do so while retaining as much of the variance in the data as possible, we can run a Principal Component Analysis:
from sklearn.decomposition import PCA
pca = PCA(1)
projected = pca.fit_transform(dataset.data)
print(dataset.data.shape)
print(projected.shape)
Output:
(506, 13)
(506, 1)
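How much information survives the projection can be read from the explained_variance_ratio_ attribute of a fitted PCA. The sketch below uses hypothetical synthetic data (not the housing data), built so that most of the variance lies along one direction:

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical data: 200 samples, 3 features, where the second feature is
# nearly a multiple of the first, so most variance lies along one direction.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
X = np.column_stack([x1,
                     2 * x1 + rng.normal(scale=0.1, size=200),
                     rng.normal(scale=0.1, size=200)])

pca = PCA(n_components=1)
projected = pca.fit_transform(X)

print(projected.shape)                 # (200, 1)
print(pca.explained_variance_ratio_)   # share of the total variance kept by the component
```

A ratio close to 1 means the single component is a faithful summary; a low ratio is a warning that a one-feature model will miss much of the structure.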
Nice, now let’s import and train our model:
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
# hold out part of the data (25% by default) to test the model on unseen samples
X_train, X_test, y_train, y_test = train_test_split(projected, dataset.target, random_state=0)
lm = LinearRegression()
lm.fit(X_train,y_train)
y_pred = lm.predict(X_test)
#let's visualize the results
plt.scatter(X_test, y_test, color='black')
plt.plot(X_test, y_pred, color='blue', linewidth=3)
plt.xticks(())
plt.yticks(())
plt.xlabel('First Principal component')
plt.ylabel('price')
plt.show()

We can also ask for ‘feedback’ on the performance of our algorithm:
from sklearn.metrics import mean_squared_error, r2_score
print("MSE: {:.2f}".format(mean_squared_error(y_test, y_pred)))
print("R2: {:.2f}".format(r2_score(y_test, y_pred)))
Output:
MSE: 73.04
R2: 0.11
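To see what these two numbers measure, here is a sketch that computes them by hand with Numpy, following the standard definitions (MSE as the mean of the squared residuals, R2 as one minus the ratio of residual to total sum of squares). The targets and predictions are hypothetical values chosen only for illustration:

```python
import numpy as np

# Hypothetical targets and predictions, for illustration only.
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_hat = np.array([2.5, 0.0, 2.0, 8.0])

residuals = y_true - y_hat
mse = np.mean(residuals ** 2)                    # mean squared error

ss_res = np.sum(residuals ** 2)                  # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2 = 1 - ss_res / ss_tot                         # coefficient of determination

print("MSE: {:.3f}".format(mse))   # MSE: 0.375
print("R2: {:.3f}".format(r2))     # R2: 0.949
```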
As you can see, with a few lines of code (and less than 2 minutes) we trained an ML model without any manual computation.
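Since load_boston is no longer available in scikit-learn 1.2 and later, the same workflow can be sketched on another bundled dataset. The version below swaps in load_diabetes, which is an assumption of this example and not the data used above, so the metrics it prints will differ from the ones reported in this article:

```python
from sklearn.datasets import load_diabetes
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Swapped-in dataset (an assumption, not the article's data): 442 samples, 10 features.
dataset = load_diabetes()

# Same steps as above: project onto one component, split, fit, predict.
projected = PCA(n_components=1).fit_transform(dataset.data)
X_train, X_test, y_train, y_test = train_test_split(
    projected, dataset.target, random_state=0)

lm = LinearRegression().fit(X_train, y_train)
y_pred = lm.predict(X_test)

print(projected.shape)  # (442, 1)
print("MSE: {:.2f}".format(mean_squared_error(y_test, y_pred)))
print("R2: {:.2f}".format(r2_score(y_test, y_pred)))
```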
Seaborn
Like matplotlib, seaborn is a Python package for data visualization. However, it is designed specifically for statistical plots, so a single call often conveys more information about your data.
In particular, it is very handy for showing possible correlations among data: with pairplot() and heatmap() you can have a first, significant glimpse of relationships among all the features (and targets):
import seaborn as sns
sns.set(style="ticks")
df = sns.load_dataset("iris")
sns.pairplot(df, hue="species")

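sns.heatmap(df.drop(columns='species').corr(), annot=True) # drop the text column: recent pandas versions raise an error on non-numeric data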
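Correlations are only defined for numeric columns, so a text column such as species has to be left out before calling corr(). A minimal sketch on a hypothetical mini-dataset (the frame toy and its values are made up for illustration):

```python
import pandas as pd

# Hypothetical mini-dataset: two numeric columns and one text column.
toy = pd.DataFrame({
    'length': [1.0, 2.0, 3.0, 4.0],
    'width': [2.0, 4.0, 6.0, 8.0],   # exactly 2 * length
    'species': ['a', 'a', 'b', 'b'],
})

# Keep only the numeric columns before computing Pearson correlations.
corr = toy.select_dtypes(include='number').corr()
print(corr.shape)                    # (2, 2)
print(corr.loc['length', 'width'])   # close to 1: width is an exact multiple of length
```

The resulting square matrix is exactly what heatmap() colors cell by cell.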

We can also visualize the joint distribution of data (let’s say, of the two features sepal_length and sepal_width):
sns.jointplot(x='sepal_length', y='sepal_width', data=df, height=5) # height replaces the older size argument

Finally, let’s have a look at the distribution of sepal_length values for each species:
ax=sns.boxplot(x='species',y='sepal_length',data=df)
ax=sns.stripplot(x='species',y='sepal_length',data=df,jitter=True,edgecolor='gray')
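The numbers behind such a plot can also be obtained directly with a pandas groupby. The sketch below uses hypothetical measurements (not the iris data) and computes a few of the statistics that a box plot summarizes:

```python
import pandas as pd

# Hypothetical measurements, for illustration only (not the iris data).
toy = pd.DataFrame({
    'species': ['a', 'a', 'b', 'b'],
    'sepal_length': [5.0, 5.2, 6.0, 6.4],
})

# One row per species with some of the statistics a box plot summarizes.
summary = toy.groupby('species')['sepal_length'].agg(['min', 'median', 'max'])
print(summary)
```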

Seaborn displays relevant information quickly and effectively and, if you are performing exploratory analysis, it might save you a lot of time by giving you clues about the best algorithm to pick.
Needless to say, covering everything these packages can do would be almost impossible. However, it is important to know which tools you need and how to use them during your analysis. As a good practice, remember that whatever kind of computation your analysis requires, Python most likely provides a quick and smart way to do it: learning by doing is a very good strategy to explore those tools.
If you are interested in learning more about those packages, here is the official documentation: