Machine Learning models’ ultimate goal is making reliable predictions on new, unknown data. With this purpose in mind, we want our algorithm to capture relations in existing data and replicate them among new entries. At the same time, we do not want our algorithm to have, let’s say, prejudices because of the data it trained on.
In the first case, we are trying to reduce the bias of our model, which is the difference between the average prediction and the actual value. In the second case, we are trying to reduce the variance, which captures the fluctuations of predictions around the mean value. Both the measurements concur in increasing the error of our model, but, unfortunately, they are negatively correlated: you cannot decrease the one without increasing the other.
Let’s examine the statistical concepts of bias and variance more deeply.
As mentioned above, the bias is the difference between average prediction and actual value. To use statistical terminology, whenever we want to estimate a parameter α with an estimator α̂, we define the bias of that estimator as the difference between its average (or expected value) and the real parameter:
Hence, when the following condition is true:
We say the estimator is unbiased. Namely, the sample mean of a given population with mean μ and variance σ^2 is an unbiased estimator of the real mean μ. Indeed, given a sample of observations, independent and randomly selected, x1, x2….xn of the above population, we want to demonstrate our sample mean x̄ is an unbiased estimator of the real mean μ, hence:
So, keeping in mind that:
- the expected value of a sum of independent events is equal to the sum of the expected values of each event
- within our example, since the observations are randomly selected, the expected value of each observation is equal to the mean of the population
We can easily see how our estimator is unbiased, since:
Coming back to our ideal model, whenever we face a situation where the bias is close to zero, we have something like that:
As you can see, the difference between each true data point and the value predicted by the model is almost zero. However, this model is far from being well generalized: indeed, if new data points need to be predicted, our model, heavily biased by the train set, won’t fit these new data well. So, in the context of Machine Learning, bias is a type of error that occurs due to erroneous assumptions in the learning algorithm.
This measurement shows how far a set of data points are spread out from their average value. In numbers:
More specifically, in the context of Machine Learning, the variance is a type of error that occurs due to a model’s sensitivity to small fluctuations in the training set. If we look at the previous picture, we can see that, with low bias, a model will inevitably suffer a great value of variance, which leads to the problem of overfitting. On the contrary, if we face such a situation:
We will have, for sure, a lower value of variance (the model is not so sensitive to changes in data points: translated, if a data point suddenly deviates from the common pattern, our linear equation will not change its steep so much). However, the bias will be greater, since the model is missing some relevant relations between inputs and output.
To sum up:
- A low bias-high variance situation leads to overfitting: the model is too adapted to training data and won’t fit new data well;
- A high bias-low variance situation leads to underfitting: the model is not capturing all the relations useful to explain the output variables.
The problem is that we want to minimize both of them. Indeed, if we think about a Linear Regression, the optimization strategy is reducing the error, represented by the MSE:
But knowing that:
By adding those two quantities, we obtain exactly the MSE, as:
Where the third term, called irreducible error, it is a noise term existing among the true variables which cannot be captured by any model.
There are no easy recipes to find the perfect balance between variance and bias: each task and each algorithm might find a different equilibrium. Anyway, while training your model, you should always keep in mind that you want to avoid both underfitting and overfitting, aiming at the following result:
We are falling once more in one of those grey areas of machine learning where only concrete applications – as well as several attempts – can tell us which is the optimum equilibrium.