Bootstrap methods are powerful techniques used in non-parametric statistics, that is, whenever we are given data drawn from an unknown distribution.
The underlying issue that the bootstrap is meant to address is a well-known problem in statistics: we want to collect information about a population, but we are provided only with a sample of this population. How can we be sure that this sample is representative of the whole population? Namely, if we compute the mean of our sample, does it approximate the true mean of the population well?
Moreover, we have to consider that, besides statistics such as means, standard deviations, Pearson coefficients etc., in a non-parametric framework the cumulative distribution function (CDF) is itself unknown, hence it automatically becomes one of the quantities to infer from our sample.
So, the idea of Bootstrap is that, instead of estimating our statistic only once, on the sample realization we obtained, we can do it many times on a re-sampling (with replacement) of the original sample. With this approach, repeated B times, we will obtain a vector of estimates of length B, of which we can compute the expected value, variance, empirical distribution and so forth.
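In symbols, the summaries mentioned above could be written as follows. The notation is an assumption of this sketch rather than the article's own: \hat{\theta}^{*}_{b} stands for the statistic computed on the b-th re-sample, with b = 1, …, B.

```latex
\bar{\theta}^{*} = \frac{1}{B}\sum_{b=1}^{B}\hat{\theta}^{*}_{b},
\qquad
\widehat{\mathrm{Var}}^{*} = \frac{1}{B-1}\sum_{b=1}^{B}\left(\hat{\theta}^{*}_{b}-\bar{\theta}^{*}\right)^{2}
```

The square root of the second quantity is what is usually called the bootstrap estimate of the standard error of the statistic.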
So let’s proceed to compute the bootstrap mean starting from a sample (X1,…,Xn) of independent and identically distributed random variables (we will keep this assumption for the whole procedure), drawn from a population according to an unknown distribution function F(x).
- We first work out the empirical distribution function, which is given by:
$$\hat{F}_n(t) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\{X_i \le t\}$$
where 1 is the indicator function, which takes the value 1 if Xi is less than or equal to t and 0 otherwise. An example of what an empirical CDF might look like is the following:
As you can see, each step has the same size, equal to 1/n.
- Then we draw, from our empirical CDF, a new sample of the same size as the original one. However, since every step of our empirical CDF has the same height (1/n), sampling from the empirical CDF is the same as re-sampling (with replacement and equal probabilities) from the sample. We denote the re-sampled vector as (X*1, …, X*n).
- We re-sample B times and, for each set of X*, we compute our statistic of interest (in our case, the sample mean). Note that this phase applies Monte Carlo methods to bootstrapping.
- If n (the sample size) is sufficiently large, we can rely on the asymptotic properties of sums of random variables (in particular, the Central Limit Theorem), and if B (the number of re-samples) is sufficiently large, the bootstrap estimates give a good approximation of the distribution of our statistic. Knowing this distribution is pivotal if we want to run hypothesis tests about how close our statistic is likely to be to the true value of the parameter.
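The first two steps can be checked numerically before moving on. What follows is a minimal sketch under stated assumptions: the helper name ecdf and the five-point sample x are made up for illustration, and the values are distinct so that every step of the empirical CDF has height exactly 1/n.

```python
import numpy as np

def ecdf(x, t):
    """Empirical CDF of the sample x at t: share of observations <= t."""
    x = np.asarray(x)
    return np.mean(x <= t)

# Small made-up sample with distinct values (illustration only)
x = np.array([3.0, 1.0, 4.0, 1.5, 9.0])
n = len(x)

# Between consecutive sorted observations the ECDF rises by 1/n
xs = np.sort(x)
steps = np.diff([ecdf(x, t) for t in xs])

# Drawing from the ECDF by inverse transform: a uniform u lands in the
# k-th step with probability 1/n, so it selects the k-th sorted observation
np.random.seed(0)
u = np.random.uniform(size=100_000)
idx = np.minimum(np.floor(u * n).astype(int), n - 1)
draws = xs[idx]

# Each observation should be selected with frequency close to 1/n
freq = np.array([np.mean(draws == v) for v in xs])
print("step heights:", steps)
print("selection frequencies:", freq)
```

Since every observation is selected with probability 1/n, the draws are distributed like the output of np.random.choice(x, size=...) with replacement, which is why the implementation can skip the empirical CDF entirely.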
The following diagram might be useful to illustrate all the steps above:
Nice, now let’s implement it with Python.
For this purpose, I will generate a random vector that will serve as our population. Then, I will draw a sample from it, whose distribution law (expected to be the same as that of the population) we treat as unknown, and apply the bootstrapping procedure:
import numpy as np

np.random.seed(123)
pop = np.random.randint(0, 500, size=1000)  # population: 1000 integers in [0, 500)
sample = np.random.choice(pop, size=300)  # so n=300 (drawn with replacement, the default)
Now I should compute the empirical CDF, so that I can sample from it. However, as we said above, sampling from the empirical CDF is the same as re-sampling with replacement from our original sample, hence:
sample_mean = []
for _ in range(10000):  # so B=10000
    sample_n = np.random.choice(sample, size=300)  # re-sample with replacement
    sample_mean.append(sample_n.mean())
I basically created an empty list and, for each re-sampling of my initial sample, I appended its sample mean to that list. Now let’s have a look at the distribution and expected value of our vector of means (which is nothing but a random variable itself):
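One way to look at this distribution without a plotting library is a text histogram. The sketch below is an illustration under stated assumptions: the name boot_means and the vectorized re-sampling (one row per re-sample) are not from the original code, so the numbers need not match the article's digit for digit.

```python
import numpy as np

np.random.seed(123)
pop = np.random.randint(0, 500, size=1000)
sample = np.random.choice(pop, size=300)

# Vectorized re-sampling: one row per bootstrap re-sample (B=10000, n=300)
boot_means = np.random.choice(sample, size=(10000, 300)).mean(axis=1)

# Bin the bootstrap means and print one bar per bin
counts, edges = np.histogram(boot_means, bins=15)
for c, left in zip(counts, edges[:-1]):
    bar = "#" * int(60 * c / counts.max())
    print(f"{left:7.1f} | {bar}")

print("mean of bootstrap means:", boot_means.mean())
```

The bars should rise towards the middle bins and fall off on both sides, which is the bell shape one would look for in a plotted histogram.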
The distribution seems normal, and this is exactly what we were expecting: by the Central Limit Theorem, when independent random variables are added, their suitably normalized sum tends toward a normal distribution even if the original variables themselves are not normally distributed.
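The visual impression can be backed by two quick numerical checks. This is a rough sketch, not a formal normality test: it compares the spread of the bootstrap means with the value s/sqrt(n) suggested by the Central Limit Theorem and computes their skewness, which should be near zero for a bell-shaped distribution. The names boot_means, se_boot, se_clt and skew are assumptions of this sketch.

```python
import numpy as np

np.random.seed(123)
pop = np.random.randint(0, 500, size=1000)
sample = np.random.choice(pop, size=300)
n, B = len(sample), 10000

# One row per bootstrap re-sample, then the mean of each row
boot_means = np.random.choice(sample, size=(B, n)).mean(axis=1)

# Spread of the bootstrap means vs. the CLT-style value s / sqrt(n)
se_boot = boot_means.std(ddof=1)
se_clt = sample.std(ddof=1) / np.sqrt(n)

# Sample skewness of the bootstrap means (about 0 for a symmetric shape)
z = (boot_means - boot_means.mean()) / boot_means.std()
skew = np.mean(z ** 3)

print("bootstrap standard error:", se_boot)
print("s / sqrt(n):             ", se_clt)
print("skewness:                ", skew)
```

If the two standard errors are close and the skewness is small, the normal approximation is at least plausible for this sample size.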
Then, if we compare the true mean with the bootstrapped one we obtain:
np.mean(sample_mean)
255.73952966666664
pop.mean()
253.241
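As you can see, it is pretty close. For comparison, we can also retrieve the mean of the original sample with sample.mean().
Keep in mind that the bootstrap means are centred on the mean of the original sample, so the two values differ only by Monte Carlo noise: re-sampling cannot make up for a sample that is not representative of the population. What the bootstrap adds is an estimate of how much the sample mean varies from sample to sample, that is, its standard error and distribution.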
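To see the three numbers side by side, the sketch below prints the population mean, the original sample mean and the average of the bootstrap means, then adds a percentile interval taken from the bootstrap distribution. The name boot_means and the vectorized re-sampling are assumptions of this sketch, so the output need not match the figures quoted above digit for digit.

```python
import numpy as np

np.random.seed(123)
pop = np.random.randint(0, 500, size=1000)
sample = np.random.choice(pop, size=300)

# One row per bootstrap re-sample (B=10000, n=300), then the mean of each row
boot_means = np.random.choice(sample, size=(10000, 300)).mean(axis=1)

# Point estimates side by side
print("population mean:", pop.mean())
print("sample mean:    ", sample.mean())
print("bootstrap mean: ", boot_means.mean())

# Percentile interval: the middle 95% of the bootstrap means
lower, upper = np.percentile(boot_means, [2.5, 97.5])
print("95% percentile interval:", lower, upper)
```

The percentile interval is only one of several ways to build a bootstrap interval. It is used here because it needs nothing beyond the vector of bootstrap means.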
Bootstrap sampling is an important technique for working around the difficulties of the non-parametric approach. Indeed, even though a non-parametric approach lets us relax some of the strict assumptions required by a parametric framework, we pay for this extra flexibility with greater difficulty in estimating the population’s features.