This notebook compares the efficacy and accuracy of two methods for estimating population size given a collection of samples taken without replacement. Additional details about these two methods ("BBC" and "Cuthbert") may be found elsewhere in the project documentation.
from itertools import product
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_theme(
context='notebook',
style='darkgrid',
palette='mako',
)
plt.rcParams['figure.figsize'] = [8, 8]
import iceberg.simulate as sim
We're going to be working with a hypothetical population of 10,000 entities, from which we're going to draw a series of uniformly sized samples without replacement. That is, no entity can appear more than once in a given sample. In order to compare the efficacy of the two methods, we will need to track, in addition to the true population size, the parameters of each simulation and the estimates each method produces.
Correction via cross-validation is not necessary for Cuthbert's method in this situation; that correction is designed to test assumptions of randomness which we know hold here by construction.
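For intuition, here is a minimal sketch of what that sampling scheme could look like in plain NumPy; this is purely illustrative and is not how the iceberg package implements it.
# Illustrative sketch only: entities are unique within a sample, but may reappear across samples
rng = np.random.default_rng(0)
population = np.arange(10_000)
samples = [rng.choice(population, size=50, replace=False) for _ in range(100)]
print(len(np.unique(np.concatenate(samples))))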
The `iceberg` package already has simulation functions that test this for us.
Let's see what happens with 100 samples of 50 entities.
np.random.seed(1729)
simulation = sim.random_samples(10000, 50, 100)
print(
'Entities Observed: {}\nCuthbert Estimate: {}\nBBC Estimate: {:.0f}'.format(
simulation['entities_observed'],
simulation['cuthbert'],
simulation['bbc']
)
)
Entities Observed: 3929
Cuthbert Estimate: 9854
BBC Estimate: 5303
The Cuthbert estimate is reasonably close, within 2% of the actual value. Far from a terrible performance, especially considering that fewer than two in every five entities were observed in at least one sample. The BBC estimate, on the other hand, is alarmingly off (by almost 50%). Is this a quirk of this particular sample set, or is it a marker of a more significant problem?
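As a quick check on those percentages, the relative errors can be computed directly from the simulation output:
# Relative error of each estimate against the known population size of 10,000
true_pop = 10000
for method in ('cuthbert', 'bbc'):
    rel_err = abs(simulation[method] - true_pop) / true_pop
    print('{} relative error: {:.1%}'.format(method, rel_err))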
Let's run a bunch of simulations and see.
# Parameters to drive the simulations
pop_sizes = [10000, 25000]
sample_nums = [20, 50, 100, 200]
sample_sizes = [10, 25, 50, 100]
simulations_per_configuration = 32
# Logging structure
simulations = {
'pop_size': [],
'num_samples': [],
'sample_size': [],
'entities_observed': [],
'est_type': [],
'pop_est': []
}
# Run the indicated number of simulations for each configuration of parameters
np.random.seed(1729)
for pop_size, num_samples, sample_size, i in product(
    pop_sizes,
    sample_nums,
    sample_sizes,
    range(simulations_per_configuration)
):
    simulation = sim.random_samples(pop_size, sample_size, num_samples)
    simulations['pop_size'] += 2 * [pop_size]
    simulations['num_samples'] += 2 * [num_samples]
    simulations['sample_size'] += 2 * [sample_size]
    simulations['entities_observed'] += 2 * [simulation['entities_observed']]
    simulations['est_type'] += ['bbc', 'cuthbert']
    simulations['pop_est'] += [
        simulation['bbc'], simulation['cuthbert']
    ]
    if i == simulations_per_configuration - 1:
        print('Completed simulations with configuration: {}, {}, {}'.format(
            pop_size, num_samples, sample_size
        ))
Completed simulations with configuration: 10000, 20, 10
Completed simulations with configuration: 10000, 20, 25
Completed simulations with configuration: 10000, 20, 50
Completed simulations with configuration: 10000, 20, 100
Completed simulations with configuration: 10000, 50, 10
Completed simulations with configuration: 10000, 50, 25
Completed simulations with configuration: 10000, 50, 50
Completed simulations with configuration: 10000, 50, 100
Completed simulations with configuration: 10000, 100, 10
Completed simulations with configuration: 10000, 100, 25
Completed simulations with configuration: 10000, 100, 50
Completed simulations with configuration: 10000, 100, 100
Completed simulations with configuration: 10000, 200, 10
Completed simulations with configuration: 10000, 200, 25
Completed simulations with configuration: 10000, 200, 50
Completed simulations with configuration: 10000, 200, 100
Completed simulations with configuration: 25000, 20, 10
Completed simulations with configuration: 25000, 20, 25
Completed simulations with configuration: 25000, 20, 50
Completed simulations with configuration: 25000, 20, 100
Completed simulations with configuration: 25000, 50, 10
Completed simulations with configuration: 25000, 50, 25
Completed simulations with configuration: 25000, 50, 50
Completed simulations with configuration: 25000, 50, 100
Completed simulations with configuration: 25000, 100, 10
Completed simulations with configuration: 25000, 100, 25
Completed simulations with configuration: 25000, 100, 50
Completed simulations with configuration: 25000, 100, 100
Completed simulations with configuration: 25000, 200, 10
Completed simulations with configuration: 25000, 200, 25
Completed simulations with configuration: 25000, 200, 50
Completed simulations with configuration: 25000, 200, 100
simulations = pd.DataFrame(simulations)
simulations.head(8)
|   | pop_size | num_samples | sample_size | entities_observed | est_type | pop_est |
|---|---|---|---|---|---|---|
| 0 | 10000 | 20 | 10 | 198 | bbc | 277 |
| 1 | 10000 | 20 | 10 | 198 | cuthbert | 9440 |
| 2 | 10000 | 20 | 10 | 200 | bbc | 281 |
| 3 | 10000 | 20 | 10 | 200 | cuthbert | 20000 |
| 4 | 10000 | 20 | 10 | 199 | bbc | 279 |
| 5 | 10000 | 20 | 10 | 199 | cuthbert | 18940 |
| 6 | 10000 | 20 | 10 | 197 | bbc | 276 |
| 7 | 10000 | 20 | 10 | 197 | cuthbert | 6274 |
Now let's compare the performance of the two methods systematically by examining the mean and standard deviation of the estimates across the simulations run for each configuration of parameters.
grouped_simulations = simulations.groupby(
    ['pop_size', 'num_samples', 'sample_size', 'est_type']
)['pop_est'].agg(['mean', 'std']).reset_index()
Let's take a look at the Cuthbert estimates first.
grouped_simulations[grouped_simulations['est_type'] == 'cuthbert']
|   | pop_size | num_samples | sample_size | est_type | mean | std |
|---|---|---|---|---|---|---|
| 1 | 10000 | 20 | 10 | cuthbert | 12913.53125 | 6852.469849 |
| 3 | 10000 | 20 | 25 | cuthbert | 10712.84375 | 3017.472828 |
| 5 | 10000 | 20 | 50 | cuthbert | 10523.28125 | 1410.987420 |
| 7 | 10000 | 20 | 100 | cuthbert | 9676.90625 | 521.806438 |
| 9 | 10000 | 50 | 10 | cuthbert | 9814.15625 | 2350.238987 |
| 11 | 10000 | 50 | 25 | cuthbert | 10303.03125 | 1040.389830 |
| 13 | 10000 | 50 | 50 | cuthbert | 9966.00000 | 492.398475 |
| 15 | 10000 | 50 | 100 | cuthbert | 9948.71875 | 252.529172 |
| 17 | 10000 | 100 | 10 | cuthbert | 9825.09375 | 1114.793728 |
| 19 | 10000 | 100 | 25 | cuthbert | 9951.00000 | 512.452232 |
| 21 | 10000 | 100 | 50 | cuthbert | 9943.78125 | 300.866998 |
| 23 | 10000 | 100 | 100 | cuthbert | 10014.78125 | 99.667749 |
| 25 | 10000 | 200 | 10 | cuthbert | 10242.71875 | 755.936217 |
| 27 | 10000 | 200 | 25 | cuthbert | 9934.59375 | 296.679240 |
| 29 | 10000 | 200 | 50 | cuthbert | 10002.40625 | 118.800961 |
| 31 | 10000 | 200 | 100 | cuthbert | 9997.40625 | 45.426196 |
| 33 | 25000 | 20 | 10 | cuthbert | 17524.00000 | 4532.301701 |
| 35 | 25000 | 20 | 25 | cuthbert | 26398.28125 | 11262.689718 |
| 37 | 25000 | 20 | 50 | cuthbert | 26312.18750 | 6606.230175 |
| 39 | 25000 | 20 | 100 | cuthbert | 24928.40625 | 3276.330696 |
| 41 | 25000 | 50 | 10 | cuthbert | 28367.15625 | 11937.844902 |
| 43 | 25000 | 50 | 25 | cuthbert | 23586.50000 | 3368.768129 |
| 45 | 25000 | 50 | 50 | cuthbert | 25050.00000 | 2102.376796 |
| 47 | 25000 | 50 | 100 | cuthbert | 25130.06250 | 964.120891 |
| 49 | 25000 | 100 | 10 | cuthbert | 26865.06250 | 6128.224475 |
| 51 | 25000 | 100 | 25 | cuthbert | 25378.00000 | 2205.672760 |
| 53 | 25000 | 100 | 50 | cuthbert | 25080.34375 | 1183.615854 |
| 55 | 25000 | 100 | 100 | cuthbert | 25090.56250 | 518.572976 |
| 57 | 25000 | 200 | 10 | cuthbert | 25427.87500 | 2428.305047 |
| 59 | 25000 | 200 | 25 | cuthbert | 24902.31250 | 1141.679590 |
| 61 | 25000 | 200 | 50 | cuthbert | 24980.50000 | 496.330535 |
| 63 | 25000 | 200 | 100 | cuthbert | 25045.96875 | 235.058741 |
The performance is a little variable, especially when the number of samples and the sample sizes are both small (that is, when the number of distinct entities observed is small). However, the estimates do seem to become both more accurate and more consistent as those parameters increase; notice the stabilizing mean and the shrinking standard deviation.
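To make that trend concrete, we can compute the relative bias of the mean estimate and the coefficient of variation for each configuration; rel_bias and cv below are just names introduced for this sketch.
# Relative bias of the mean estimate and coefficient of variation per configuration
cuthbert = grouped_simulations[grouped_simulations['est_type'] == 'cuthbert'].copy()
cuthbert['rel_bias'] = (cuthbert['mean'] - cuthbert['pop_size']) / cuthbert['pop_size']
cuthbert['cv'] = cuthbert['std'] / cuthbert['mean']
cuthbert[['pop_size', 'num_samples', 'sample_size', 'rel_bias', 'cv']]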
Let's take a look at the BBC estimates.
grouped_simulations[grouped_simulations['est_type'] == 'bbc']
|   | pop_size | num_samples | sample_size | est_type | mean | std |
|---|---|---|---|---|---|---|
| 0 | 10000 | 20 | 10 | bbc | 277.37500 | 3.139242 |
| 2 | 10000 | 20 | 25 | bbc | 682.18750 | 4.645063 |
| 4 | 10000 | 20 | 50 | bbc | 1330.00000 | 8.576036 |
| 6 | 10000 | 20 | 100 | bbc | 2505.71875 | 14.984367 |
| 8 | 10000 | 50 | 10 | bbc | 680.00000 | 5.327954 |
| 10 | 10000 | 50 | 25 | bbc | 1635.75000 | 11.609229 |
| 12 | 10000 | 50 | 50 | bbc | 3047.43750 | 20.468603 |
| 14 | 10000 | 50 | 100 | bbc | 5328.06250 | 35.364405 |
| 16 | 10000 | 100 | 10 | bbc | 1322.25000 | 8.647207 |
| 18 | 10000 | 100 | 25 | bbc | 3042.96875 | 20.482068 |
| 20 | 10000 | 100 | 50 | bbc | 5315.18750 | 42.899310 |
| 22 | 10000 | 100 | 100 | bbc | 8271.59375 | 40.152820 |
| 24 | 10000 | 200 | 10 | bbc | 2508.93750 | 20.572355 |
| 26 | 10000 | 200 | 25 | bbc | 5307.78125 | 41.870564 |
| 28 | 10000 | 200 | 50 | bbc | 8252.31250 | 46.985542 |
| 30 | 10000 | 200 | 100 | bbc | 10631.96875 | 38.822578 |
| 32 | 25000 | 20 | 10 | bbc | 279.62500 | 1.699146 |
| 34 | 25000 | 20 | 25 | bbc | 692.59375 | 4.031004 |
| 36 | 25000 | 20 | 50 | bbc | 1371.25000 | 6.530326 |
| 38 | 25000 | 20 | 100 | bbc | 2680.93750 | 15.020282 |
| 40 | 25000 | 50 | 10 | bbc | 693.03125 | 3.963416 |
| 42 | 25000 | 50 | 25 | bbc | 1699.34375 | 7.351144 |
| 44 | 25000 | 50 | 50 | bbc | 3311.31250 | 15.219761 |
| 46 | 25000 | 50 | 100 | bbc | 6270.71875 | 26.237420 |
| 48 | 25000 | 100 | 10 | bbc | 1370.96875 | 5.921445 |
| 50 | 25000 | 100 | 25 | bbc | 3311.93750 | 15.649255 |
| 52 | 25000 | 100 | 50 | bbc | 6262.46875 | 32.430555 |
| 54 | 25000 | 100 | 100 | bbc | 11237.56250 | 49.861057 |
| 56 | 25000 | 200 | 10 | bbc | 2678.96875 | 11.479250 |
| 58 | 25000 | 200 | 25 | bbc | 6254.40625 | 31.052359 |
| 60 | 25000 | 200 | 50 | bbc | 11216.34375 | 48.802835 |
| 62 | 25000 | 200 | 100 | bbc | 18231.34375 | 68.684698 |
This looks substantially worse. Not only are the estimates much less accurate on average, but their standard deviations actually grow as more information becomes known. Perhaps more disturbingly, the estimates for the two population sizes barely differ for matched configurations.
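To see just how insensitive BBC is to the true population size, we can line up its mean estimates for the two populations at matched configurations; a sketch, assuming a pandas version whose pivot accepts list-valued keys.
# BBC mean estimates side by side for both true population sizes, plus their ratio
bbc_means = grouped_simulations[grouped_simulations['est_type'] == 'bbc'].pivot(
    index=['num_samples', 'sample_size'], columns='pop_size', values='mean'
)
bbc_means['ratio'] = bbc_means[25000] / bbc_means[10000]
bbc_means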
Let's take another view of these results by plotting estimates of population size against the number of distinct entities observed.
k_plot = sns.scatterplot(
data=simulations[simulations['pop_size'] == 10000],
x='entities_observed',
y='pop_est',
hue='est_type',
palette='mako'
)
k_plot.set(
title='Estimates of Population Size (True Size = 10,000)',
xlabel='Number of Entities Observed',
ylabel='Estimated Population Size'
)
k_plot.axhline(10000)
handles, labels = k_plot.get_legend_handles_labels()
k_plot.legend(handles, ['BBC', 'Cuthbert'], loc='upper right')
plt.show();
k_plot = sns.scatterplot(
data=simulations[simulations['pop_size'] == 25000],
x='entities_observed',
y='pop_est',
hue='est_type',
palette='mako'
)
k_plot.axhline(25000)
k_plot.set(
title='Estimates of Population Size (True Size = 25,000)',
xlabel='Number of Entities Observed',
ylabel='Estimated Population Size'
)
handles, labels = k_plot.get_legend_handles_labels()
k_plot.legend(handles, ['BBC', 'Cuthbert'], loc='upper right')
plt.show();
Not only do the BBC estimates of population size appear to depend almost entirely on the number of entities observed, but that dependence also looks nearly linear. Cuthbert estimates, on the other hand, converge to the true population size fairly rapidly as the number of entities observed increases.
The Cuthbert method is clearly more accurate and behaves more sensibly. But there is more to be done with the BBC method; let us examine its seemingly linear behavior more closely.
bbc = simulations[simulations['est_type'] == 'bbc']
bbc_X = bbc['entities_observed'].values.reshape(-1, 1)
bbc_y = bbc['pop_est']
lm = LinearRegression()
lm.fit(bbc_X, bbc_y)
print('The regressed linear model is: pop_est = {:.4f}*entities_observed + {:.4f}'.format(lm.coef_[0], lm.intercept_))
print('The R^2 value is: {:.4f}'.format(lm.score(bbc_X, bbc_y)))
The regressed linear model is: pop_est = 1.3112*entities_observed + 109.4953
The R^2 value is: 0.9979
Such large R^2 values don't come along every day.
One incidental effect of this tight, nearly linear relationship is that the BBC method will almost always yield close to the same estimate of the proportion of the population that has been observed (the survival rate): if pop_est ≈ 1.31 × entities_observed, then entities_observed / pop_est ≈ 1 / 1.31 ≈ 0.76. The two ways of estimating that proportion tried below both land between 0.7 and 0.8.
print('The BBC method will yield an average estimated survival rate of:')
print('{:.4f}, by the estimates, and'.format(
np.mean(bbc['entities_observed'] / bbc['pop_est'])
))
print('{:.4f}, by the regression.'.format(
1 / lm.coef_[0]
))
The BBC method will yield an average estimated survival rate of:
0.7293, by the estimates, and
0.7627, by the regression.