This notebook uses the `iceberg` package to estimate survival rates for literate song across various languages in fifteenth-century Europe.
The capta are sourced from the alpha version of the Digital Index of Late Medieval Song (DILMS).
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_theme(
    context='notebook',
    style='darkgrid',
    palette='mako',
)
import iceberg.estimate as est
The capta in question are fairly simple: just pairs of `SongName` and `SourceName` for several linguistic buckets (these became `song_name` and `source_name` in later releases of DILMS).
A sample of the captaset for all languages is shown below; for more details about the delicate and fraught nature of language in DILMS, see that project's documentation.
# Sample of all-language captaset
pd.read_csv('../../DILMS Captasets/all_15c.csv').head(8)
| | SongName | SourceName |
---|---|---|
0 | Herte myne well may thou be | Strasbourg Index |
1 | Tappster dryngker fyll another ale annon | Selden Carol Book |
2 | So ys emprentid in my remembrance | Berlin Chansonnier |
3 | So ys emprentid in my remembrance | Escorial B |
4 | So ys emprentid in my remembrance | Florence 176 |
5 | So ys emprentid in my remembrance | Ascoli Piceno Fragment |
6 | So ys emprentid in my remembrance | Laborde Chansonnier !1 |
7 | So ys emprentid in my remembrance | Montecassino 871 |
Let's import all of the linguistically-segmented captasets, estimate survival rates using 256 cross-validation experiments, and plot.
# Prep
languages = ['all', 'english', 'french', 'german', 'italian']
estimates = {
    'Language': [],
    'extant': [],
    'estimate': []
}
# Process capta
for language in languages:
    capta = pd.read_csv('../../DILMS Captasets/' + language + '_15c.csv')
    pairs = capta.values.tolist()
    # Group songs by source and collect the distinct extant songs
    source_d = {}
    songs = set()
    for pair in pairs:
        songs.add(pair[0])
        if pair[1] not in source_d:
            source_d[pair[1]] = [pair[0]]
        else:
            source_d[pair[1]].append(pair[0])
    sources = list(source_d.values())
    # Reseed before each language so the runs are reproducible
    np.random.seed(1729)
    ice = est.cuthbert(sources, cv=256)
    # Record one row per cross-validation estimate
    for i in range(len(ice['corrected'])):
        estimates['Language'].append(language.capitalize())
        estimates['extant'].append(len(songs))
        estimates['estimate'].append(ice['corrected'][i])
estimates = pd.DataFrame(estimates)
# Inspect result
estimates.head()
| | Language | extant | estimate |
---|---|---|---|
0 | All | 2185 | 2617 |
1 | All | 2185 | 2661 |
2 | All | 2185 | 2596 |
3 | All | 2185 | 2731 |
4 | All | 2185 | 2645 |
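The grouping step in the processing loop above turns (song, source) pairs into one list of songs per source, plus a set of distinct songs. As a minimal sketch of the same idea, the standard library's `defaultdict` removes the explicit membership check. The pairs and the names `toy_pairs`, `songs_by_source`, `distinct_songs`, and `toy_sources` are invented for illustration and are not DILMS data.

```python
from collections import defaultdict

# Toy (song, source) pairs, invented for illustration only
toy_pairs = [
    ('Song A', 'Source X'),
    ('Song B', 'Source X'),
    ('Song A', 'Source Y'),
    ('Song C', 'Source Y'),
]

songs_by_source = defaultdict(list)  # source name -> songs it transmits
distinct_songs = set()               # every song seen at least once

for song, source in toy_pairs:
    distinct_songs.add(song)
    songs_by_source[source].append(song)

# One list of songs per source, in order of first appearance
toy_sources = list(songs_by_source.values())

print(toy_sources)
print(len(distinct_songs))
```

The resulting list of lists has the same shape as the `sources` argument built in the loop, and `len(distinct_songs)` plays the role of the `extant` count.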
# Add column indicating the estimated survival rate
estimates['est_p'] = estimates['extant'] / estimates['estimate']
estimates.head()
| | Language | extant | estimate | est_p |
---|---|---|---|---|
0 | All | 2185 | 2617 | 0.834925 |
1 | All | 2185 | 2661 | 0.821120 |
2 | All | 2185 | 2596 | 0.841680 |
3 | All | 2185 | 2731 | 0.800073 |
4 | All | 2185 | 2645 | 0.826087 |
# Inspect aggregate results
estimates.groupby('Language').mean().reset_index()
| | Language | extant | estimate | est_p |
---|---|---|---|---|
0 | All | 2185.0 | 2728.371094 | 0.801362 |
1 | English | 47.0 | 158.523438 | 0.303094 |
2 | French | 1347.0 | 1607.660156 | 0.838306 |
3 | German | 329.0 | 476.304688 | 0.698782 |
4 | Italian | 215.0 | 347.675781 | 0.620876 |
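One subtlety in the aggregate table: the `est_p` column is a mean of per-run ratios, which is not the same as dividing `extant` by the mean `estimate` (for the All row, 2185 / 2728.371094 is roughly 0.8008, slightly below the tabulated 0.801362). The sketch below illustrates the gap using only the five All-language runs printed earlier, so its numbers describe those five runs and not the full set of 256; the variable names are chosen here for illustration.

```python
from statistics import mean

extant = 2185
# First five cross-validation estimates for the All corpus,
# copied from the estimates.head() output earlier in the notebook
run_estimates = [2617, 2661, 2596, 2731, 2645]

# Mean of the per-run survival rates (what the est_p column averages)
mean_of_ratios = mean([extant / e for e in run_estimates])

# Survival rate implied by the mean estimate
ratio_of_means = extant / mean(run_estimates)

print(round(mean_of_ratios, 4))
print(round(ratio_of_means, 4))
```

For these five runs the mean of ratios (about 0.8248) comes out slightly above the ratio of means (about 0.8245). The difference is small here, but it grows as the spread of the estimates grows, which is worth keeping in mind for the smaller corpora.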
# Plot the distributions for a finer-grained look
plt.rcParams['figure.figsize'] = [12, 8]
p_plot = sns.displot(
    estimates,
    x='est_p',
    hue='Language',
    kind='kde',
)
plt.xlim(0, 1)
p_plot.set(
    title='Estimated Survival Rates of 15th-Century Song',
    xlabel='Estimated Survival Rate',
    ylabel='Simulated Probability Density',
    yticklabels=[]
)
plt.show()
Taken as a whole, quite a lot of literate 15th-century song survived the ravages of history: roughly 80% on average across the cross-validation runs! The estimated survival rate is highly dependent on language, though. Francophone songs (about 84%) are more likely to survive than German (about 70%) or Italian (about 62%) songs, and Anglophone songs are particularly unlikely to survive (about 30%, from only 47 extant songs).