repliclust.overlap#
repliclust.overlap.centers#
This module implements a ClusterCenterSampler that places cluster centers by minimizing an objective function until the clusters reach the desired degree of pairwise overlap.
- class repliclust.overlap.centers.ConstrainedOverlapCenters(max_overlap=0.1, min_overlap=0.09, packing=0.1, **optimization_args)#
Bases:
ClusterCenterSampler
Optimizes the locations of cluster centers so that pairs of clusters reach the desired degrees of overlap.
- Parameters
max_overlap (float between 0 and 1) – The maximum allowed overlap between two clusters, measured as a fraction of cluster mass.
min_overlap (float) – The minimum overlap each cluster must have with at least one other cluster, which prevents it from being isolated. The overlap is measured as a fraction of cluster mass.
packing (float) – Sets the ratio of total cluster volume to the sampling volume. Used when choosing random cluster centers for initializing the optimization.
learning_rate (float) – The rate at which cluster centers are optimized. If numerical instabilities appear, it is recommended to lower this number.
max_epoch (int) – The maximum number of optimization epochs to run. Larger values allow more iterations and can increase the run time.
tol (float) – Numerical tolerance for achieving the desired overlap between pairs of clusters.
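The role of packing during initialization can be illustrated with a minimal sketch. It is not the library's implementation: it assumes each cluster is approximated by a ball of a given radius and that centers are drawn uniformly from a cube whose volume makes the ratio of total cluster volume to sampling volume equal to packing. The names ball_volume, sample_initial_centers, and radii are hypothetical.

```python
import math
import numpy as np

def ball_volume(radius, p):
    """Volume of a p-dimensional ball with the given radius."""
    return math.pi ** (p / 2) / math.gamma(p / 2 + 1) * radius ** p

def sample_initial_centers(radii, p, packing=0.1, seed=0):
    """Draw one random center per cluster inside a sampling cube.

    Assumption (not from the docs): clusters are treated as balls, and the
    cube is sized so that total cluster volume / cube volume == packing.
    """
    rng = np.random.default_rng(seed)
    total_cluster_volume = sum(ball_volume(r, p) for r in radii)
    sampling_volume = total_cluster_volume / packing
    side = sampling_volume ** (1 / p)  # edge length of the cube
    centers = rng.uniform(-side / 2, side / 2, size=(len(radii), p))
    return centers, side

centers, side = sample_initial_centers(radii=[1.0, 1.0, 2.0], p=2, packing=0.1)
```

Under these assumptions, a smaller packing value enlarges the sampling region, so the initial centers start farther apart on average.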
- sample_cluster_centers(archetype, print_progress=False)#
Sample cluster centers at random and iteratively adjust them until the desired degrees of overlap between clusters are satisfied.
- Parameters
archetype (Archetype) – Archetype conveying the desired number of clusters and other attributes.
print_progress (bool) – If true, print the progress during optimization.
- Returns
centers – The optimized cluster centers.
- Return type
ndarray
repliclust.overlap.gradients#
- repliclust.overlap.gradients.assess_obs_overlap(centers, cov_inv)#
Assess the observed minimum and maximum overlap between clusters.
- repliclust.overlap.gradients.chi2term_vectorized(mharsum_vec, p)#
Compute the chi2 term of the overlap gradients for a reference cluster center versus all other centers (vectorized).
- Parameters
mharsum_vec (ndarray, shape (1, k-1)) – Harmonic sum of Mahalanobis distances (vectorized).
p (int) – Dimensionality of the data (degrees of freedom for chi2).
- Returns
out – Chi2(p) density evaluated at the appropriate quantile cutoffs.
- Return type
ndarray, shape (1, k-1)
- repliclust.overlap.gradients.cluster_loss(cluster_idx, centers, cov_inv, overlap_bounds)#
Compute the overlap loss for a reference cluster.
- repliclust.overlap.gradients.compute_other_cluster_idx(cluster_idx, k)#
Compute other cluster indices.
- Parameters
cluster_idx (int) – Cluster index to exclude.
k (int) – Number of clusters.
- Returns
out – All cluster indices except for cluster_idx.
- Return type
list of int
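The described behavior can be reproduced with a short stand-in. The function name other_cluster_indices is hypothetical and only mirrors the documented inputs and output.

```python
def other_cluster_indices(cluster_idx, k):
    """Return all indices in range(k) except cluster_idx (illustrative stand-in)."""
    return [i for i in range(k) if i != cluster_idx]

others = other_cluster_indices(2, 5)
print(others)  # [0, 1, 3, 4]
```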
- repliclust.overlap.gradients.compute_overlaps_vectorized(mharsum_vec, p)#
Compute overlaps between a reference cluster and all other clusters.
- Parameters
mharsum_vec (ndarray, shape (1, k-1)) – Harmonic sum of Mahalanobis distances.
p (int) – Dimensionality of the clusters / degrees of freedom for the chi-square distribution.
- Returns
out – Overlaps between reference cluster and all other clusters.
- Return type
ndarray, shape (1, k-1)
- repliclust.overlap.gradients.cubicterm_vectorized(mharsum_vec)#
Compute the inverse cubic term of the overlap gradients with respect to a reference cluster center vs all other centers (vectorized).
- Parameters
mharsum_vec (ndarray, shape (1, k-1)) – Harmonic sum of Mahalanobis distances (vectorized).
- Returns
out – Inverse cubic term of the overlap gradients.
- Return type
ndarray, shape (1, k-1)
- repliclust.overlap.gradients.gradient_vectorized(diff_mat=None, diff_tf_mat_1=None, diff_tf_mat_2=None, mdist_vec_1=None, mdist_vec_2=None, mharsum_vec=None, mode='overlap')#
Compute the gradient of overlaps of a reference cluster with all other k-1 clusters.
- Parameters
diff_mat (ndarray, shape (p, k-1)) – Matrix of differences between reference cluster center and the other k-1 cluster centers.
diff_tf_mat_1 (ndarray, shape (p, k-1)) – Same as diff_mat, except each column is left-multiplied by inverse covariance matrix of reference cluster.
diff_tf_mat_2 (ndarray, shape (p, k-1)) – Same as diff_mat, except each column is left-multiplied by inverse covariance matrix of corresponding OTHER cluster.
- Returns
out – Gradient vectors of the reference cluster’s overlap with the other clusters, with respect to the reference center. The j-th column of this matrix is the derivative of the overlap between the reference cluster and the j-th other cluster with respect to the reference center. To get the derivative of the same quantity with respect to the centers of the OTHER clusters, simply multiply the output by -1.
- Return type
ndarray, shape (p, k-1)
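The sign-flip remark holds for any quantity that depends on the two centers only through their difference. The sketch below checks this numerically, using the Mahalanobis distance as a stand-in for the overlap (the exact overlap formula is not stated here, so this is an illustration rather than the library's gradient). The helper names mahalanobis and numerical_grad are hypothetical.

```python
import numpy as np

def mahalanobis(c_ref, c_other, cov_inv):
    """Mahalanobis distance between two centers for a given inverse covariance."""
    d = c_ref - c_other
    return np.sqrt(d @ cov_inv @ d)

def numerical_grad(f, x, eps=1e-6):
    """Central finite-difference gradient of a scalar function f at x."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = eps
        grad[i] = (f(x + step) - f(x - step)) / (2 * eps)
    return grad

c_ref = np.array([1.0, 2.0])
c_other = np.array([-0.5, 0.5])
cov_inv = np.array([[2.0, 0.3], [0.3, 1.0]])  # symmetric positive definite

# Gradient with respect to the reference center and with respect to the other center.
grad_ref = numerical_grad(lambda c: mahalanobis(c, c_other, cov_inv), c_ref)
grad_other = numerical_grad(lambda c: mahalanobis(c_ref, c, cov_inv), c_other)

# Closed form for the reference-center gradient: cov_inv @ d / distance.
analytic = cov_inv @ (c_ref - c_other) / mahalanobis(c_ref, c_other, cov_inv)
```

Here grad_other equals -grad_ref up to finite-difference error, which is the relationship the return description relies on.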
- repliclust.overlap.gradients.harsum_vectorized(X, Y)#
Compute harmonic sum (vectorized).
- Parameters
X (ndarray, shape (1, k-1)) – Input matrix.
Y (ndarray, shape (1, k-1)) – Input matrix.
- Returns
out – Harmonic sum of X and Y.
- Return type
ndarray, shape (1, k-1)
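The term "harmonic sum" is not defined on this page. A minimal sketch, assuming it means the elementwise quantity X*Y/(X+Y) (equivalently 1/(1/X + 1/Y)), might look as follows. The function name harmonic_sum is hypothetical, and the definition should be checked against the library source.

```python
import numpy as np

def harmonic_sum(X, Y):
    """Elementwise X*Y/(X+Y). Assumed definition, not confirmed by the docs."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    return (X * Y) / (X + Y)

# One entry per "other" cluster, matching the documented shape (1, k-1).
X = np.array([[2.0, 4.0, 1.0]])
Y = np.array([[2.0, 4.0, 3.0]])
hs = harmonic_sum(X, Y)
```

Under this assumption the result is never larger than either input, so the smaller of the two Mahalanobis distances dominates.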
- repliclust.overlap.gradients.make_mahalanobis_args(diff_mat, diff_tf_mat_1, diff_tf_mat_2)#
Compute Mahalanobis quantities for use in other functions.
- repliclust.overlap.gradients.make_mharsum_vec(cluster_idx, centers, cov_inv)#
Compute harmonic sum of Mahalanobis distances from centers and inverse covariance matrices.
- repliclust.overlap.gradients.make_premahalanobis_args(cluster_idx, other_cluster_idx, centers, cov_inv)#
Compute quantities needed in other functions: differences between cluster centers, differences transformed by the reference cluster's inverse covariance matrix, and differences transformed by the corresponding other clusters' inverse covariance matrices.
- Parameters
cluster_idx (int) – Index of reference cluster.
other_cluster_idx (list of int) – List of other cluster indices.
centers (ndarray) – Matrix of cluster centers. Each row is a center.
cov_inv (list of ndarray) – List of inverse covariance matrices.
- Returns
out – Provide quantities useful for downstream computations.
- Return type
dict with keys ‘diff_mat’, ‘diff_tf_mat_1’, ‘diff_tf_mat_2’
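The three returned quantities can be sketched with NumPy as follows. This is an illustration based on the parameter descriptions, not the library's code. The function name premahalanobis_sketch is hypothetical, and the sign convention (reference center minus other center) is an assumption.

```python
import numpy as np

def premahalanobis_sketch(cluster_idx, other_cluster_idx, centers, cov_inv):
    """Build difference matrices for one reference cluster vs. all others."""
    ref = centers[cluster_idx]
    # Column j is (reference center - j-th other center); sign convention assumed.
    diff_mat = (ref[np.newaxis, :] - centers[other_cluster_idx]).T  # shape (p, k-1)
    # Every column transformed by the reference cluster's inverse covariance.
    diff_tf_mat_1 = cov_inv[cluster_idx] @ diff_mat
    # Column j transformed by the j-th OTHER cluster's inverse covariance.
    diff_tf_mat_2 = np.column_stack(
        [cov_inv[j] @ diff_mat[:, col] for col, j in enumerate(other_cluster_idx)]
    )
    return {
        "diff_mat": diff_mat,
        "diff_tf_mat_1": diff_tf_mat_1,
        "diff_tf_mat_2": diff_tf_mat_2,
    }

centers = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])  # each row is a center
cov_inv = [np.eye(2), 4.0 * np.eye(2), np.diag([1.0, 0.25])]
args = premahalanobis_sketch(0, [1, 2], centers, cov_inv)
```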
- repliclust.overlap.gradients.mdist_vectorized(diff_mat, diff_tf_mat)#
Compute Mahalanobis distance (vectorized).
- Parameters
diff_mat (ndarray, shape (p, k-1)) – Differences between pairs of distinct cluster centers, one vs all. The j-th column is the difference vector between the reference cluster center and the j-th other cluster center.
diff_tf_mat (ndarray, shape (p, k-1)) – Differences between pairs of distinct cluster centers left multiplied by the appropriate inverse covariance matrices.
- Returns
mdist – Mahalanobis distances between reference cluster and other clusters.
- Return type
ndarray, shape (1, k-1)
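Given the documented shapes, the Mahalanobis distance for each column can be computed as the square root of the column-wise sum of diff_mat * diff_tf_mat. The sketch below assumes that reading; the function name mahalanobis_columns is hypothetical.

```python
import numpy as np

def mahalanobis_columns(diff_mat, diff_tf_mat):
    """Column-wise sqrt(d^T @ cov_inv @ d), returned with shape (1, k-1)."""
    return np.sqrt(np.sum(diff_mat * diff_tf_mat, axis=0, keepdims=True))

# Two difference vectors stored as columns: (0, 4) and (3, 0).
diff_mat = np.array([[0.0, 3.0], [4.0, 0.0]])
cov_inv = np.diag([1.0, 0.25])
diff_tf_mat = cov_inv @ diff_mat  # left-multiply each column by cov_inv
mdist = mahalanobis_columns(diff_mat, diff_tf_mat)
print(mdist)  # [[2. 3.]]
```

The elementwise product followed by a column sum avoids forming the full (k-1, k-1) matrix that diff_mat.T @ diff_tf_mat would produce.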
- repliclust.overlap.gradients.squareterm_vectorized(mharsum_vec)#
Compute the inverse square term.
- repliclust.overlap.gradients.summandterm_vectorized(mdist_vec, diff_tf_mat)#
Compute summand term by broadcasting.
- Parameters
mdist_vec (ndarray, shape (1, k-1)) – The vectorized Mahalanobis distances.
diff_tf_mat (ndarray, shape (p, k-1)) – Differences between pairs of distinct cluster centers left multiplied by the appropriate inverse covariance matrices.
- Returns
out – Summand term involved in computing the overlap gradient.
- Return type
ndarray, shape (p, k-1)
- repliclust.overlap.gradients.total_loss(centers, cov_inv, overlap_bounds)#
Compute the total overlap loss.
- repliclust.overlap.gradients.update_centers(cluster_idx, centers, cov_inv, learning_rate, overlap_bounds)#
Perform an iteration of stochastic gradient descent on the cluster centers.
- Parameters
cluster_idx (int) – Index of reference cluster (for stochastic gradient descent).
centers (ndarray, shape (k, p)) – Matrix of all cluster centers. Each row is a center.
cov_inv (list of ndarray; length k, each ndarray of shape (p, p)) – List of inverse covariance matrices.
learning_rate (float) – Learning rate for gradient descent.
overlap_bounds (dict with keys 'min' and 'max') – Minimum and maximum allowed overlaps between clusters.
- Side effects
Update centers by taking a stochastic gradient descent step.