repliclust.overlap#

repliclust.overlap.centers#

This module implements a ClusterCenterSampler that places cluster centers by minimizing an objective function until the clusters reach the desired degree of pairwise overlap.

class repliclust.overlap.centers.ConstrainedOverlapCenters(max_overlap=0.1, min_overlap=0.09, packing=0.1, **optimization_args)#

Bases: ClusterCenterSampler

This class provides an implementation for optimizing the location of cluster centers to achieve the desired degrees of overlap between pairs of clusters.

Parameters
  • max_overlap (float between 0 and 1) – The maximum allowed overlap between any two clusters, measured as a fraction of cluster mass.

  • min_overlap (float) – The minimum overlap each cluster must have with at least one other cluster, which prevents it from being isolated. The overlap is measured as a fraction of cluster mass.

  • packing (float) – Sets the ratio of total cluster volume to the sampling volume. Used when choosing random cluster centers for initializing the optimization.

  • learning_rate (float) – The rate at which cluster centers are optimized. If numerical instabilities appear, lower this value.

  • max_epoch (int) – The maximum number of optimization epochs to run. Larger values allow more iterations and may increase running time.

  • tol (float) – Numerical tolerance for achieving the desired overlap between pairs of clusters.
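
To make the packing parameter concrete, the sketch below turns a packing ratio into the radius of a sampling region. It is only an illustration of the stated ratio (total cluster volume divided by sampling volume). The spherical cluster and sampling shapes and the helper names ball_volume and sampling_radius are assumptions made here, not part of repliclust.

```python
import math

def ball_volume(radius, dim):
    """Volume of a dim-dimensional Euclidean ball of the given radius."""
    return math.pi ** (dim / 2) / math.gamma(dim / 2 + 1) * radius ** dim

def sampling_radius(cluster_radii, dim, packing=0.1):
    """Radius of a ball whose volume equals total cluster volume / packing.

    Assumption: clusters and the sampling region are both treated as balls.
    """
    total_cluster_volume = sum(ball_volume(r, dim) for r in cluster_radii)
    sampling_volume = total_cluster_volume / packing
    return (sampling_volume / ball_volume(1.0, dim)) ** (1 / dim)

# Four unit discs in 2D with packing=0.1: the sampling disc has 10x their total area.
radius = sampling_radius([1.0, 1.0, 1.0, 1.0], dim=2, packing=0.1)
print(round(radius, 4))  # 6.3246, i.e. sqrt(40)
```

Under these assumptions, a smaller packing value spreads the initial centers over a larger region, so the optimization starts from more widely separated clusters.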

sample_cluster_centers(archetype, print_progress=False)#

Sample cluster centers at random and iteratively adjust them until the desired degrees of overlap between clusters are satisfied.

Parameters
  • archetype (Archetype) – Archetype conveying the desired number of clusters and other attributes.

  • print_progress (bool) – If true, print the progress during optimization.

Returns

centers – The optimized cluster centers.

Return type

ndarray

repliclust.overlap.gradients#

repliclust.overlap.gradients.assess_obs_overlap(centers, cov_inv)#

Assess the observed minimum and maximum overlap between clusters.

repliclust.overlap.gradients.chi2term_vectorized(mharsum_vec, p)#

Compute the chi2 term of the overlap gradients with respect to a reference cluster center vs all other centers (vectorized).

Parameters
  • mharsum_vec (ndarray, shape (1, k-1)) – Harmonic sum of Mahalanobis distances (vectorized).

  • p (int) – Dimensionality of the data (degrees of freedom for chi2).

Returns

out – Chi2(p) density evaluated at the appropriate quantile cutoffs.

Return type

ndarray, shape (1, k-1)

repliclust.overlap.gradients.cluster_loss(cluster_idx, centers, cov_inv, overlap_bounds)#

Compute the overlap loss for a reference cluster.

repliclust.overlap.gradients.compute_other_cluster_idx(cluster_idx, k)#

Compute the indices of all clusters other than cluster_idx.

Parameters
  • cluster_idx (int) – Cluster index to exclude.

  • k (int) – Number of clusters.

Returns

out – All cluster indices except for cluster_idx.

Return type

list of int
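
The behavior described above can be sketched in plain Python. The function name other_cluster_indices is hypothetical and chosen to avoid implying that this is the library's own implementation.

```python
def other_cluster_indices(cluster_idx, k):
    """Return all indices in range(k) except cluster_idx (illustrative sketch)."""
    return [i for i in range(k) if i != cluster_idx]

others = other_cluster_indices(2, 5)
print(others)  # [0, 1, 3, 4]
```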

repliclust.overlap.gradients.compute_overlaps_vectorized(mharsum_vec, p)#

Compute overlaps between a reference cluster and all other clusters.

Parameters
  • mharsum_vec (ndarray, shape (1, k-1)) – Harmonic sum of Mahalanobis distances.

  • p (int) – Dimensionality of the clusters / degrees of freedom for the chi-square distribution.

Returns

out – Overlaps between reference cluster and all other clusters.

Return type

ndarray, shape (1, k-1)

repliclust.overlap.gradients.cubicterm_vectorized(mharsum_vec)#

Compute the inverse cubic term of the overlap gradients with respect to a reference cluster center vs all other centers (vectorized).

Parameters
  • mharsum_vec (ndarray, shape (1, k-1)) – Harmonic sum of Mahalanobis distances (vectorized).

Returns

out – Inverse cubic term of the overlap gradients.

Return type

ndarray, shape (1, k-1)

repliclust.overlap.gradients.gradient_vectorized(diff_mat=None, diff_tf_mat_1=None, diff_tf_mat_2=None, mdist_vec_1=None, mdist_vec_2=None, mharsum_vec=None, mode='overlap')#

Compute the gradient of overlaps of a reference cluster with all other k-1 clusters.

Parameters
  • diff_mat (ndarray, shape (p, k-1)) – Matrix of differences between reference cluster center and the other k-1 cluster centers.

  • diff_tf_mat_1 (ndarray, shape (p, k-1)) – Same as diff_mat, except each column is left-multiplied by inverse covariance matrix of reference cluster.

  • diff_tf_mat_2 (ndarray, shape (p, k-1)) – Same as diff_mat, except each column is left-multiplied by inverse covariance matrix of corresponding OTHER cluster.

Returns

out – Gradient vectors of the reference cluster’s overlap with the other clusters, with respect to the reference center. The j-th column of this matrix is the derivative of the overlap between the reference cluster and the j-th other cluster with respect to the reference center. To get the derivative of the same quantity with respect to the centers of the OTHER clusters, simply multiply the output by -1.

Return type

ndarray, shape (p, k-1)
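
The sign rule for the OTHER clusters follows because the overlap depends on the two centers only through their difference. The sketch below checks this numerically with a stand-in function. toy_overlap is not the library's overlap formula; it is an assumed placeholder that, like the real overlap, depends only on the difference between centers.

```python
import numpy as np

def toy_overlap(center_ref, center_other):
    """Placeholder overlap: any smooth function of the center difference."""
    d = center_ref - center_other
    return np.exp(-0.5 * d @ d)

def numerical_gradient(f, x, eps=1e-6):
    """Central finite-difference gradient of scalar function f at x."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = eps
        grad[i] = (f(x + step) - f(x - step)) / (2 * eps)
    return grad

ref = np.array([0.5, -1.0])
other = np.array([1.5, 0.25])

# Gradient with respect to the reference center, then the other center.
grad_ref = numerical_gradient(lambda x: toy_overlap(x, other), ref)
grad_other = numerical_gradient(lambda x: toy_overlap(ref, x), other)

print(np.allclose(grad_ref, -grad_other, atol=1e-8))  # True
```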

repliclust.overlap.gradients.harsum_vectorized(X, Y)#

Compute harmonic sum (vectorized).

Parameters
  • X (ndarray, shape (1, k-1)) – Input matrix.

  • Y (ndarray, shape (1, k-1)) – Input matrix.

Returns

out – Harmonic sum of X and Y.

Return type

ndarray, shape (1, k-1)
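
As an illustration, the sketch below assumes that "harmonic sum" means the elementwise quantity 1 / (1/x + 1/y); the source does not spell out the formula, so treat this as one plausible reading. The name harmonic_sum is hypothetical.

```python
import numpy as np

def harmonic_sum(x, y):
    """Elementwise 1 / (1/x + 1/y); assumed definition, inputs must be nonzero."""
    return 1.0 / (1.0 / x + 1.0 / y)

# Row vectors of shape (1, k-1), here with k-1 = 3.
x = np.array([[2.0, 4.0, 1.0]])
y = np.array([[2.0, 4.0, 3.0]])
h = harmonic_sum(x, y)
print(h)  # [[1.   2.   0.75]]
```

Under this reading, if x and y are the Mahalanobis distances between two centers measured with each cluster's own covariance, the result is the common Mahalanobis distance of the point on the connecting segment that is equally far from both centers.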

repliclust.overlap.gradients.make_mahalanobis_args(diff_mat, diff_tf_mat_1, diff_tf_mat_2)#

Compute Mahalanobis quantities for use in other functions.

repliclust.overlap.gradients.make_mharsum_vec(cluster_idx, centers, cov_inv)#

Compute harmonic sum of Mahalanobis distances from centers and inverse covariance matrices.

repliclust.overlap.gradients.make_premahalanobis_args(cluster_idx, other_cluster_idx, centers, cov_inv)#

Compute some quantities needed in other functions: differences between cluster centers, differences transformed by the reference cluster’s inverse covariance matrix, and differences transformed by the corresponding other clusters’ inverse covariance matrices.

Parameters
  • cluster_idx (int) – Index of reference cluster.

  • other_cluster_idx (list of int) – List of other cluster indices.

  • centers (ndarray) – Matrix of cluster centers. Each row is a center.

  • cov_inv (list of ndarray) – List of inverse covariance matrices.

Returns

out – Quantities useful for downstream computations.

Return type

dict with keys ‘diff_mat’, ‘diff_tf_mat_1’, ‘diff_tf_mat_2’
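
The three quantities can be sketched with NumPy as follows. This is an assumption-laden illustration, not the library code: the function name premahalanobis_sketch is hypothetical, and the sign convention (reference center minus other center) is a guess based on the parameter descriptions.

```python
import numpy as np

def premahalanobis_sketch(cluster_idx, other_cluster_idx, centers, cov_inv):
    """Build difference matrices of shape (p, k-1) for one reference cluster."""
    ref = centers[cluster_idx]
    # Column j holds (reference center - j-th other center).
    diff_mat = (ref - centers[other_cluster_idx]).T
    # Every column transformed by the reference cluster's inverse covariance.
    diff_tf_mat_1 = cov_inv[cluster_idx] @ diff_mat
    # Each column transformed by the matching OTHER cluster's inverse covariance.
    diff_tf_mat_2 = np.column_stack(
        [cov_inv[j] @ diff_mat[:, col] for col, j in enumerate(other_cluster_idx)]
    )
    return {
        "diff_mat": diff_mat,
        "diff_tf_mat_1": diff_tf_mat_1,
        "diff_tf_mat_2": diff_tf_mat_2,
    }

centers = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 4.0]])  # k=3 rows, p=2
cov_inv = [np.eye(2), 0.5 * np.eye(2), 0.25 * np.eye(2)]
args = premahalanobis_sketch(0, [1, 2], centers, cov_inv)
print(args["diff_tf_mat_2"])
```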

repliclust.overlap.gradients.mdist_vectorized(diff_mat, diff_tf_mat)#

Compute Mahalanobis distance (vectorized).

Parameters
  • diff_mat (ndarray, shape (p, k-1)) – Differences between pairs of distinct cluster centers, one vs all. The j-th column is the difference vector between the reference cluster center and the j-th other cluster center.

  • diff_tf_mat (ndarray, shape (p, k-1)) – Differences between pairs of distinct cluster centers left multiplied by the appropriate inverse covariance matrices.

Returns

mdist – Mahalanobis distances between reference cluster and other clusters.

Return type

ndarray, shape (1, k-1)
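
Given the two inputs described above, a column-wise Mahalanobis distance can be sketched as below. The name mahalanobis_columns is hypothetical, and the sketch assumes the result is the distance itself rather than the squared distance.

```python
import numpy as np

def mahalanobis_columns(diff_mat, diff_tf_mat):
    """sqrt(d^T A d) per column, where diff_tf_mat already holds A @ d."""
    return np.sqrt(np.sum(diff_mat * diff_tf_mat, axis=0, keepdims=True))

cov_inv = np.diag([1.0, 0.25])           # inverse covariance, p = 2
diff_mat = np.array([[3.0, 0.0],
                     [0.0, 4.0]])        # two difference vectors as columns
diff_tf_mat = cov_inv @ diff_mat
mdist = mahalanobis_columns(diff_mat, diff_tf_mat)
print(mdist)  # [[3. 2.]]
```

Using keepdims=True keeps the (1, k-1) row shape so the result broadcasts against (p, k-1) matrices in later steps.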

repliclust.overlap.gradients.squareterm_vectorized(mharsum_vec)#

Compute the inverse square term.

repliclust.overlap.gradients.summandterm_vectorized(mdist_vec, diff_tf_mat)#

Compute summand term by broadcasting.

Parameters
  • mdist_vec (ndarray, shape (1, k-1)) – The vectorized Mahalanobis distances.

  • diff_tf_mat (ndarray, shape (p, k-1)) – Differences between pairs of distinct cluster centers left multiplied by the appropriate inverse covariance matrices.

Returns

out – Summand term involved in computing the overlap gradient.

Return type

ndarray, shape (p, k-1)

repliclust.overlap.gradients.total_loss(centers, cov_inv, overlap_bounds)#

Compute the total overlap loss.

repliclust.overlap.gradients.update_centers(cluster_idx, centers, cov_inv, learning_rate, overlap_bounds)#

Perform an iteration of stochastic gradient descent on the cluster centers.

Parameters
  • cluster_idx (int) – Index of reference cluster (for stochastic gradient descent).

  • centers (ndarray, shape (k, p)) – Matrix of all cluster centers. Each row is a center.

  • cov_inv (list of ndarray; length k, each ndarray of shape (p, p)) – List of inverse covariance matrices.

  • learning_rate (float) – Learning rate for gradient descent.

  • overlap_bounds (dict with keys 'min' and 'max') – Minimum and maximum allowed overlaps between clusters.

Side effects

Update centers by taking a stochastic gradient descent step.