skhubness.analysis API
skhubness.analysis Package
The skhubness.analysis package provides methods for measuring hubness.
- class skhubness.analysis.Hubness(k: int = 10, hub_size: float = 2.0, metric='euclidean', k_neighbors: bool = False, k_occurrence: bool = False, verbose: int = 0, n_jobs: int = 1, random_state=None, shuffle_equal: bool = True, **kwargs)
Hubness characteristics of a data set.
- Parameters
- k : int
Neighborhood size.
- hub_size : float
Hubs are defined as objects with k-occurrence > hub_size * k.
- metric : string, one of ['euclidean', 'cosine', 'precomputed']
Metric to use for distance computation. Currently, only Euclidean, cosine, and precomputed distances are supported.
- k_neighbors : bool
Whether to save the k-neighbor lists. Requires O(n_test * k) memory.
- k_occurrence : bool
Whether to save the k-occurrence. Requires O(n_test) memory.
- random_state : int, RandomState instance or None, optional
CURRENTLY IGNORED. If int, random_state is the seed used by the random number generator; if RandomState instance, random_state is the random number generator; if None, the random number generator is the RandomState instance used by np.random.
- shuffle_equal : bool, optional
If True and metric='precomputed', shuffle neighbors with identical distances to avoid artifact hubness. NOTE: This is especially useful for secondary distance measures with a finite number of possible values, e.g. SNN or MP empiric.
- n_jobs : int, optional
CURRENTLY IGNORED. Number of processes for parallel computations. 1: don't use multiprocessing; -1: use all CPUs.
- verbose : int, optional
Level of output messages.
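The roles of k and hub_size can be illustrated with a minimal pure-Python sketch. It is not the library's implementation: the helper names k_occurrence_from_neighbors and find_hubs are hypothetical, the neighbor lists are toy data, and plain lists stand in for ndarrays.

```python
from collections import Counter


def k_occurrence_from_neighbors(k_neighbors, n_objects):
    """Count how often each object appears in the k-neighbor lists."""
    counts = Counter(idx for neighbors in k_neighbors for idx in neighbors)
    return [counts.get(i, 0) for i in range(n_objects)]


def find_hubs(k_occurrence, k, hub_size=2.0):
    """Return indices of objects with k-occurrence > hub_size * k."""
    return [i for i, occ in enumerate(k_occurrence) if occ > hub_size * k]


# Toy neighbor lists for 5 objects with k = 1 (hypothetical data):
# every object except object 1 has object 1 as its nearest neighbor.
k_neighbors = [[1], [0], [1], [1], [1]]
k_occ = k_occurrence_from_neighbors(k_neighbors, n_objects=5)
hubs = find_hubs(k_occ, k=1, hub_size=2.0)
```

With these toy lists, object 1 is the only object whose k-occurrence exceeds hub_size * k, so it is the only hub.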
References
- Ra56b19eecc1a-1
Radovanović, M.; Nanopoulos, A. & Ivanovic, M. Hubs in space: Popular nearest neighbors in high-dimensional data. Journal of Machine Learning Research, 2010, 11, 2487-2531
- Ra56b19eecc1a-2
Feldbauer, R.; Leodolter, M.; Plant, C. & Flexer, A. Fast approximate hubness reduction for large high-dimensional data. IEEE International Conference on Big Knowledge (2018).
- Attributes
- k_skewness_ : float
Hubness, measured as skewness of k-occurrence histogram [Ra56b19eecc1a-1]
- k_skewness_truncnorm_ : float
Hubness, measured as skewness of a truncated normal distribution fitted to the k-occurrence histogram
- atkinson_index_ : float
Hubness, measured as the Atkinson index of k-occurrence distribution
- gini_index_ : float
Hubness, measured as the Gini index of k-occurrence distribution
- robinhood_index_ : float
Hubness, measured as Robin Hood index of k-occurrence distribution [Ra56b19eecc1a-2]
- antihubs_ : int
Indices to antihubs
- antihub_occurrence_ : float
Proportion of antihubs in data set
- hubs_ : int
Indices to hubs
- hub_occurrence_ : float
Proportion of k-nearest neighbor slots occupied by hubs
- groupie_ratio_ : float
Proportion of objects with the largest hub in their neighborhood
- k_occurrence_ : ndarray
Reverse neighbor count for each object
- k_neighbors_ : ndarray
Indices to k-nearest neighbors for each object
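As an illustration of k-occurrence skewness, the following sketch computes the population (biased) skewness, i.e. the third central moment divided by the cubed standard deviation. Whether the library uses this exact estimator is an assumption; the function name k_skewness_sketch is hypothetical.

```python
def k_skewness_sketch(k_occurrence):
    """Population skewness: third central moment over variance ** 1.5."""
    n = len(k_occurrence)
    mean = sum(k_occurrence) / n
    m2 = sum((x - mean) ** 2 for x in k_occurrence) / n
    m3 = sum((x - mean) ** 3 for x in k_occurrence) / n
    if m2 == 0:
        # Assumption of this sketch: a constant distribution counts as unskewed.
        return 0.0
    return m3 / m2 ** 1.5


symmetric = k_skewness_sketch([1, 2, 3])       # no hubness
right_skewed = k_skewness_sketch([0, 0, 0, 4])  # one object collects all slots
```

A symmetric k-occurrence distribution yields zero skewness, while a distribution dominated by a single frequently chosen object yields a positive value.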
- static antihub_occurrence(k_occurrence: numpy.ndarray) → (numpy.ndarray, float)
Proportion of antihubs in data set.
Antihubs are objects that are never among the nearest neighbors of other objects.
- Parameters
- k_occurrence : ndarray
Reverse nearest neighbor count for each object.
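The antihub definition above translates directly into a short sketch. This is an illustration, not the library's code; the name antihub_occurrence_sketch and the (indices, proportion) return shape are assumptions based on the signature and the attribute descriptions.

```python
def antihub_occurrence_sketch(k_occurrence):
    """Return (antihub indices, proportion of antihubs)."""
    # Antihubs never appear in any neighbor list: their k-occurrence is 0.
    antihubs = [i for i, occ in enumerate(k_occurrence) if occ == 0]
    return antihubs, len(antihubs) / len(k_occurrence)


antihubs, proportion = antihub_occurrence_sketch([1, 4, 0, 0, 0])
```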
- static atkinson_index(k_occurrence: numpy.ndarray, eps: float = 0.5) → float
Hubness measure; Atkinson index.
- Parameters
- k_occurrence : ndarray
Reverse nearest neighbor count for each object.
- eps : float, default = 0.5
‘Income’ weight. Turns the index into a normative measure.
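For orientation, the textbook Atkinson index can be sketched as follows. It is one minus the ratio of the "equally distributed equivalent" value to the arithmetic mean. The function name atkinson_index_sketch is hypothetical, and the sketch assumes a non-zero mean and 0 <= eps <= 1; it is not claimed to match the library's implementation line by line.

```python
import math


def atkinson_index_sketch(k_occurrence, eps=0.5):
    """Textbook Atkinson index of a non-negative distribution."""
    n = len(k_occurrence)
    mean = sum(k_occurrence) / n
    if eps == 1:
        # Limit case: the equally distributed equivalent is the geometric mean.
        ede = math.prod(k_occurrence) ** (1.0 / n)
    else:
        # Generalized (power) mean with exponent 1 - eps.
        ede = (sum(x ** (1 - eps) for x in k_occurrence) / n) ** (1 / (1 - eps))
    return 1.0 - ede / mean


equal = atkinson_index_sketch([3, 3, 3, 3])
unequal = atkinson_index_sketch([0, 0, 0, 4])
```

Larger eps values weight the low end of the distribution (rarely chosen objects) more heavily.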
-
estimate
(self, X: numpy.ndarray, Y: numpy.ndarray = None, has_self_distances: bool = False)[source]¶ Estimate hubness in a data set.
Hubness is estimated from the distances between all objects in X to all objects in Y. If Y is None, all-against-all distances between the objects in X are used. If self.metric == ‘precomputed’, X must be a distance matrix.
- Parameters
- Xndarray, shape (n_query, n_features) or (n_query, n_indexed)
Array of query vectors, or distance, if self.metric == ‘precomputed’
- Yndarray, shape (n_indexed, n_features) or None
Array of indexed vectors. If None, calculate distance between all pairs of objects in X.
- has_self_distancesbool, default = False
Define, whether a precomputed distance matrix contains self distances, which need to be excluded.
- Returns
- selfHubness
An instance of class Hubness is returned. Hubness indices are provided as attributes (e.g.
robinhood_index_()
).
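To show why self distances matter for precomputed matrices, here is a minimal sketch of neighbor selection from a square distance matrix. It assumes the self distance of object i sits on the diagonal at position (i, i); the helper name k_neighbors_from_distances is hypothetical and the sketch does not reproduce the library's tie handling (shuffle_equal).

```python
def k_neighbors_from_distances(dist, k, has_self_distances=False):
    """Return the indices of the k nearest indexed objects per query row."""
    neighbors = []
    for i, row in enumerate(dist):
        candidates = [
            j for j in range(len(row))
            if not (has_self_distances and j == i)  # skip the self distance
        ]
        # Stable sort: candidates with equal distances keep index order.
        candidates.sort(key=lambda j: row[j])
        neighbors.append(candidates[:k])
    return neighbors


# Symmetric toy distance matrix with zeros on the diagonal (hypothetical data).
dist = [
    [0.0, 1.0, 4.0],
    [1.0, 0.0, 2.0],
    [4.0, 2.0, 0.0],
]
knn = k_neighbors_from_distances(dist, k=1, has_self_distances=True)
knn_with_self = k_neighbors_from_distances(dist, k=1, has_self_distances=False)
```

If the self distances are not excluded, every object becomes its own nearest neighbor, which distorts the k-occurrence counts.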
- static gini_index(k_occurrence: numpy.ndarray, limiting='memory') → float
Hubness measure; Gini index.
- Parameters
- k_occurrence : ndarray
Reverse nearest neighbor count for each object.
- limiting : 'memory' or 'cpu'
If 'cpu', use a fast implementation with high memory usage; if 'memory', use a slightly slower but memory-efficient implementation; otherwise, use a naive implementation (slow, low memory usage).
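The time and memory trade-off can be illustrated with two textbook formulations of the Gini index that give the same result. These are sketches with hypothetical names (gini_pairwise, gini_sorted); how they map onto the library's 'cpu', 'memory', and naive code paths is not specified here.

```python
def gini_pairwise(k_occurrence):
    """Gini index from all pairwise absolute differences, O(n^2) time."""
    n = len(k_occurrence)
    total = sum(k_occurrence)
    diff_sum = sum(abs(a - b) for a in k_occurrence for b in k_occurrence)
    return diff_sum / (2 * n * total)


def gini_sorted(k_occurrence):
    """Equivalent Gini index from rank-weighted sorted values, O(n log n) time."""
    n = len(k_occurrence)
    total = sum(k_occurrence)
    weighted = sum(
        rank * x for rank, x in enumerate(sorted(k_occurrence), start=1)
    )
    return 2 * weighted / (n * total) - (n + 1) / n


g_pairwise = gini_pairwise([0, 0, 0, 4])
g_sorted = gini_sorted([0, 0, 0, 4])
```

Both functions assume a non-zero total k-occurrence.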
- static hub_occurrence(k: int, k_occurrence: numpy.ndarray, n_test: int, hub_size: float = 2)
Proportion of nearest neighbor slots occupied by hubs.
- Parameters
- k : int
Specifies the number of nearest neighbors.
- k_occurrence : ndarray
Reverse nearest neighbor count for each object.
- n_test : int
Number of queries (or objects in a test set).
- hub_size : float
Factor to determine hubs.
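A minimal sketch of this measure, assuming the hub definition given for the class (k-occurrence > hub_size * k) and k * n_test neighbor slots in total. The name hub_occurrence_sketch and the (indices, share) return shape are assumptions of this example.

```python
def hub_occurrence_sketch(k, k_occurrence, n_test, hub_size=2.0):
    """Return (hub indices, share of the k * n_test neighbor slots they fill)."""
    hubs = [i for i, occ in enumerate(k_occurrence) if occ > hub_size * k]
    slots_filled = sum(k_occurrence[i] for i in hubs)
    return hubs, slots_filled / (k * n_test)


hubs, share = hub_occurrence_sketch(k=1, k_occurrence=[1, 4, 0, 0, 0], n_test=5)
```

Here five queries with k = 1 provide five neighbor slots, four of which are filled by the single hub.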
- static robinhood_index(k_occurrence: numpy.ndarray) → float
Hubness measure; Robin Hood/Hoover/Schutz index.
- Parameters
- k_occurrence : ndarray
Reverse nearest neighbor count for each object.
Notes
The Robin Hood index was proposed in [1] and is especially suited for hubness estimation in large data sets. Additionally, it offers straightforward interpretability by answering the question: What share of k-occurrence must be redistributed so that all objects are equally often nearest neighbors to others?
References
- [1]
Feldbauer, R.; Leodolter, M.; Plant, C. & Flexer, A. Fast approximate hubness reduction for large high-dimensional data. IEEE International Conference on Big Knowledge (2018).
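The redistribution interpretation can be checked with a short sketch of the standard Hoover (Robin Hood) index: half the summed absolute deviation from the mean, divided by the total. The function name robinhood_index_sketch is hypothetical, and a non-zero total k-occurrence is assumed.

```python
def robinhood_index_sketch(k_occurrence):
    """Hoover index: half the summed absolute deviation from the mean, over the total."""
    n = len(k_occurrence)
    total = sum(k_occurrence)
    mean = total / n
    return 0.5 * sum(abs(x - mean) for x in k_occurrence) / total


rh = robinhood_index_sketch([1, 4, 0, 0, 0])
```

In this toy case, three of the five neighbor slots held by object 1 would have to move to the three antihubs to equalize the distribution, which corresponds to a share of 3/5.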
- static skewness_truncnorm(k_occurrence: numpy.ndarray) → float
Hubness measure; corrected for non-negativity of k-occurrence.
Hubness as skewness of truncated normal distribution estimated from k-occurrence histogram.
- Parameters
- k_occurrence : ndarray
Reverse nearest neighbor count for each object.