skhubness.analysis.Hubness

class skhubness.analysis.Hubness(k: int = 10, return_value: str = 'k_skewness', hub_size: float = 2.0, metric='euclidean', store_k_neighbors: bool = False, store_k_occurrence: bool = False, algorithm: str = 'auto', algorithm_params: Optional[dict] = None, hubness: Optional[str] = None, hubness_params: Optional[dict] = None, verbose: int = 0, n_jobs: int = 1, random_state=None, shuffle_equal: bool = True)

Examine hubness characteristics of data.

Parameters
k: int, default = 10

Neighborhood size

return_value: str, default = “k_skewness”

Hubness measure returned by score(). By default, this is the skewness of the k-occurrence histogram. Use “all” to return a dict of all available measures, or check skhubness.analysis.VALID_HUBNESS_MEASURE for the available measures.

hub_size: float, default = 2.0

Hubs are defined as objects with k-occurrence > hub_size * k.

metric: string, one of [‘euclidean’, ‘cosine’, ‘precomputed’]

Metric to use for distance computation. Currently, only Euclidean, cosine, and precomputed distances are supported.

store_k_neighbors: bool

Whether to save the k-neighbor lists. Requires O(n_test * k) memory.

store_k_occurrence: bool

Whether to save the k-occurrence. Requires O(n_test) memory.

algorithm: {‘auto’, ‘hnsw’, ‘lsh’, ‘ball_tree’, ‘kd_tree’, ‘brute’}, optional

Algorithm used to compute the nearest neighbors:

  • ‘hnsw’ will use HNSW

  • ‘lsh’ will use FalconnLSH

  • ‘ball_tree’ will use BallTree

  • ‘kd_tree’ will use KDTree

  • ‘brute’ will use a brute-force search.

  • ‘auto’ will attempt to decide the most appropriate algorithm based on the values passed to the fit() method.

Note: fitting on sparse input will override the setting of this parameter, using brute force.

algorithm_params: dict, optional

Override default parameters of the NN algorithm. For example, with algorithm=‘lsh’ and algorithm_params={‘n_candidates’: 100}, one hundred approximate neighbors are retrieved with LSH. If the parameter hubness is set, the candidate neighbors are further reordered with hubness reduction. Finally, n_neighbors objects are used from the (optionally reordered) candidates.

hubness: {‘mutual_proximity’, ‘local_scaling’, ‘dis_sim_local’, None}, optional

Hubness reduction algorithm

  • ‘mutual_proximity’ or ‘mp’ will use MutualProximity

  • ‘local_scaling’ or ‘ls’ will use LocalScaling

  • ‘dis_sim_local’ or ‘dsl’ will use DisSimLocal

If None, no hubness reduction will be performed (i.e. vanilla kNN).

hubness_params: dict, optional

Override default parameters of the selected hubness reduction algorithm. For example, with hubness=‘mp’ and hubness_params={‘method’: ‘normal’}, a mutual proximity variant is used that models distance distributions with independent Gaussians.

random_state: int, RandomState instance or None, optional

If int, random_state is the seed used by the random number generator. If a RandomState instance, random_state is the random number generator. If None, the random number generator is the RandomState instance used by np.random.

shuffle_equal: bool, optional

If True and metric=‘precomputed’, shuffle neighbors with identical distances to avoid artifact hubness. Note: This is especially useful for secondary distance measures with a finite number of possible values, e.g. SNN or MP empiric.

n_jobs: int, optional

Number of processes for parallel computations.

  • 1: Don’t use multiprocessing.

  • -1: Use all CPUs.

Note that not all steps are currently parallelized.

verbose: int, optional

Level of output messages
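
The interplay of k and hub_size can be sketched with a few lines of plain Python. The example below counts k-occurrences from a toy set of k-neighbor lists and applies the criterion k-occurrence > hub_size * k given for hub_size. It does not call skhubness: the helper names and the toy data are assumptions made for illustration, and treating antihubs as objects with a k-occurrence of zero follows the common convention in the hubness literature rather than anything stated on this page.

```python
from collections import Counter


def k_occurrence_from_neighbors(k_neighbors, n_objects):
    """Count how often each object appears in any k-neighbor list."""
    counts = Counter(j for row in k_neighbors for j in row)
    return [counts.get(i, 0) for i in range(n_objects)]


def find_hubs(k_occurrence, k, hub_size=2.0):
    """Indices of objects with k-occurrence > hub_size * k."""
    return [i for i, n_k in enumerate(k_occurrence) if n_k > hub_size * k]


def find_antihubs(k_occurrence):
    """Indices of objects that appear in no k-neighbor list (assumed definition)."""
    return [i for i, n_k in enumerate(k_occurrence) if n_k == 0]


# Toy data (hypothetical): row i lists the k = 2 nearest neighbors of object i.
k = 2
k_neighbors = [
    [1, 2],
    [0, 2],
    [0, 1],
    [0, 2],
    [0, 3],
    [0, 4],
]

k_occ = k_occurrence_from_neighbors(k_neighbors, n_objects=len(k_neighbors))
hubs = find_hubs(k_occ, k=k, hub_size=2.0)
antihubs = find_antihubs(k_occ)

# Share of all neighbor slots filled by hubs, and share of antihubs in the data.
hub_occurrence = sum(k_occ[i] for i in hubs) / (k * len(k_neighbors))
antihub_occurrence = len(antihubs) / len(k_neighbors)

print(k_occ)     # [5, 2, 3, 1, 1, 0]
print(hubs)      # [0]
print(antihubs)  # [5]
```

With hub_size = 2.0 and k = 2, only object 0 (k-occurrence 5 > 4) qualifies as a hub. Lowering hub_size to 1.0 would also admit object 2.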

References

Ra56b19eecc1a-1

Radovanović, M.; Nanopoulos, A. & Ivanovic, M. Hubs in space: Popular nearest neighbors in high-dimensional data. Journal of Machine Learning Research, 2010, 11, 2487-2531

Ra56b19eecc1a-2

Feldbauer, R.; Leodolter, M.; Plant, C. & Flexer, A. Fast approximate hubness reduction for large high-dimensional data. IEEE International Conference on Big Knowledge (ICBK), 2018.

Attributes
k_skewness: float

Hubness, measured as skewness of k-occurrence histogram [Ra56b19eecc1a-1]

k_skewness_truncnorm: float

Hubness, measured as the skewness of a truncated normal distribution fitted to the k-occurrence histogram

atkinson_index: float

Hubness, measured as the Atkinson index of k-occurrence distribution

gini_index: float

Hubness, measured as the Gini index of k-occurrence distribution

robinhood_index: float

Hubness, measured as Robin Hood index of k-occurrence distribution [Ra56b19eecc1a-2]

antihubs: int

Indices of antihubs

antihub_occurrence: float

Proportion of antihubs in data set

hubs: int

Indices of hubs

hub_occurrence: float

Proportion of k-nearest neighbor slots occupied by hubs

groupie_ratio: float

Proportion of objects with the largest hub in their neighborhood

k_occurrence: ndarray

Reverse neighbor count for each object

k_neighbors: ndarray

Indices to k-nearest neighbors for each object
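
Several of the attributes above summarize how unevenly the k-occurrence is distributed. As a rough illustration, the sketch below computes the skewness, Gini index, and Robin Hood index from a list of k-occurrence counts using their textbook definitions. The function names and the toy counts are assumptions, and the library's own implementation may differ in details such as bias correction or memory handling.

```python
def skewness(values):
    """Sample skewness: third central moment over variance ** 1.5 (no bias correction)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5  # undefined (division by zero) if all values are equal


def gini_index(values):
    """Mean absolute difference over all ordered pairs, divided by twice the mean."""
    n = len(values)
    total = sum(values)
    pairwise = sum(abs(a - b) for a in values for b in values)
    return pairwise / (2 * n * total)


def robinhood_index(values):
    """Half the summed absolute deviation from the mean, relative to the total."""
    mean = sum(values) / len(values)
    return sum(abs(v - mean) for v in values) / (2 * sum(values))


# Hypothetical k-occurrence counts for six objects with k = 2 (they sum to 6 * 2).
k_occurrence = [5, 2, 3, 1, 1, 0]

print(skewness(k_occurrence))         # positive: a few objects collect many occurrences
print(gini_index(k_occurrence))       # 4/9
print(robinhood_index(k_occurrence))  # 1/3

# A perfectly even k-occurrence yields zero for both inequality indices.
print(gini_index([2, 2, 2, 2]), robinhood_index([2, 2, 2, 2]))
```

Under these definitions, the Robin Hood index can be read as the share of neighbor slots that would have to be redistributed to give every object the same k-occurrence.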

__init__(self, k: 'int' = 10, return_value: 'str' = 'k_skewness', hub_size: 'float' = 2.0, metric='euclidean', store_k_neighbors: 'bool' = False, store_k_occurrence: 'bool' = False, algorithm: 'str' = 'auto', algorithm_params: 'dict' = None, hubness: 'str' = None, hubness_params: 'dict' = None, verbose: 'int' = 0, n_jobs: 'int' = 1, random_state=None, shuffle_equal: 'bool' = True)

Initialize self. See help(type(self)) for accurate signature.

Methods

__init__(self, k, return_value, hub_size[, …])

Initialize self.

fit(self, X[, y])

Fit indexed objects.

get_params(self[, deep])

Get parameters for this estimator.

score(self, X[, y])

Estimate hubness in a data set.

set_params(self, **params)

Set the parameters of this estimator.

fit(self, X, y=None) → 'Hubness'

Fit indexed objects.

Parameters
X: {array-like, sparse matrix}, shape (n_samples, n_features) or (n_query, n_indexed) if metric == ‘precomputed’

Training data vectors or distance matrix, if metric == ‘precomputed’.

y: ignored
Returns
self:

Fitted instance of Hubness

get_params(self, deep=True)

Get parameters for this estimator.

Parameters
deep: boolean, optional

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns
params: mapping of string to any

Parameter names mapped to their values.

score(self, X: 'np.ndarray' = None, y=None, has_self_distances: 'bool' = False) → 'Union[float, dict]'

Estimate hubness in a data set.

Hubness is estimated from the distances between all query objects in X and the indexed objects passed to fit(). If X is None, all-against-all distances between the indexed objects are used. If self.metric == ‘precomputed’, X must be a distance matrix.

Parameters
X: ndarray, shape (n_query, n_features) or (n_query, n_indexed)

Array of query vectors, or distance matrix, if self.metric == ‘precomputed’

y: ignored
has_self_distances: bool, default = False

Whether a precomputed distance matrix contains self distances, which need to be excluded.

Returns
hubness_measure: float or dict

Return the hubness measure as indicated by return_value. Additional hubness indices are provided as attributes (e.g. robinhood_index). If return_value is ‘all’, a dict of all hubness measures is returned.
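
Why self distances matter for a precomputed matrix can be shown with a small plain-Python sketch. In a square distance matrix with zeros on the diagonal, every object is trivially its own nearest neighbor unless the diagonal entry is skipped. The helper below and the toy matrix are hypothetical; how skhubness performs the exclusion internally is not specified here.

```python
def k_neighbors_from_distances(distances, k, has_self_distances=False):
    """Return the indices of the k smallest distances in each row.

    If has_self_distances is True, row i is assumed to hold the distance of
    object i to itself at column i, and that entry is skipped.
    """
    neighbors = []
    for i, row in enumerate(distances):
        candidates = [
            j for j in range(len(row))
            if not (has_self_distances and j == i)
        ]
        # list.sort() is stable, so tied distances keep their column order.
        candidates.sort(key=lambda j: row[j])
        neighbors.append(candidates[:k])
    return neighbors


# Hypothetical 4 x 4 distance matrix with self distances (zeros) on the diagonal.
D = [
    [0.0, 1.0, 4.0, 5.0],
    [1.0, 0.0, 2.0, 6.0],
    [4.0, 2.0, 0.0, 3.0],
    [5.0, 6.0, 3.0, 0.0],
]

with_self = k_neighbors_from_distances(D, k=1, has_self_distances=False)
without_self = k_neighbors_from_distances(D, k=1, has_self_distances=True)

print(with_self)     # [[0], [1], [2], [3]]: each object matches itself
print(without_self)  # [[1], [0], [1], [2]]: self matches are excluded
```

If the self matches were kept, every object would receive exactly one occurrence and the k-occurrence would look perfectly uniform, hiding any hubness in the data.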

set_params(self, **params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Returns
self
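
The <component>__<parameter> convention can be illustrated by splitting each key at its first double underscore. This is a minimal sketch of the naming scheme only, not the estimator's actual set_params code; the function name and the parameter names in the example are hypothetical.

```python
def split_nested_params(params):
    """Separate top-level parameters from <component>__<parameter> entries."""
    own, nested = {}, {}
    for key, value in params.items():
        # Split at the first "__" only; deeper levels stay in sub_key.
        name, delimiter, sub_key = key.partition("__")
        if delimiter:
            nested.setdefault(name, {})[sub_key] = value
        else:
            own[name] = value
    return own, nested


# Hypothetical parameter names, for illustration only.
own, nested = split_nested_params(
    {"k": 5, "step__n_jobs": 2, "step__inner__verbose": 1}
)

print(own)     # {'k': 5}
print(nested)  # {'step': {'n_jobs': 2, 'inner__verbose': 1}}
```

Each nested group would then be forwarded to the component of that name, which can apply the same split again to reach deeper levels.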