skhubness.neighbors.NNG

class skhubness.neighbors.NNG(n_candidates: int = 5, metric: str = 'euclidean', index_dir: str = 'auto', optimize: bool = False, edge_size_for_creation: int = 80, edge_size_for_search: int = 40, num_incoming: int = -1, num_outgoing: int = -1, epsilon: float = 0.1, n_jobs: int = 1, verbose: int = 0)

Wrapper for ngtpy and NNG variants.

By default, the graph is an ANNG. If the optimize parameter is set to True, the graph is additionally optimized to obtain an ONNG.

Parameters
n_candidates: int, default = 5

Number of neighbors to retrieve

metric: str, default = ‘euclidean’

Distance metric, allowed are ‘manhattan’, ‘L1’, ‘euclidean’, ‘L2’, ‘minkowski’, ‘Angle’, ‘Normalized Angle’, ‘Hamming’, ‘Jaccard’, ‘Cosine’ or ‘Normalized Cosine’.

index_dir: str, default = ‘auto’

Store the index in the given directory. If None, keep the index in main memory (the index is then NOT pickleable). If ‘auto’, create a temporary directory for the index, preferably in /dev/shm on Linux. Any other string is interpreted as the directory in which to store the index. Note: The directory and the index will NOT be deleted automatically.

optimize: bool, default = False

Use the ONNG method by optimizing the ANNG graph. Index creation may take a long time.

edge_size_for_creation: int, default = 80

Increasing the ANNG edge size used during index creation improves retrieval accuracy at the cost of more time.

edge_size_for_search: int, default = 40

Increasing the ANNG edge size used during search improves retrieval accuracy at the cost of more time.

epsilon: float, default = 0.1

Trade-off in ANNG between higher accuracy (larger epsilon) and shorter query time (smaller epsilon)

num_incoming: int, default = -1

Number of incoming edges in ONNG graph

num_outgoing: int, default = -1

Number of outgoing edges in ONNG graph

n_jobs: int, default = 1

Number of parallel jobs

verbose: int, default = 0

Verbosity level. If verbose > 0, show tqdm progress bar on indexing and querying.

Notes

NNG stores the index to a directory specified in index_dir. The index is persistent, and will NOT be deleted automatically. It is the user’s responsibility to take care of deletion, when required.
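Because the index outlives the estimator, one way to avoid leftover files is to create the directory yourself and remove it when you are done. The following is a minimal sketch using only the standard library. The helper name temporary_index_dir is hypothetical and not part of skhubness, and the commented lines assume that a directory passed as index_dir is used as described above.

```python
import shutil
import tempfile
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def temporary_index_dir(prefix="nng_index_"):
    """Yield a fresh directory and delete it (with all contents) on exit.

    Hypothetical helper, not part of skhubness.
    """
    path = Path(tempfile.mkdtemp(prefix=prefix))
    try:
        yield path
    finally:
        # Remove the directory even if indexing or querying raised.
        shutil.rmtree(path, ignore_errors=True)


with temporary_index_dir() as index_dir:
    # Assumed usage (requires skhubness and ngtpy to be installed):
    #   nng = NNG(index_dir=str(index_dir)).fit(X)
    #   dist, ind = nng.kneighbors(X_query)
    created = index_dir.exists()  # the directory exists while in use

removed = not index_dir.exists()  # and is gone afterwards
```

All queries have to happen inside the with block, since the on-disk index is removed as soon as the block exits.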

Attributes
valid_metrics:

List of valid distance metrics/measures

__init__(self, n_candidates: 'int' = 5, metric: 'str' = 'euclidean', index_dir: 'str' = 'auto', optimize: 'bool' = False, edge_size_for_creation: 'int' = 80, edge_size_for_search: 'int' = 40, num_incoming: 'int' = -1, num_outgoing: 'int' = -1, epsilon: 'float' = 0.1, n_jobs: 'int' = 1, verbose: 'int' = 0)

Initialize self. See help(type(self)) for accurate signature.

Methods

__init__(self, n_candidates, metric, …)

Initialize self.

fit(self, X[, y])

Build the ngtpy.Index and insert data from X.

get_params(self[, deep])

Get parameters for this estimator.

kneighbors(self[, X, n_candidates, …])

Retrieve k nearest neighbors.

set_params(self, **params)

Set the parameters of this estimator.

Attributes

internal_distance_type

valid_metrics

fit(self, X, y=None) → 'NNG'

Build the ngtpy.Index and insert data from X.

Parameters
X: np.array

Data to be indexed

y: any

Ignored

Returns
self: NNG

An instance of NNG with a built index
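As a rough sketch of how construction and fit fit together, the snippet below collects keyword arguments from the class signature and builds the index in a small wrapper. The names ANNG_PARAMS, ONNG_PARAMS and build_index are hypothetical. The import is done lazily inside the function because it assumes skhubness and ngtpy are installed.

```python
# Keyword arguments taken from the class signature documented above.
ANNG_PARAMS = dict(
    n_candidates=10,
    metric="euclidean",
    index_dir="auto",           # temp dir; NOT deleted automatically
    edge_size_for_creation=80,
    edge_size_for_search=40,
    epsilon=0.1,
)

# Setting optimize=True requests an ONNG instead of the default ANNG.
ONNG_PARAMS = dict(ANNG_PARAMS, optimize=True)


def build_index(X, params=ANNG_PARAMS):
    """Fit an NNG index on X (hypothetical helper; needs skhubness and ngtpy)."""
    from skhubness.neighbors import NNG  # imported lazily, see note above

    # fit returns the estimator itself, so the call can be chained.
    return NNG(**params).fit(X)
```

With the dependencies available, build_index(X) returns the fitted estimator, and build_index(X, ONNG_PARAMS) would request the optimized graph instead.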

get_params(self, deep=True)

Get parameters for this estimator.

Parameters
deep: boolean, optional

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns
params: mapping of string to any

Parameter names mapped to their values.

kneighbors(self, X=None, n_candidates=None, return_distance=True) → 'Union[Tuple[np.array, np.array], np.array]'

Retrieve k nearest neighbors.

Parameters
X: np.array or None, optional, default = None

Query objects. If None, search among the indexed objects.

n_candidates: int or None, optional, default = None

Number of neighbors to retrieve. If None, use the value passed during construction.

return_distance: bool, default = True

If True, return both the distances and the indices of the neighbors. Otherwise, return only the indices.
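Since the graph search is approximate, it can help to compare its output against an exact result on a small sample. The sketch below is a pure-Python brute-force baseline, not the algorithm NNG uses. The helpers exact_kneighbors and recall are hypothetical, and the (distances, indices) return order simply mirrors the description above.

```python
import math


def exact_kneighbors(data, queries, k):
    """Exact Euclidean k-NN by exhaustive search.

    Returns (distances, indices), each a list with one row of length k
    per query, sorted by increasing distance.
    """
    all_dist, all_ind = [], []
    for q in queries:
        # Pair each distance with its index so that ties break by index.
        ranked = sorted((math.dist(q, p), i) for i, p in enumerate(data))[:k]
        all_dist.append([d for d, _ in ranked])
        all_ind.append([i for _, i in ranked])
    return all_dist, all_ind


def recall(approx_ind, exact_ind):
    """Fraction of exact neighbors that the approximate result recovered."""
    hits = sum(len(set(a) & set(e)) for a, e in zip(approx_ind, exact_ind))
    total = sum(len(e) for e in exact_ind)
    return hits / total


data = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (5.0, 5.0)]
queries = [(0.1, 0.0), (4.0, 4.0)]
dist, ind = exact_kneighbors(data, queries, k=2)
```

Passing the index array returned by kneighbors as approx_ind gives a quick recall estimate, which is one way to judge whether epsilon or the edge sizes need adjusting.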

set_params(self, **params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Returns
self
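The <component>__<parameter> naming convention can be illustrated without any estimator. The sketch below only shows how such keys split into a component name and a parameter name, and does not reproduce the estimator's own implementation. The helper group_nested_params is hypothetical, and the component name nng is made up for the example.

```python
def group_nested_params(params):
    """Split keys of the form '<component>__<parameter>'.

    Keys without '__' are collected under None (the estimator's own
    parameters).
    """
    grouped = {}
    for key, value in params.items():
        # Split on the first double underscore only, so deeper nesting
        # ('a__b__c') stays intact in the remainder.
        component, sep, sub_key = key.partition("__")
        if sep:
            grouped.setdefault(component, {})[sub_key] = value
        else:
            grouped.setdefault(None, {})[key] = value
    return grouped


grouped = group_nested_params(
    {"nng__epsilon": 0.2, "nng__optimize": True, "verbose": 1}
)
```

Here the two nng__ keys would be routed to a nested component named nng, while verbose stays with the outer object.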