cnnclustering - A Python module for common-nearest-neighbour clustering¶
cluster¶
The functionality of this module is primarily exposed and bundled by the
cnnclustering.cluster.Clustering
class. For hierarchical clusterings
cnnclustering.cluster.ClusteringChild
is used, too.
-
class
cnnclustering.cluster.
Clustering
(input_data=None, neighbours_getter=None, neighbours=None, neighbour_neighbours=None, metric=None, similarity_checker=None, queue=None, fitter=None, predictor=None, labels=None, alias: unicode = 'root', parent=None)¶ Represents a clustering endeavour
A clustering object is made by composition of all necessary parts to carry out a clustering of input data points.
Note
A clustering instance may also be created using the convenience function
cnnclustering.cluster.prepare_clustering()
- Parameters
input_data – Any object implementing the input data interface. Represents the data points to be clustered.
neighbours_getter – Any object implementing the neighbours getter interface. Controls how neighbours are retrieved/calculated from input data.
neighbours – Any object implementing the neighbours interface. Represents neighbours found by the neighbours_getter.
neighbour_neighbours – Same as neighbours but used for the neighbours of the neighbours.
metric – Any object implementing the metric interface. Can be used by neighbours_getter to retrieve/calculate neighbours from input data.
similarity_checker – Any object implementing the similarity checker interface. Evaluates whether two points in input_data are part of the same cluster based on their neighbours.
queue – Any object implementing the queue interface. May be used during the clustering procedure.
fitter – Any object implementing the fitter interface. Executes the clustering procedure.
predictor – Any object implementing the predictor interface. Translates a clustering result to another
cnnclustering.cluster.Clustering
object with different input_data.
labels – An instance of
cnnclustering._types.Labels
holding cluster label assignments for points in input_data.
alias – A descriptive string identifier associated with this clustering.
parent – If not None, an instance of
cnnclustering.cluster.Clustering
of which this clustering is a child.
-
input_data
¶ A representation of the input data, typically a (list of) NumPy array(s). Shorthand for self._input_data.data.
-
hierarchy_level
¶ The level of this clustering in the hierarchical tree of clusterings (0 for the root instance).
-
labels
¶ An instance of
cnnclustering._types.Labels
holding cluster label assignments for points in input_data.
-
children
¶ A dictionary with child cluster labels as keys and
cnnclustering.cluster.Clustering
instances as values.
-
summary
¶ An instance of
cnnclustering.cluster.Summary
collecting clustering results.
-
property
children
¶
-
evaluate
(self, ax=None, clusters: Optional[Container[int]] = None, original: bool = False, unicode plot_style: str = u'dots', parts: Optional[Tuple[Optional[int]]] = None, points: Optional[Tuple[Optional[int]]] = None, dim: Optional[Tuple[int, int]] = None, mask: Optional[Sequence[Union[bool, int]]] = None, ax_props: Optional[dict] = None, annotate: bool = True, unicode annotate_pos: str = u'mean', annotate_props: Optional[dict] = None, plot_props: Optional[dict] = None, plot_noise_props: Optional[dict] = None, hist_props: Optional[dict] = None, free_energy: bool = True)¶ Returns a 2D plot of an original data set or a cluster result
- Parameters
- ax:
The Axes instance to which to add the plot. If None, a new Figure with Axes will be created.
- clusters:
Cluster numbers to include in the plot. If None, consider all.
- original:
Plot the original data instead of a cluster result. Overrides clusters. Considered True if no cluster result is present.
- plot_style:
The kind of plotting method to use.
“dots”,
ax.plot()
“scatter”,
ax.scatter()
“contour”,
ax.contour()
“contourf”,
ax.contourf()
- parts:
Use a slice (start, stop, stride) on the data parts before plotting. Will be applied before a slice on points.
- points:
Use a slice (start, stop, stride) on the data points before plotting.
- dim:
Use these two dimensions for plotting. If None, uses (0, 1).
- mask:
Sequence of boolean or integer values used for optional fancy indexing on the point data array. Note that this is applied after regular slicing (e.g. via points) and requires a copy of the indexed data (may be slow and memory intensive for big data sets).
- annotate:
If there is a cluster result, plot the cluster numbers. Uses annotate_pos to determine the position of the annotations.
- annotate_pos:
Where to put the cluster number annotation. Can be one of:
“mean”, Use the cluster mean
“random”, Use a random point of the cluster
Alternatively a list of x, y positions can be passed to set a specific point for each cluster (Not yet implemented)
- annotate_props:
Dictionary of keyword arguments passed to
ax.annotate()
.
- ax_props:
Dictionary of ax properties to apply after plotting via
ax.set(**ax_props)
. If None, uses defaults that can also be defined in the configuration file (not yet implemented).
- plot_props:
Dictionary of keyword arguments passed to various functions (
plot.plot_dots()
etc.) with different meaning to format cluster plotting. If None, uses defaults that can also be defined in the configuration file (not yet implemented).
- plot_noise_props:
Like plot_props but for formatting noise point plotting.
- hist_props:
Dictionary of keyword arguments passed to functions that involve the computing of a histogram via numpy.histogram2d.
- free_energy:
If True, converts computed histograms to pseudo free energy surfaces.
- Returns
Figure, Axes and a list of plotted elements
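The order in which parts, points and mask are applied can be sketched with plain Python lists. This is a minimal illustration, not the library's implementation: the helper name select_points is hypothetical, the data is assumed to be a list of parts (each a list of points), and applying the points slice within each part is an assumption as well.

```python
from typing import List, Optional, Sequence, Tuple, Union

Point = Tuple[float, float]


def select_points(
    data: List[List[Point]],
    parts: Optional[Tuple[Optional[int], ...]] = None,
    points: Optional[Tuple[Optional[int], ...]] = None,
    mask: Optional[Sequence[Union[bool, int]]] = None,
) -> List[Point]:
    """Hypothetical helper mimicking the documented selection order."""
    # 1. Slice the data parts with (start, stop, stride).
    part_slice = slice(*parts) if parts is not None else slice(None)
    # 2. Slice the points (assumed here: within each selected part).
    point_slice = slice(*points) if points is not None else slice(None)
    selected = [p for part in data[part_slice] for p in part[point_slice]]
    # 3. Apply the mask last; this always builds a new list (a copy).
    if mask is None:
        return selected
    if all(isinstance(m, bool) for m in mask):
        # Boolean mask: keep points whose flag is True.
        return [p for p, keep in zip(selected, mask) if keep]
    # Integer mask: pick points by position.
    return [selected[i] for i in mask]


data = [
    [(0, 0), (1, 1), (2, 2)],
    [(3, 3), (4, 4), (5, 5)],
    [(6, 6), (7, 7), (8, 8)],
]
# Every second part, then the first two points of each selected part.
subset = select_points(data, parts=(None, None, 2), points=(0, 2, None))
```

Because the mask is applied to the already sliced points, its length (for boolean masks) or its indices (for integer masks) refer to the sliced selection, not to the full data set.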
-
fit
(self, double radius_cutoff: float, cnn_cutoff: int, member_cutoff: int = None, max_clusters: int = None, cnn_offset: int = None, sort_by_size: bool = True, info: bool = True, record: bool = True, record_time: bool = True, v: bool = True, purge: bool = False) → None¶ Execute clustering procedure
- Parameters
radius_cutoff – Neighbour search radius.
cnn_cutoff – Similarity criterion.
member_cutoff – Valid clusters need to have at least this many members. Passed on to
Labels.sort_by_size()
if sort_by_size is True. Has no effect otherwise, and valid clusters have at least one member.
max_clusters – Keep only the largest max_clusters clusters. Passed on to
Labels.sort_by_size()
if sort_by_size is True. Has no effect otherwise.
cnn_offset – Exists for compatibility reasons and is subtracted from cnn_cutoff. If cnn_offset = 0, two points need to share at least cnn_cutoff neighbours to be part of the same cluster, without counting either of the two points. In former versions of the clustering, self-counting was included, so cnn_cutoff = 2 in those versions is equivalent to cnn_cutoff = 0 in this version.
sort_by_size – Whether to sort (and trim) the created
Labels
instance. See also
Labels.sort_by_size()
.
info – Whether to modify
Labels.meta
information for this clustering.
record – Whether to create a
Record
instance for this clustering, which is appended to the
Summary
.
record_time – Whether to time clustering execution.
v – Be chatty.
purge – If True, force re-initialisation of cluster label assignments.
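How radius_cutoff, cnn_cutoff and cnn_offset interact can be illustrated with a brute-force check in plain Python. This is a sketch under stated assumptions, not the library's fitter: the function names neighbours_of and are_similar are hypothetical, a Euclidean metric is assumed, the non-strict radius comparison is an assumption, and cnn_offset = None is treated as 0 for the example.

```python
import math
from typing import Optional, Sequence, Set, Tuple

Point = Tuple[float, float]


def neighbours_of(index: int, points: Sequence[Point], radius_cutoff: float) -> Set[int]:
    """Indices of all points within radius_cutoff of points[index], excluding itself."""
    return {
        j
        for j, other in enumerate(points)
        if j != index and math.dist(points[index], other) <= radius_cutoff
    }


def are_similar(
    a: int,
    b: int,
    points: Sequence[Point],
    radius_cutoff: float,
    cnn_cutoff: int,
    cnn_offset: Optional[int] = None,
) -> bool:
    """Hypothetical check: do points a and b share enough common neighbours?"""
    offset = 0 if cnn_offset is None else cnn_offset  # assumption for this sketch
    shared = neighbours_of(a, points, radius_cutoff) & neighbours_of(b, points, radius_cutoff)
    # Do not count the two points themselves (no self-counting).
    shared -= {a, b}
    return len(shared) >= cnn_cutoff - offset


points = [(0.0, 0.0), (1.0, 0.0), (0.5, 0.5), (0.5, -0.5), (5.0, 5.0)]
# Points 0 and 1 share the two neighbours with indices 2 and 3.
similar = are_similar(0, 1, points, radius_cutoff=1.0, cnn_cutoff=2)
```

A larger cnn_cutoff makes the criterion stricter, while a positive cnn_offset relaxes it by the same amount.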
-
fit_hierarchical
(self, radius_cutoff: Union[float, List[float]], cnn_cutoff: Union[int, List[int]], member_cutoff: int = None, max_clusters: int = None, cnn_offset: int = None)¶ Execute hierarchical clustering procedure
-
get_child
(self, label)¶
-
property
hierarchy_level
¶
-
property
input_data
¶
-
isolate
(self, bool purge: bool = True, bool isolate_input_data: bool = True)¶ Create child clusterings from cluster labels
- Parameters
purge – If True, creates a new mapping for the children of this clustering.
isolate_input_data – If True, attaches a subset of the input data of this clustering to the child.
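Conceptually, isolating children amounts to grouping the input points by their cluster label, which yields a mapping of child labels to data subsets. The following is a simplified model in plain Python, not the library's implementation: the function name group_by_label is hypothetical, and keeping noise (label 0) as its own group is an assumption.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float]


def group_by_label(
    points: Sequence[Point], labels: Sequence[int]
) -> Dict[int, List[Point]]:
    """Hypothetical helper: map each cluster label to the points carrying it."""
    if len(points) != len(labels):
        raise ValueError("points and labels must have the same length")
    groups: Dict[int, List[Point]] = defaultdict(list)
    for point, label in zip(points, labels):
        groups[label].append(point)
    return dict(groups)


points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (9.0, 9.0)]
labels = [1, 1, 2, 0]  # 0 marks noise
children_data = group_by_label(points, labels)
```

Each subset could then serve as the input data of one child clustering, keyed by the label of the cluster it came from.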
-
property
labels
¶
-
pie
(self, ax=None, pie_props=None)¶
-
predict
(self, other: Type[u'Clustering'], double radius_cutoff: float, cnn_cutoff: int, clusters: Optional[Sequence[int]] = None, cnn_offset: Optional[int] = None, info: bool = True, record: bool = True, record_time: bool = True, v: bool = True, purge: bool = False)¶ Execute prediction procedure
- Parameters
other –
cnnclustering.cluster.Clustering
instance for which cluster labels should be predicted.
radius_cutoff – Neighbour search radius.
cnn_cutoff – Similarity criterion.
clusters – Sequence of cluster labels that should be included in the prediction.
cnn_offset – Exists for compatibility reasons and is subtracted from cnn_cutoff. If cnn_offset = 0, two points need to share at least cnn_cutoff neighbours to be part of the same cluster, without counting either of the two points. In former versions of the clustering, self-counting was included, so cnn_cutoff = 2 in those versions is equivalent to cnn_cutoff = 0 in this version.
purge – If True, force re-initialisation of predicted cluster labels.
-
reel
(self, depth: Optional[int] = None) → None¶ Wrap up label assignments of lower hierarchy levels
- Parameters
depth – How many lower levels to consider. If None, consider all.
-
summarize
(self, ax=None, unicode quantity: str = u'execution_time', treat_nan: Optional[Any] = None, convert: Optional[Any] = None, ax_props: Optional[dict] = None, contour_props: Optional[dict] = None, unicode plot_style: str = u'contourf')¶ Generate a 2D plot of record values
Record values (“time”, “clusters”, “largest”, “noise”) are plotted against cluster parameters (radius cutoff r and cnn cutoff c).
- Parameters
ax – Matplotlib Axes to plot on. If None, a new Figure with Axes will be created.
quantity –
Record value to visualise:
”time”
”clusters”
”largest”
”noise”
treat_nan – If not None, use this value to pad nan-values.
ax_props – Used to style ax.
contour_props – Passed on to contour.
-
property
summary
¶
-
to_nx_DiGraph
(self, ignore=None)¶ Convert cluster hierarchy to networkx DiGraph
- Keyword Arguments
ignore – A set of labels not to include in the graph. Use for example to exclude noise (label 0).
-
tree
(self, ax=None, ignore=None, pos_props=None, draw_props=None)¶
-
trim_shrinking_leafs
(self)¶
-
trim_trivial_leafs
(self)¶ Scan cluster hierarchy for removable nodes
If the cluster label assignments on a clustering are all zero (noise), the clustering is considered trivial. In this case, the labels and children are reset to None.
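The rule described above can be sketched on a minimal stand-in for the cluster hierarchy. The Node class and the trim_trivial function are hypothetical and only model the labels and children attributes; treating a node without any labels as non-trivial is an assumption of this sketch.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Node:
    """Hypothetical stand-in for a clustering in the hierarchy."""

    labels: Optional[List[int]] = None
    children: Optional[Dict[int, "Node"]] = None


def trim_trivial(node: Node) -> None:
    """Reset labels and children on every node whose labels are all noise (0)."""
    if node.labels and all(label == 0 for label in node.labels):
        # Trivial clustering: nothing but noise was found.
        node.labels = None
        node.children = None
        return
    # Otherwise descend into the children, if there are any.
    for child in (node.children or {}).values():
        trim_trivial(child)


root = Node(
    labels=[1, 1, 2, 0],
    children={
        1: Node(labels=[0, 0], children={0: Node()}),  # only noise: trivial
        2: Node(labels=[1]),                           # has a cluster: kept
    },
)
trim_trivial(root)
```

After the call, the child for label 1 has neither labels nor children, while the root and the child for label 2 are left unchanged.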